INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer. The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book. Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

UMI
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700  800/521-0600

ADAPTIVE EXECUTION: IMPROVING PERFORMANCE THROUGH THE RUNTIME ADAPTATION OF PERFORMANCE PARAMETERS

by
Daeyeon Park

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

May 1996

Copyright 1996 Daeyeon Park

UMI Number: 9636738
UMI Microform 9636738
Copyright 1996, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
UMI, 300 North Zeeb Road, Ann Arbor, MI 48103

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This thesis, written by Daeyeon Park under the direction of his Thesis Committee, and approved by all its members, has been presented to and accepted by the Dean of The Graduate School, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY.

April 19, 1996
THESIS COMMITTEE

Acknowledgements

First and foremost I wish to express my deepest gratitude from the bottom of my heart to my advisor, Rafael Saavedra, for his guidance and support throughout this research. "He's everywhere I want to be." His tutelage, insight, and intuition have been invaluable. I also wish to thank my qualifying exam and dissertation committee members, Ellis Horowitz, Kai Hwang, Prasanna Kumar, Peter Danzig, and Timothy Pinkston for many helpful suggestions, criticism, and advice on the thesis. I also thank my good friends and colleagues who made my stay at USC both enjoyable and enlightening. First, I would like to thank my group members, Weihua Mao, Jacqueline Chame, and Sungdo Moon, for making my working environment friendly and cooperative. I thank Shih-Hao Li and Dongho Kim for allowing me to bother them in difficult situations.
Thanks also to our tennis club members for making me healthy and happy by sometimes letting me win: Sukhan Lee, Jongdae Jung, Haksung Lee, Dongin Chang, Yeonghan Chun, Yeongwoo Choi, Chunsik Yi, Bonghan Cho, and Hoh In. I also would like to thank my old friends for their many interesting discussions and encouragement: Sungbok Kim, Soowon Lee, Sanghoe Koo, Jongsuk Ahn, Kyoungmu Lee, and Juntae Kim. I thank my best friend, Sungkwon Noh, for his support, encouragement, and friendship throughout all these years. I also would like to give my sincere gratitude to my most respected senior, Heesoon Park, for his guidance, support, and encouragement. Finally, I wish to thank my mother, brothers, and sisters for their love, support, and understanding during the long course leading to this dissertation.

Contents

Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Motivation
  1.2 Overview of Dissertation
    1.2.1 Adaptive Program Execution
    1.2.2 Adaptive Control Execution
  1.3 Contributions
  1.4 Organization of Dissertation

2 Adaptive Execution
  2.1 Issues for Adaptive Execution
    2.1.1 Stability and Convergence
    2.1.2 Local vs. Global Adaptive Optimizations
  2.2 Adaptive Program Execution
    2.2.1 Framework
    2.2.2 Classification of Program Execution
      2.2.2.1 Illustration of Program Execution
    2.2.3 Interaction between Agents and Monitors
      2.2.3.1 Polling
      2.2.3.2 Performance-Driven Interrupts
      2.2.3.3 Communication and Performance Processor
    2.2.4 Hardware Support
    2.2.5 Compiler Support
    2.2.6 Overhead of Adaptive Program Execution
    2.2.7 Related Work
      2.2.7.1 Profile-Based Optimization
      2.2.7.2 Language Constructs with User Participation
      2.2.7.3 Combination of Cost Model with Runtime Test
  2.3 Adaptive Control Execution
    2.3.1 Related Work
      2.3.1.1 Adaptive Cache Coherence Protocols
  2.4 Chapter Summary

3 Adaptive Prefetching
  3.1 Static and Dynamic Prefetching
  3.2 The Effects of Prefetch Cancellation
  3.3 Adapting the Prefetch Distance
  3.4 Hardware Support for Adaptive Prefetching
  3.5 Algorithm for Adapting Prefetch Distance
  3.6 Simulation Methodology
    3.6.1 Architectural Assumptions
    3.6.2 Simulation Environment
    3.6.3 Applications
  3.7 Experimental Results and Discussion
    3.7.1 Controlled Experiments
      3.7.1.1 Results Using a Single Control Node
      3.7.1.2 Results Using Multiple Control Nodes
    3.7.2 Effectiveness of Adaptive Prefetching in Complete Applications
    3.7.3 Effects of Architectural Variations
      3.7.3.1 Cache Size
      3.7.3.2 Cache Associativity
      3.7.3.3 Machine Size
      3.7.3.4 Network Latency
      3.7.3.5 Memory Consistency
    3.7.4 Fixed and Adaptive Polling
  3.8 Chapter Summary

4 Adaptive Granularity
  4.1 Problem Statement
    4.1.1 Fixed Granularity
    4.1.2 Arbitrarily Variable Granularity
  4.2 Overview of Adaptive Granularity
  4.3 Communication Alternatives and Granularity
    4.3.1 Message Passing Machines and Hardware DSM
    4.3.2 Software DSM
    4.3.3 Integrated DSM
    4.3.4 Very Large Scalable DSM
  4.4 Adaptive Granularity
    4.4.1 Design Choices
      4.4.1.1 Static or adaptive granularity
      4.4.1.2 Destination of bulk data
      4.4.1.3 Determination on transaction type and granularity
    4.4.2 Variable Grain Memory Replication
    4.4.3 Directory and NCPT handling
    4.4.4 Adaptive Granularity Protocol
  4.5 Simulation Methodology
    4.5.1 Architectural Assumptions
    4.5.2 Simulation Environment
    4.5.3 Benchmark Applications
  4.6 Experimental Results and Discussion
  4.7 Effect of Architectural Variations
    4.7.1 Cache Size
    4.7.2 Line Size
    4.7.3 Number of Processors
    4.7.4 Network Latency
  4.8 Related Work
  4.9 Chapter Summary

5 Trojan Simulator
  5.1 Limitations of Current Simulators
  5.2 Overview of Trojan Simulator
  5.3 Background
    5.3.1 Target Machines
    5.3.2 Execution-Driven Simulation
  5.4 Related Work
  5.5 Simulation of Two Shared Memory Models
    5.5.1 Process Model and Thread Model
    5.5.2 Simulation of the Two Models
  5.6 Virtual Memory Simulation
    5.6.1 Virtual Memory Environment
    5.6.2 Virtual Memory Simulation
  5.7 Additional Functionality in Trojan
    5.7.1 Application Annotations
    5.7.2 Extra Functionality
  5.8 Performance
    5.8.1 Performance Without Virtual Memory
    5.8.2 Performance with Virtual Memory
  5.9 Chapter Summary

6 Conclusions
  6.1 Summary
  6.2 Future Work

Reference List

List of Tables

3.1 Processor, cache, and interconnect parameters representing the baseline architecture
3.2 Application Characteristics
3.3 Performance comparison between static and adaptive prefetching
4.1 Communication model and granularity for each parallel system
4.2 Message types used to communicate between local, home, and remote node
4.3 State transition table for the adaptive granularity protocol
4.4 Simulation parameters
4.5 Application Characteristics
4.6 Decomposition of cache misses. R: read, W: write
5.1 Examples of Execution-Driven Simulators and Programming Model
5.2 Application Characteristics
5.3 Performance of Trojan without Virtual Memory
5.4 Performance of Trojan with Virtual Memory

List of Figures

1.1 General structure of a scalable multiprocessor
2.1 Structure of adaptive execution
2.2 Static, dynamic, and adaptive program execution
3.1 Two schemes for software pipelining based on the runtime value of argument n
3.2 The effect of cache mapping conflicts on the effectiveness of prefetching
3.3 Static and adaptive software pipelining
3.4 Remote latency pattern induced by interference nodes
3.5 Average stall time per prefetch as a function of prefetching policy
3.6 Effect of static and adaptive policies on the execution time
3.7 Effect of P_cancel-prefs and P_late-prefs
3.8 Results for multiple control nodes (4 control nodes and 60 interference nodes)
3.9 Results for multiple control nodes (8 control nodes and 56 interference nodes)
3.10 Results for multiple control nodes (16 control nodes and 48 interference nodes)
3.11 Results for multiple control nodes (32 control nodes and 32 interference nodes)
3.12 Overall Performance of Adaptive Prefetching
3.13 The effectiveness of adaptive prefetching for Jacobi
3.14 The effectiveness of adaptive prefetching for LU
3.15 The effectiveness of adaptive prefetching for MMul
3.16 The effectiveness of adaptive prefetching for Ocean
3.17 Cache Size Variation
3.18 Cache Associativity Variation
3.19 Number of Processors Variation
3.20 Network Latency Variation (N1 = default, N2 = doubled switch and wire delay)
3.21 Memory Consistency Variation
3.22 Effects of Adaptive Polling
4.1 Overview of adaptive granularity
4.2 Memory replication
4.3 Algorithm for directory search
4.4 State transition diagram for the adaptive granularity protocol
4.5 Simulation results for default machine parameters
4.6 Effect of cache size variation
4.7 Effect of cache line size variation
4.8 Effect of number of processors variation
4.9 Effect of network latency variation
5.1 NUMA Architecture
5.2 Compilation Steps of Execution-Driven Simulator
5.3 Two models of shared memory applications
5.4 Simulation of process model by a threads package using copy-on-write scheme
5.5 Virtual memory simulation
5.6 An Example of Code Augmentation

Abstract

On parallel machines, in which performance parameters change dynamically in complex and unpredictable ways, it is difficult for compilers to predict the optimal values of the parameters at compile time. Furthermore, these optimal values may change as the program executes. This dissertation addresses this problem by proposing adaptive execution, which makes the program or control execution adapt in response to changes in machine conditions, and by applying it to software prefetching (for program execution) and to the granularity of sharing (for control execution).

Adaptive program execution makes it possible for programs to adapt themselves through the collaboration of the hardware and the compiler. The compiler relies on the hardware to provide the relevant information about the execution of programs and generates parameterized versions of the program. At runtime, based on the performance information the hardware provides, the relevant parameters are instantiated and adjusted at a certain frequency to reflect run-time variations. To explore the feasibility and effectiveness of adaptive program execution, we have developed adaptive algorithms for software prefetching (adaptive prefetching). The main problem in traditional software prefetching is that a fixed prefetch distance is used during the whole execution, even though machine conditions, which affect the effectiveness of prefetching, change during the lifetime of the program. In adaptive prefetching, the prefetch distance is adjusted dynamically based on the actual remote memory latency and the amount of prefetch cancellation, in an attempt to hide as much remote latency as possible without incurring too much cache interference. Simulation results show that on most programs adaptive prefetching is capable of improving performance over static prefetching by 10% to 45%.

For adaptive control execution, we applied the adaptive scheme to the granularity of sharing (adaptive granularity) to show that adaptive execution can provide performance gains on the node controller as well. Providing a fixed-size granularity or an arbitrarily variable granularity to the user may either result in an inefficient use of resources or sacrifice the programmability of the shared memory model. Adaptive granularity is a communication scheme that effectively and transparently integrates bulk transfer into the shared memory paradigm, with a granularity that varies depending on the sharing behavior. Adaptive granularity provides a limited number of granularities, but ones efficient enough to achieve gains from bulk transfer, without sacrificing the programmability of the shared memory model or requiring any additional hardware. Simulation results show that adaptive granularity improves performance by up to 43% over a hardware implementation of DSM (e.g., DASH).

Chapter 1

Introduction

1.1 Motivation

In an attempt to achieve scalability and programmability, most modern parallel machines rely on a distributed shared memory (DSM) organization [3, 18, 26, 63], as shown in Figure 1.1.
It consists of N processing nodes, which are interconnected by a low-latency, high-bandwidth network such as a k-ary n-cube [35]. Each node consists of several components such as a processor, a cache, a portion of the shared memory, and a node controller. The node controller is implemented either in hardware or in software and processes messages from the processor and the network. The physical memory is distributed among the nodes, and cache coherence for shared memory is maintained using a directory-based protocol [18], where the directory is distributed along with the memory and contains information on which caches are sharing each memory block.

Figure 1.1: General structure of a scalable multiprocessor

Because these multiprocessors are usually composed of large numbers of high-performance processors, their peak performance is quite impressive. In practice, however, only a small fraction of the peak performance is achieved for many applications. One of the main reasons limiting performance is the long latency of remote memory operations. Various hardware mechanisms and software optimizations have been proposed to reduce or hide the effect of the memory latency [15, 28]. Some of these include (1) coherent caches [32], (2) prefetching [45], (3) multithreading [53], (4) relaxed memory consistency [22], and (5) memory replication [34]. Coherent caches try to avoid sending remote requests by allowing shared read-write data to be cached under a cache coherency protocol. Prefetching techniques attempt to hide the memory latency by bringing data close to the processor before it is needed. Multithreading allows a processor to hide latency by switching from one context to another when a high-latency operation is encountered. A relaxed memory consistency model hides latency by allowing buffering and pipelining of write memory references. Memory replication tolerates the remote memory latency by changing remote accesses caused by conflict or capacity cache misses into local accesses.

The effectiveness of some software-controlled (especially compiler-oriented) optimization techniques such as data prefetching is often limited by the ability of the program (or compiler) to predict the optimal value for the relevant performance parameter affecting the runtime behavior of a program. The problem of predicting a good value at compile time is inherently difficult for several reasons. First, some critical information may not be available at compile time; moreover, this information usually depends on the input data. Second, the dynamic behavior of the program makes compile-time prediction difficult. Third, some aspects of parallel machines such as network contention, synchronization, cache coherency, and communication delays make it almost impossible for the compiler to predict these effects at compile time. Finally, the optimal value of a parameter tends to vary as the program executes, and this variance can be quite large on some parallel machines.

These arguments illustrate why it is becoming more difficult for compilers and programmers to rely solely on static program analysis when making decisions about how programs should be executed. The realization of this has motivated the development of new optimizations based on exploiting dynamic program behavior.
For example, Cox et al. have developed cache coherency protocols that dynamically detect and adapt themselves to the presence of migratory misses [13]. Another example is the use by the operating system of hardware cache miss counters to detect an unusual amount of cache interference, which can then be eliminated by remapping memory pages to non-overlapping cache regions [8]. What these optimizations have in common is that both are based on using event counters and/or specialized state machines capable of recognizing simple access patterns to trigger run-time changes that affect the execution of a particular piece of code.

As shown in Figure 1.1, shared memory architectures consist of a main component (a processor) and subcomponents (e.g., a node controller) that help in increasing the processor utilization or control activities among the nodes. In this dissertation, we define program execution as the execution of the application code itself performed on the processor, and control execution as the work performed on the node controller. Optimization schemes for program execution attempt to increase the utilization of the processor directly by applying transformations such as loop optimization and software prefetching. In these optimizations, compilers play a key role in achieving the goal. The final goal of optimization schemes for control execution is also to increase the utilization of the processor, but they achieve it through the efficient implementation of the node controller; an example is a cache coherence protocol. In these optimizations, compilers play an auxiliary role, giving only a hint or some information needed for the optimization.

1.2 Overview of Dissertation

In this dissertation, we propose two adaptive schemes (one for program execution and the other for control execution), based on the notion that both program and control execution should be adapted to dynamic changes in program and machine conditions to guarantee performance on parallel machines. We call adaptive program and control execution together adaptive execution (AE). Adaptive execution makes it possible for programs to adapt themselves in response to changes in machine conditions. For adaptive program execution, we generalize it in a new framework and apply it to software prefetching to show its effectiveness. For adaptive control execution, we apply it to the granularity of sharing and show that we can also achieve gains from it. We argue that by using adaptive schemes a significant amount of improvement can be obtained on certain optimizations, specifically those that are highly sensitive to the behavior of hardware components such as caches, TLBs, and network switches.

1.2.1 Adaptive Program Execution

In adaptive program execution, the compiler relies on the hardware to provide the necessary performance information about the execution of the program. The compiler optimizes programs based on the assumption that this information will be present at runtime. Performance monitors are responsible for collecting performance data and making it available to applications and/or compilers.
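As a rough illustration (not taken from this dissertation), the sketch below shows the kind of counter interface such a hardware monitor might expose to the runtime system, assuming two memory-mapped event counters for late and cancelled prefetches; the addresses, register layout, and names are invented for illustration only.

    #include <stdint.h>

    /* Hypothetical memory-mapped event counters exported by a per-node
     * hardware performance monitor.  Addresses and layout are invented. */
    #define PMON_BASE        0xF0000000UL
    #define PMON_LATE_PREF   (*(volatile uint32_t *)(PMON_BASE + 0x00))
    #define PMON_CANCEL_PREF (*(volatile uint32_t *)(PMON_BASE + 0x04))

    /* Snapshot of the counters taken by the runtime agent. */
    typedef struct {
        uint32_t late_prefetches;      /* prefetches that arrived too late      */
        uint32_t cancelled_prefetches; /* prefetches killed by cache conflicts  */
    } pmon_sample_t;

    /* Read both counters so the agent can compare consecutive samples. */
    static inline pmon_sample_t pmon_read(void)
    {
        pmon_sample_t s;
        s.late_prefetches      = PMON_LATE_PREF;
        s.cancelled_prefetches = PMON_CANCEL_PREF;
        return s;
    }

The agents described in Chapter 2 would periodically take such samples and use the difference between consecutive samples to decide whether a parameter such as the prefetch distance needs to be adjusted.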
The steps involved in this process are: (1) the compiler generates several parameterized versions of the same algorithm; (2) at runtime, based on the performance information that the hardware provides, the undetermined parameters are instantiated and adjusted with a certain frequency to reflect run-time variations in machine conditions; and (3) the program executes with the new value of the parameter until it is re-adjusted again.

To show that adaptive program execution is feasible and effective, we have applied this technique to software prefetching. Software prefetching represents a good candidate for adaptive execution because its effectiveness strongly depends on the ability of the compiler/programmer to predict, for a code excerpt, the particular data to prefetch and the value to use for the prefetch distance. The optimal value of the prefetch distance varies depending on dynamic behavior such as the actual remote memory latency and the amount of prefetch cancellation. Because the actual latency experienced by prefetches is not constant, but depends on the run-time behavior of prefetching itself and of other machine components, like caches, the network interconnect, and memory modules, it is not always possible to minimize CPU stall time by using a fixed compile-time prefetch distance. The other important parameter affecting the optimal prefetch distance is prefetch cancellation. Prefetch cancellation has a significant effect on performance because the processor stalls for the whole duration of a miss when a prefetch is cancelled. Traditional static prefetching (where the prefetch distance is fixed) suffers from two problems. First, even if there is some amount of prefetch cancellation, the same distance is used during the whole execution. Second, the fixed prefetch distance may not be optimal because it is difficult to predict at compile time the actual value of the memory latency and the amount of cache interference caused by prefetching.

To deal with these problems, we have developed adaptive algorithms for software prefetching (adaptive prefetching) that use simple performance data collected by a hardware monitor. In adaptive prefetching, the prefetch distance is adjusted dynamically based on the actual remote memory latency and the amount of prefetch cancellation, in an attempt to hide as much remote latency as possible without incurring too much cache interference. We present simulation results that show that our adaptive algorithm is capable of hiding significantly more latency than other prefetching algorithms where the prefetch distance is constant. In some cases the reduction of stall time can be as large as 45%. Furthermore, the only hardware support required by the best adaptive scheme is a pair of counters: one measuring the number of late prefetches (the ones arriving after the processor has requested the data) and another one measuring the number of prefetches killed as a result of cache conflicts. The simplicity of this scheme underlines the potential performance benefits that can be obtained by adaptively exploiting run-time program and machine information.

1.2.2 Adaptive Control Execution

For adaptive control execution, we applied the adaptive scheme to the granularity of sharing (adaptive granularity) to show that adaptive execution can provide performance gains on the node controller as well. The granularity of sharing is one of the key components that affect performance in distributed shared memory (DSM) systems. Providing only one or two fixed-size granularities to the user may not result in an efficient use of resources. Providing an arbitrarily variable granularity increases hardware and/or software overheads; moreover, its efficient implementation requires the user to provide some information on the granularity, sacrificing the programmability of the shared memory paradigm.

Adaptive granularity (AG) is a communication scheme that effectively and transparently integrates bulk transfer into the shared memory paradigm. Adaptive granularity provides a limited number of granularities, but ones efficient enough to achieve gains from bulk transfer, without sacrificing any programmability of the shared memory paradigm or requiring any additional hardware. An adaptive granularity protocol consists of two protocols: one for fine-grain data and the other for bulk data. For small-size data, the standard hardware DSM protocol is used and the granularity is fixed to a cache line. For large array data, the protocol for bulk data is used and the granularity varies depending on the sharing behavior at runtime. Simulation results show that AG improves performance by up to 43% over a hardware implementation of DSM (e.g., DASH). Compared with an equivalent architecture that supports fine-grain memory replication at the fixed granularity of a cache line (e.g., Typhoon), AG reduces execution time by up to 35%.

1.3 Contributions

The main contributions of this dissertation are the following:

• The proposal of adaptive execution as a new framework for improving performance on parallel machines. Adaptive execution makes a program execute efficiently on parallel machines, in which performance parameters change dynamically in unpredictable ways, by adjusting the parameters to get closer to the optimal running conditions. This dissertation provides a general framework on how to instantiate adaptive program execution and investigates the relevant issues affecting the effectiveness of the adaptive algorithm.

• The proposal and detailed performance evaluation of adaptive prefetching. Adaptive prefetching is proposed and evaluated as an example of the applicability of adaptive program execution to improve the effectiveness of software prefetching. Adaptive prefetching improves performance over previous static prefetching, in which the prefetch distance is fixed during the whole execution time, by adjusting the prefetch distance depending on the runtime conditions. The previous prefetching algorithm proposed by Mowry [46] focused on the problems of what data to prefetch and minimizing prefetch overhead. Adaptive prefetching focuses on the best time to prefetch and on how to resolve prefetch cancellations.

• The proposal and evaluation of a new communication scheme called adaptive granularity (AG). Adaptive granularity is proposed and evaluated to show that adaptive schemes can also be applied to the component controlling program execution. Adaptive granularity improves performance through the adaptive scheme on the node controller, not on the processor. Adaptive granularity achieves high performance through different granularities depending on the data type and reference behavior, exploiting the advantages of both fine-grain and coarse-grain communication. The experimental results show that AG improves performance by up to 35% over the equivalent system with a fixed granularity.

• Development of a high-performance simulator for shared memory architectures. To evaluate the performance of adaptive prefetching and adaptive granularity on parallel shared-memory machines, we developed an execution-driven simulator called Trojan. The key features of Trojan are: 1) it simulates efficiently both process-model based (e.g., SPLASH [59]) and thread-model based applications (e.g., SPLASH2 [60]); 2) it provides support for virtual memory simulation, which is, to our knowledge, the first execution-driven simulator to offer this functionality; and 3) Trojan does not require making any modification to applications, which results in increased accuracy and usability.

1.4 Organization of Dissertation

Chapter 2 presents the general principles behind adaptive execution, mainly focusing on adaptive program execution. In the section on adaptive program execution, we generalize the adaptive scheme in a new framework and discuss the central role played by software agents and performance monitors. We also discuss in this section some relevant issues affecting the implementation and effectiveness of adaptive algorithms and present related work.

Chapter 3 begins by discussing how to apply adaptive execution to software prefetching. We then discuss the effects of prefetch cancellations, which have a significant effect on performance. We then discuss an adaptive prefetching algorithm and hardware support for adaptive prefetching. Finally, we present simulation results based on synthetic and real parallel benchmarks.

Chapter 4 focuses on another adaptive scheme, called adaptive granularity (AG). We begin by presenting communication alternatives and the granularity for representative parallel systems and discuss the advantages and disadvantages of each approach. We then describe AG in detail: design choices, memory replication, and the protocol. Finally, we present the experimental results and discuss the performance of adaptive granularity.

Chapter 5 presents an execution-driven simulator we have developed to evaluate the performance of adaptive prefetching and adaptive granularity. Finally, Chapter 6 presents a summary of this dissertation and discusses future work in this area.

Chapter 2

Adaptive Execution

Parallel machines consist of several components such as a processor, memory, a cache, a network, and a network controller. Each of these components is a potential performance bottleneck. Performance improvements in the execution time of a parallel program are usually obtained by improving at all levels, from the source code level down to subcomponents such as the node controller. At each level, the available options are usually fixed and different. Source code level improvement tries to find a better algorithm, while compiler optimization tries to generate better code by applying several transformations such as loop optimization and common subexpression elimination.

In simple terms we can say that advances in computer architecture normally consist of incorporating new hardware mechanisms capable of providing higher levels of sustained performance to a larger set of applications. Similarly, we can say that advances in compiler (or subcomponent) optimization technology consist of developing new techniques, such as program transformations, capable of exploiting a larger fraction of the performance space offered by high-performance machines.
Although "providing" and "exploiting" performance are both essential goals, and have guided most of the development of new ideas in the fields of computer architecture and compiler technology, just by themselves they cannot guarantee that applications will run close to their optimal point.

To guarantee that an application will execute efficiently on parallel machines, where performance parameters change dynamically in complex and unpredictable ways, this dissertation proposes adaptive execution. In adaptive execution, a performance parameter is dynamically adapted at runtime, based on system characteristics such as memory latency. The two main ideas behind adaptive execution which set it apart from static schemes are: 1) that the determination at run-time of parameters having an effect on performance should be made based not only on program information, but on machine performance information as well; and 2) that in order to maintain a high level of optimality, some of these parameters have to be re-evaluated based on measurements of the program execution. We argue in this chapter that the efficiency of certain optimizations, especially those dealing with memory latency, network contention, and remote communication, will benefit from adaptive schemes.

Figure 2.1: Structure of adaptive execution. (Diagram: adaptive execution, Chapter 2, divides into adaptive program execution and adaptive control execution, applied respectively to adaptive prefetching, Chapter 3, and adaptive granularity, Chapter 4.)

Figure 2.1 shows the structure of adaptive execution. As shown in the figure, adaptive execution is divided into two categories based on whether execution is performed on the processor or on the node controller. To achieve high performance, efficient execution of both the processor and the node controller is required. For adaptive program execution, we generalize the adaptive scheme in a new framework in Section 2.2 and apply it to software prefetching (discussed in detail in Chapter 3). For adaptive control execution, we discuss it in Section 2.3 and apply it to the granularity of sharing (discussed in detail in Chapter 4).

2.1 Issues for Adaptive Execution

In this section we discuss some relevant issues that are common to both adaptive program and adaptive control execution.

2.1.1 Stability and Convergence

Stability and convergence are two related concepts that are useful in evaluating the effectiveness of any adaptive (control) system. We say that an adaptive algorithm converges if changes in the parameters controlling the execution bring its performance closer to the optimal point. In contrast, we say that an adaptive algorithm is stable if changes in the performance parameters controlling the execution of the algorithm keep the optimal point within a bounded region. Stability is a necessary, but not sufficient, condition for convergence.

The stability and convergence of an adaptive algorithm have to be taken into consideration especially for optimizations that suffer from feedback effects, like data prefetching and multithreading. These latency-tolerating techniques take advantage of the available network and memory bandwidth and eliminate CPU stall time by increasing the number of outstanding requests.
A naive adaptive algorithm that increases the rate of requests, in an attempt to cover more latency, without considering the amount of extra contention induced at the routing switches, directories, and/or memory modules, will eventually saturate one of these components, which in turn will significantly increase the magnitude of the latency of each individual request. Subsequent attempts at increasing the rate of requests to cover some of the extra latency will only exacerbate the situation. Therefore, adaptive execution schemes need to have a way of avoiding degrading performance as a result of their efforts at improving it.

2.1.2 Local vs. Global Adaptive Optimizations

So far we have presented adaptive execution as basically a local optimization, where individual nodes monitor their own execution, evaluate the effectiveness of control parameters, and implement any necessary local changes to improve performance. All these steps are supposed to be carried out without having to communicate data or coordinate the actions of multiple nodes. Although this scheme works for some optimizations, others require some amount of inter-node communication and coordination to correctly evaluate the expected effectiveness and to implement changes to program behavior.

In the next chapter we will see that data prefetching is a good example of a local algorithm, which only requires estimates of the amount of local prefetch cancellation and the magnitude of the remote latency as seen from the local node. By using these parameters, each node can compute the necessary changes to the prefetch distance while ignoring the policies of other nodes. In contrast, optimizations that require making changes to the cache coherence protocol will necessarily require collecting global information on the amount of data sharing and the access patterns in order to make a reasonable cost-benefit analysis. Furthermore, implementing protocol changes will probably require the coordination of all nodes.

Therefore, it appears that adaptive schemes that are completely local in nature should be easy to implement. On the other hand, optimizations that require global coordination, either at the detection or implementation level, will require interaction between agents, libraries, and, in some cases, the operating system. It is not clear that such complicated schemes will provide sufficient improvement to justify their complexity.

2.2 Adaptive Program Execution

Adaptive program execution (APE) makes it possible for programs to adapt themselves in response to changes in machine conditions. This adaptation is achieved through the collaboration of the hardware and the compiler. This section discusses adaptive program execution in detail.

2.2.1 Framework

Guaranteeing that an application will execute efficiently on the main processor requires the run-time presence of two cooperating components: an inspector capable of observing the behavior of applications and an agent capable of modifying program behavior. Moreover, the inspector and the agent need to communicate in order to close the feedback loop, which is an essential requirement in any adaptive system. In principle, a hardware performance monitor could play the role of the inspector, but not of the agent, because it lacks the ability to affect the program's behavior. At the same time, compilers might, at least in principle, play the role of agents, but given the way compilers normally operate, they lack the necessary information to detect when changes are required.
In addition, all required adjustments would have to be made at run-time, when traditional compilers can no longer act on programs.

The idea of using inspectors and agents (sometimes called executors) to complete, at run-time, the compilation process is not new in itself, as it has previously been proposed either as a way of enforcing correctness or of improving performance. An example of the former are languages with very liberal type systems, which sometimes require inspectors to make type inferences at run-time in order to enforce type safety. With respect to performance, inspectors and executors have been used to parallelize and schedule loops whose dependences cannot be determined completely at compile time [55]. We consider these approaches examples of dynamic program execution (sometimes referred to as dynamic compilation), but from our perspective, they are not true examples of adaptive program execution because the effects of the compilation are not re-evaluated and modified continuously during the execution. It is clear that in many situations dynamic compilation will be enough to guarantee good performance, but in other situations decisions taken at the beginning of the execution will not necessarily be optimal throughout the whole computation.

In adaptive program execution, several parameterized versions of the same algorithm coexist together under the control of a software agent. The software agent represents the extra code added by the compiler (or the programmer) with the explicit purpose of managing the run-time behavior of the computation. Specifically, the responsibilities of agents are: 1) to select the "best" version to execute according to program parameters and machine conditions; 2) to complete the compilation process by instantiating parameters left unspecified at compile time; 3) to periodically re-evaluate the effectiveness of the execution by interacting with the hardware monitor; and 4) to make the necessary major or minor adjustments required to keep the execution as close as possible to its optimal point. Moreover, libraries and reusable code should also benefit from adaptive program execution by making it possible to customize an individual execution of a function to the specific set of arguments and machine conditions prevailing at invocation time. Furthermore, adaptive program execution should also improve the overall effectiveness of optimizing compilers by allowing them to generate various highly optimized but potentially unsafe versions of the same code. Performance could be improved using this approach as long as the software agents can ensure correctness by selecting and instantiating the "best" safe version given the run-time values of the program variables.

2.2.2 Classification of Program Execution

Let us be more formal in our definitions of static, dynamic, and adaptive program execution. Here we focus on performance, but the definitions also apply to other aspects, like correctness. We say that, from the point of view of performance, the program execution is static if all performance parameters are determined at compile time and their values are not affected by run-time program or machine conditions. Here program conditions refer to the set of values that variables take during the execution, as well as other relevant program information, like array sizes, allocation addresses, etc.
Similarly, machine conditions represent all the information related to the behavior of hardware components, like cache misses, the number of local and remote messages, the amount of synchronization, etc. We say that the execution is dynamic if some performance parameters are determined at run-time but, once set, their values are unaffected by subsequent program or machine behavior. Finally, we say that the program execution is adaptive if the performance parameters change over time as a result of observing their combined effects on performance.

2.2.2.1 Illustration of Program Execution

Figure 2.2 illustrates the basic differences between the static, dynamic, and adaptive program execution models in the context of a simple loop nest. Static program execution is represented by Figure 2.2(a). Here only one version of the code is executed for the whole duration of the loop nest. Dynamic program execution is illustrated in Figure 2.2(b). Here there are three different versions, each optimized for a particular set of values and under the control of an agent (executor). At the beginning of the loop, the agent selects one of the versions and replaces the values of the unknown parameters based on program information, e.g., data dependencies. The important observation here is that once the parameters have been set, they do not change during the execution. In contrast, Figure 2.2(c) shows how adaptive program execution uses program and machine information to optimize the set of parameters. The diagram shows two levels of adaptation. At the lower level a local agent is executed every k iterations to make adjustments only to version-dependent parameters, while at a higher level a global agent is executed every j · k iterations to evaluate whether the execution should continue on the same version or move to a different one.¹ Both the global and local agents rely on the performance monitor to collect and provide information about the execution.

Figure 2.2: Static, dynamic, and adaptive program execution. Panels: (a) static program execution, (b) dynamic program execution, (c) adaptive program execution. In static program execution (a) the compiler generates a single fixed version of the code. In dynamic program execution (b) one or several parameterized versions are generated at compile time; a run-time agent (executor) selects one based on program information and instantiates its relevant parameters, and once set, the parameters are not modified. In adaptive program execution (c) a two-level adaptive scheme is implemented with the help of global and local agents. At the higher level the global agent selects the most profitable version to execute based on existing program and machine run-time conditions. At a lower level a local agent makes minor adjustments to the version-dependent parameters using data collected by the performance monitor.
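To make the two-level scheme of Figure 2.2(c) concrete, the sketch below shows one possible shape for the control code the compiler could emit around a loop nest. It is only an illustration of the idea: the helper names, the stubbed monitor, and the deliberately simplistic adjustment policies are assumptions of this sketch, not the algorithms evaluated in this dissertation.

    /* A minimal sketch of the two-level scheme in Figure 2.2(c).
     * Every name and policy here is hypothetical; j, k > 0 is assumed. */

    typedef struct {
        unsigned late_prefetches;       /* prefetches that arrived too late     */
        unsigned cancelled_prefetches;  /* prefetches killed by cache conflicts */
    } monitor_sample_t;

    typedef struct {
        int version;   /* which parameterized version of the loop body to run   */
        int distance;  /* a version-dependent parameter, e.g. prefetch distance */
    } exec_params_t;

    /* Stand-in for reading the hardware event counters. */
    static monitor_sample_t read_monitor(void)
    {
        monitor_sample_t s = { 0, 0 };  /* a real system would read hardware here */
        return s;
    }

    /* Local agent: nudge a version-dependent parameter toward the optimum. */
    static void local_agent(exec_params_t *p, monitor_sample_t s)
    {
        if (s.late_prefetches > s.cancelled_prefetches)
            p->distance++;              /* latency not covered: prefetch earlier  */
        else if (s.cancelled_prefetches > s.late_prefetches && p->distance > 1)
            p->distance--;              /* too much cache interference: back off  */
    }

    /* Global agent: decide whether a different version is more profitable
     * (placeholder policy based on a single arbitrary threshold). */
    static int global_agent(monitor_sample_t s, int current_version)
    {
        return (s.cancelled_prefetches > 1000) ? 1 : current_version;
    }

    /* Stand-in for the parameterized loop body, run k iterations at a time. */
    static void run_version(const exec_params_t *p, long start, long len)
    {
        (void)p; (void)start; (void)len;  /* the real computation would go here */
    }

    void adaptive_loop_nest(long n, int j, int k)
    {
        exec_params_t p = { .version = 0, .distance = 4 };

        for (long i = 0; i < n; i += k) {
            long len = (n - i < k) ? (n - i) : k;
            run_version(&p, i, len);             /* execute k iterations         */

            monitor_sample_t s = read_monitor(); /* inspector: event counters    */
            local_agent(&p, s);                  /* adjust local parameters      */

            if (((i / k) + 1) % j == 0)          /* every j*k iterations         */
                p.version = global_agent(s, p.version);
        }
    }

Chapter 3 replaces the stub policies above with the actual adaptive prefetching algorithm and the hardware support it requires.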
In Section 3.5 we show how adaptive prefetching can be implemented using a scheme similar to the one shown in Figure 2.2(c). At the higher level, two different versions, each prefetching a different amount of data, are selected according to the size of the localized space (the subregion of the iteration space whose data footprint fits in the cache). At the lower level, information provided by the hardware monitor about the amount of prefetch cancellation and the memory latency determines the magnitude of the prefetch distance, which is adjusted dynamically without disrupting the software pipeline.

¹ It is clear that the adaptive scheme in Figure 2.2(c) is cost-effective only when n is sufficiently large to justify the extra overhead; otherwise executing the static version or one of the dynamic ones will be better. If n is unknown at compile time, then the simple static version can be included in the adaptive scheme and the global agent can select it when n is too small to justify the extra complexity of the adaptive scheme.

2.2.3 Interaction between Agents and Monitors

In adaptive program execution, software agents and hardware monitors work together to improve performance. It is clear that the main responsibility of the performance monitor should be to count hardware events, while that of the software agent should be to make the necessary changes to the relevant program parameters (enforce adaptive performance policies). In addition, there has to exist a mechanism that periodically reads the values of the hardware event counters, computes algorithm-dependent performance metrics, and determines whether or not an explicit action from the agent is required. Different trade-offs between cost and performance exist depending on whether these responsibilities are implemented in hardware or in software. There are basically three ways in which this can be done: polling, performance-driven interrupts, or active messages.

2.2.3.1 Polling

In polling, the responsibility of interfacing with the hardware monitor is placed completely in the software agent. The main advantage of polling lies in not requiring any extra hardware support for its implementation. When using polling, however, the effectiveness of an adaptive algorithm is strongly determined by the ability of the compiler to predict, using only static machine and program information, an adequate run-time value for the polling frequency. If this frequency is set too high while machine conditions remain relatively constant, the unnecessary overhead incurred will tend to offset any possible gains. On the other hand, agents failing to respond to sudden changes in machine conditions, due to infrequent polling, will unnecessarily diminish the potential effectiveness of adaptive program execution. Some form of adaptive polling that reduces and/or increases the polling frequency based on the variability of machine conditions can be used to control the response of the adaptive scheme. However, doing this increases the complexity of software agents, as now they are required to adapt not only the algorithm but themselves as well.

2.2.3.2 Performance-Driven Interrupts

Interrupts eliminate the overhead of polling by shifting the responsibility of having to check when adaptive changes are required from the agent to the hardware monitor. When using performance-driven interrupts, threads execute unaware of the monitoring hardware until an explicit interrupt is received indicating that an action is needed.
Here, the software agent is implemented as a collection of user-programmable interrupt handlers that respond to specific performance conditions. The main disadvantage of interrupts lies in the extra hardware required, not only to allow the performance monitor to interact with the interrupt facilities, but in the monitor itself. Now, a simple set of event counters is no longer sufficient to detect that an action is required. At a minimum the hardware monitor must contain a set of programmable threshold registers associated with the event counters and under the control of agents. These threshold registers determine the set of conditions, represented as either upper or lower bounds on the number of events, that will trigger a performance interrupt. For example, if it is known at compile time that the number of cache accesses issued by a subset of loops inside a loop nest is n, then, at the beginning of the subset, the software agent can clear the register counting cache misses and set the threshold register to some fraction alpha of n, e.g., alpha * n. In the event that the number of actual misses reaches the threshold, an interrupt will be sent to the program. The appropriate handler could then attempt to eliminate the cross interference by remapping some of the arrays to non-conflicting physical frames, in a similar way as it is done in [8].

2.2.3.3 Communication and Performance Processor

Recent proposals for multiprocessor architectures, like the Stanford FLASH and Wisconsin Typhoon, have incorporated in the design of the processing nodes a dedicated programmable processor (communication controller) for the efficient implementation of flexible cache coherence and message passing protocols. These communication controllers send, receive, and process all intra- and inter-node messages (active messages) by interacting with the other node components (CPU, cache, memory, directory, network interface, and I/O ports) using input and output queues. A hardware-based dispatcher removes messages from the input queues and uses information stored in the headers to invoke the appropriate handlers. In this way, by using message-type information, it is possible to support different families of protocols efficiently. The appeal of using communication controllers for adaptive program execution resides in the ability to execute arbitrary handlers and in having a fully functional processor. Associating different low-level handlers with the same command makes it possible to count events and to compute statistics at arbitrary frequencies without disrupting the computation. Because handlers are written in a high-level programming language, it is quite easy to write different versions of the same handler, where in addition to doing the specific work associated with the handler, one version could also count events, another synthesize statistics, and another one check whether or not an action from the software agent (high-level handler) is required. By changing the distribution of messages of each type that are executed, it is possible to efficiently implement adaptive polling schemes. For example, for data prefetching, the arrival of a reply associated with a previously executed prefetch could result in the invocation of one of two possible handlers according to information in the message type. Both handlers update the appropriate event counters, but only one of them contains the extra code that compares the counters against the threshold values and sends an interrupt to the software agent for further action.
The decision of which message type to associate with each prefetch can be made at issue time using a simple counter. This extra code is part of the handler that implements the prefetch request. Whether or not future parallel and distributed machines will support general-purpose communication processors is still a matter of intensive research.

2.2.4 Hardware Support

To alleviate the burden of programmers in tuning their applications to take advantage of the performance offered by parallel machines, hardware performance monitors have been incorporated in several multiprocessors. These have been used successfully to analyze the complex interactions affecting the run-time behavior of applications. Some parallel computers, such as the Stanford DASH, Sun's SPARCcenter 2000, and KSR, make this information available to the programmer through a collection of library routines [19, 36, 58]. These performance monitors provide measurements about the system without perturbing the execution of the program. In most cases, however, only postmortem performance data (analyzed after the program has finished) has been used.

The performance monitor is the hardware device that observes the program execution and gives performance data to the user/compiler. The information provided by the performance monitor plays an important role in adaptive program execution. Three requirements on the performance data are important. First, the performance data should be collected and provided without perturbing the program being executed. Second, the data should be useful enough to give a clue about how to improve the program, and this should be accomplished without increasing hardware complexity. Third, the overhead of accessing the performance data should be low because the data is accessed at runtime.

Parallel machines are composed of several subsystems such as memory, cache, and network. Each of these subsystems is a potential performance bottleneck. To determine individual bottlenecks, we need event counters in each of these subsystems. The most common performance problem is related to the poor cache behavior of the application, which is not directly visible to the programmer. Thus, we need event counters for cache behavior like the number of cache misses, the number of prefetches, and the number of prefetch cancellations. The functionality needed from the performance monitor depends on the problem to which APE is applied.

2.2.5 Compiler Support

In APE, some program transformation is needed to generate adaptive code, and the transformation is dependent on the problem. By having the compiler perform this transformation, the efficiency of the optimization can be increased and the programmer is relieved of the burden of transforming the code. One way of generating adaptive program execution code is by a loop split. Assuming there are a reasonable number of loop iterations, the loop is split into several minor loops to adapt to the performance changes. At the beginning of each minor loop, the performance monitor is checked and adaptation is done according to the performance data. How to split a loop (e.g., the frequency of performance data checks) is related to the overhead of performance monitor checking and the variance of the performance data. An example of a program transformation is shown in the next chapter for prefetching.

2.2.6 Overhead of Adaptive Program Execution

Since adaptive program execution has a cost as well as a benefit, care must be taken so that the cost does not offset much of the benefit.
The first step towards minimizing the overhead is to characterize and quantify the overhead involved. The overhead of APE can be decomposed into the following: (1) performance monitor checking overhead (i.e., the number of cycles needed to read the perfor mance data from the performance monitor); (2) performance parameter adjustment overhead; and (3) loop split overhead (i.e., additional instructions for loop splitting). The exact amount of overhead depends on the way of checking performance monitor (i.e, polling, interrupt, or active message) and the frequency. There is a tradeoff between measurement period and overhead: a longer measurement period means less adaptation with less overhead; a shorter period adapts a faster response, but with more overhead. 20 2.2.7 Related Work In traditional compiler optimizations, where important runtime parameters are not known to the compiler, these can be either given by the programmer [15] or obtained using profiling techniques [27]. There have been basically three approaches to re flecting dynamic behavior of a program into the compilation process: profile-based optimizations [10, 27], explicit programmer hints with the support of language con structs [15], and the combination of cost model with runtime test. Although some of these schemes such as a profile-based optimization can help find bottlenecks and im prove the accuracy of the prediction, they suffer fundamental problems of efficiency and adaptability. 2.2.7.1 Profile-Based Optimization In the profile-based optimization, programs are executed with training sets of test data and behavioral aspects of the programs are analyzed. This postmortem analysis (the analysis is made after the program finishes) may reveal bottleneck in the code which can be replaced with more efficient routines by the programmer. The main disadvantages of this approach are that (1) a program must be run first to capture the bottleneck, and (2) the performance data can be changed in the real execution time due to the changes of execution environment. 2.2.7.2 Language Constructs with User Participation This scheme addresses data locality and load balancing with runtime system, lan guage constructs, and explicit programmer participation. For example, COOL, which is a concurrent object oriented language developed at Stanford, provides ab stractions for the programmer to supply hints about the the data objects referenced by parallel tasks [15]. These hints are used by the runtime system to distribute tasks and migrate, so that tasks execute close to the objects they reference. The main disadvantage of this scheme is that (1) a programmer must participate in the anticipation of the dynamic behavior which can be a big burden, and (2) the hints supplied by the programmer may be incorrect. 21 2.2.7.3 Combination of Cost Model with Runtime Test This approach combines the cost model that can be used to estimate the cost of straight line code with runtime test for unknown values like control structures [67]. This scheme is different from APE in that (1) the former does runtime test mostly for the unknown control structures for instruction scheduling in superscalar machines, while APE does for performance data, and (2) the test is not adapted in the former whereas APE adapts performance parameters depending on the test. 2.3 A daptive Control Execution In this dissertation, control execution is defined as the work performed on the node controller. 
Similar to the definitions in program execution, we say that the control execution is static, if all performance parameters are determined before running the program and their values are not affected by run-time program or machine condi tions. We say that the control execution is adaptive, if the performance parameters change over time as a result of observing the program execution. In adaptive pro gram execution, the compiler plays a key role that transforms original code into adaptive code. In adaptive control execution, the compiler plays an auxiliary role that gives some information for efficient execution on the node controller because the work done on the node controller is not directly visible to the program execution. One of the main jobs performed on the node controller is to maintain cache coherence. While maintaining the cache coherence, one of the key components that affect performance is the granularity, the unit of communication. Providing the optimal granularity is essential for achieving high performance in multiprocessor systems, but it can be varied depending on the applications. Furthermore, the optimal granularity may change within the same program as the program executes. For fine-grained sharing, traditional hardware DSM provides efficient communication through the fixed granularity of a cache line. For coarse-grained sharing, however, this fine-grained communication can be less efficient than bulk transfer. The main advantage of bulk transfer is that we can reduce communication costs through fast data transfer and replication. 22 To show that control execution can also gain benefits from the adaptive scheme, we applied it to the granularity problem. Adaptive granularity provides a variable size granularity which can be varied at runtime, depending on the reference behavior. When the node controller receives a command (e.g. load or store) from the local cache, it determines the type of communication (bulk transfer or standard load-store communication) based on the data type and sends the request to the home node. When the node controller on the home node receives the bulk request, it determines the granularity based on sharing behavior. In Chapter 4, we discuss the adaptive granularity in detail. 2.3.1 Related Work There have been several adaptive approaches to improving the performance of control execution, mainly focusing on cache coherence protocols. Data objects in parallel programs can be distinguished by a small number of distinct data sharing patterns [7, 13, 29]. For example, Gupta and Weber [29] classifies data objects into the fol lowing five categories: (1) read-only data objects, (2) migratory data objects, (3) mostly-read data objects, (4) frequently read/written objects, and (5) synchroniza tion objects. These classifications can be used to perform adaptive cache coherence protocol. We discuss briefly some of the adaptive coherence protocols in this section. 2.3.1.1 Adaptive Cache Coherence Protocols In general, cache coherence protocols use either a write-invalidate or a write-update strategy. In write-invalidate, a processor can update its own copy after invalidating all other cached copies of shared data. In write-update, a processor broadcasts up dates to shared data to other caches. Because the write-update causes interprocessor communication on every write access to shared data, write-invalidate is generally re garded as more suitable than write-update in a scalable multiprocessor. 
However, for some data types such as frequently read/written objects, write-update can be more efficient than write-invalidate because invalidating and re-requesting them under write-invalidate incurs more overhead than under write-update. To exploit the advantages of write-invalidate and write-update (or some other schemes such as migratory-on-read-miss), several adaptive coherence protocols have been proposed [7, 13, 4]. These adaptive coherence protocols are motivated by the observation that sharing patterns vary substantially between different programs, and even within the same program as it executes. Since a protocol that supports only one kind of policy cannot perform uniformly well for all data sharing patterns, several policies coexist in adaptive coherence protocols. For example, the adaptive coherence protocol proposed by Cox and Fowler [13] dynamically identifies migratory shared data and uses this information to adaptively switch between a replicate-on-read-miss and a migratory-on-read-miss policy. By reacting quickly to changes in data-sharing patterns, adaptive coherence protocols can provide better performance than static ones.

2.4 Chapter Summary

This chapter discussed adaptive execution in detail. We classified adaptive execution into adaptive program execution and adaptive control execution. For adaptive program execution, we generalized the adaptive scheme in a new framework and discussed some relevant issues affecting the adaptive algorithm. In the next two chapters, we apply adaptive execution to software prefetching and the granularity of sharing, and show that the adaptive schemes improve performance significantly.

Chapter 3
Adaptive Prefetching

In this chapter we discuss how to enable adaptive behavior in software prefetching. We start by presenting a very simple dynamic scheme in which one of two versions, each prefetching a different amount of data, is executed depending on the value of a run-time argument. We then introduce an adaptive scheme that adjusts the value of the prefetch distance based on information collected using a hardware monitor. We show that when there are no cache mapping conflicts, the prefetch distance should be increased to cover as much memory latency as possible, but when some of the prefetching streams exhibit cache conflicts, the distance should be reduced to avoid destroying prefetches even if this means not covering all the latency. In Section 3.4 we show that this simple policy maximizes the amount of latency that can be covered or, equivalently, minimizes the amount of stall time that remains.

3.1 Static and Dynamic Prefetching

Consider the code excerpt given in Figure 3.1.(a). This is a simple kernel in which the sum of a sliding window of elements of array b is accumulated on array a. First, let us assume that the compiler, after some analysis, decides to use software prefetching to eliminate as much CPU stall time as possible. In order to compute the correct prefetch predicates, i.e., the subset of iterations that require inserting explicit data prefetches in the code, the compiler needs to know the value of parameter n.
It is clear that if n is small enough so that the data accessed inside the innermost loop by both references is guaranteed to fit in the cache without cross-interference, then all the prefetches for a and almost all for b have to be issued during the first iteration of the outermost loop, as is shown in Figure 3.1.(b).[1] If this is not the case, the compiler has to assume that the data loaded into the cache at the beginning of the innermost loop will be displaced by the data loaded at the end and vice versa. Therefore, data prefetches have to be inserted on all iterations of the outermost loop, as Figure 3.1.(c) shows.

    void fline (double a[], double b[], int n, int m)
    {
        int i, j;
        for (i=0; i<m; i++)
            for (j=0; j<n; j++)
                a[j] += b[i+j];
    }

    (a) Original source code

    for (j=0; j<PF_DIST; j+=4) {
        Prefetch(a[j]);
        Prefetch(b[j]);
    }
    for (j=0; j<n-PF_DIST; j+=4) {
        a[j]   += b[j];
        a[j+1] += b[j+1];
        a[j+2] += b[j+2];
        a[j+3] += b[j+3];
        Prefetch(a[j+PF_DIST]);
        Prefetch(b[j+PF_DIST]);
    }
    for (j=n-PF_DIST; j<n; j++)
        a[j] += b[j];
    for (i=1; i<m; i++) {
        if (&b[i+n-1] % 4 == 0)
            Prefetch(b[i+n-1]);
        for (j=0; j<n; j++)
            a[j] += b[i+j];
    }

    (b) Prefetching on first iteration of i

    for (i=0; i<m; i++) {
        for (j=0; j<PF_DIST; j+=4) {
            Prefetch(a[j]);
            Prefetch(b[i+j]);
        }
        for (j=0; j<n-PF_DIST; j+=4) {
            a[j]   += b[i+j];
            a[j+1] += b[i+j+1];
            a[j+2] += b[i+j+2];
            a[j+3] += b[i+j+3];
            Prefetch(a[j+PF_DIST]);
            Prefetch(b[i+j+PF_DIST]);
        }
        for (j=n-PF_DIST; j<n; j++)
            a[j] += b[i+j];
    }

    (c) Prefetching on all iterations of i

Figure 3.1: Two schemes for software pipelining based on the runtime value of argument n. The original code without data prefetching is represented in (a). (b) shows the amount of prefetching needed to hide the memory latency when the localized space is assumed to cover both loops. In (c) the localized space is assumed to cover only the innermost loop.

In a traditional compiler, where all decisions are made based on static analysis, the compiler could apply inter-procedural data flow analysis in an attempt to determine the value of argument n. Several situations can make the compiler fail: 1) the function is called from an external module which is compiled separately; 2) the value of n is determined at run-time (either by reading it or by explicitly computing it); or 3) the function is called at different places, each time using different arguments. If any of these conditions is true, the compiler cannot decide which version is the best to execute in all possible situations. In dynamic execution, on the other hand, the compiler can generate both versions plus extra code representing the software agent whose responsibility is to test n at the beginning of the function and then select the version which better matches the actual argument. The important observation here is that much less analysis is required to find out the run-time conditions needed to determine the best version to execute than what inter-procedural data flow analysis requires. Moreover, an agent can always make sure that the most efficient version is executed independently of how many times the function is called and the actual values of the arguments. It is important to note that other optimization approaches, like those based on profiling information, cannot do better than this.
The use of statistics obtained through program instrumentation to decide which version to execute can only help improve the average case, but it is only by generating both versions and by delaying the decision until all the required information is completely defined, as it is done in Figure 3.1, that we make sure that the "best" version is always executed.

[Footnote 1] As the outermost loop advances a new element of b is read which was not referenced in previous iterations. Hence, an explicit prefetch has to be issued to cover its latency.

[Figure 3.2: The effect of cache mapping conflicts on the effectiveness of prefetching. Panel (a): prefetch distance of four cache lines per reference, no prefetch cancellation (normalized time = 1.0). Panel (b): the same prefetch distance with cache mapping conflicts and prefetch cancellation (normalized time = 2.35). Panel (c): prefetch distance of two cache lines, cache mapping conflicts but no prefetch cancellation. (a) shows that without cache mapping conflicts it is possible to eliminate most of the stall time by setting the prefetch distance as large as possible. (b) shows that the same approach is ineffective when the prefetching streams for a[0] and b[i] exhibit cache conflicts within the window of vulnerability of one of the streams. In contrast, (c) shows that by reducing the prefetch distance, it is possible to eliminate a significant amount of stall time even if not all latency is covered.]

3.2 The Effects of Prefetch Cancellation

The next problem for the compiler is to decide which value to use for the prefetch distance (variable PF_DIST in Figure 3.1.(b-c)). It is clear that the best prefetch distance is the one that allows the data to arrive "just in time" to be used by the processor. Finding this "optimal value" requires predicting the run-time magnitude of the latency, which itself depends on many factors[2]: 1) whether the data is local or remote; 2) the number of messages required to satisfy prefetches, e.g., if data is not available at the home node, the request is re-routed to other nodes; 3) the distance, measured in number of hops, between the request node, home node, and dirty node(s); 4) the amount of contention in the network switches, memory modules, directories, etc.; 5) the fraction of misses covered by prefetches and the number of redundant prefetches (those covering a hit); and 6) the amount of prefetches cancelled.
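Before turning to these difficulties, it is worth stating the first-order estimate a compiler would start from. A common rule of thumb in the software-prefetching literature is to prefetch ceil(l / s) iterations ahead, where l is the expected miss latency in cycles and s is the number of cycles spent in one iteration of the (possibly unrolled) loop body. The sketch below is only that rule of thumb written out, not a formula taken from this dissertation; the run-time factors listed above are exactly what make l hard to pin down at compile time.

    /* First-order prefetch-distance estimate (a rule-of-thumb sketch).
       Both inputs are values the compiler would have to guess statically. */
    int estimate_pf_dist(int latency_cycles, int cycles_per_iteration)
    {
        /* Round up so the prefetched line arrives no later than its first use. */
        return (latency_cycles + cycles_per_iteration - 1) / cycles_per_iteration;
    }

For example, with the baseline 120-cycle 2-hop access of Table 3.1 and, say, 8 cycles of work per iteration, the estimate gives a distance of 15 iterations; underestimating the latency makes prefetches arrive late, while overestimating it widens the window in which they can be cancelled.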
To illustrate the inherent difficulties associated in predicting the best value for the prefetch distance and to motivate the need for an adaptive scheme, let us focus on how prefetch cancellation affects the determination of the optimal prefetch distance. Prefetching is effective because processors do not have to wait for prefetches to complete and can proceed as long as the data they need is already present in their respective caches. This implies that subsequent memory operations, whether these are loads, writes, or other prefetches can potentially collide in the cache with an in-flight prefetch or with data loaded by another prefetch that has not yet been referenced by the processor. This ’’window of vulnerability” associated with every prefetch is called the interference window and starts when the prefetch is issued and ends when the data is used. It is clear that the size of the interference window is completely determined by the number of instructions executed in between the prefetch and its associated first use, and the inherent mapping conflicts of all the data references issued within the window. For the specific code excerpt given in Figure 3.1, in addition to the prefetch dis tance, the other relevant factor that determines whether or not the prefetch streams of array a and b will cancel each other depends on the particular cache positions of 2To make the discussion more concrete, we assume the existence of a distributed, directory- based, and write-invalidation cache coherency protocol similar to the one used in DASH [18]. The discussion, however, is not dependent on the particular details of the protocol or its implementation. 29 references a[0] and b[i]. If the distance between these two references, measured in cache lines, is larger than the prefetch distance, then no prefetch cancellation will occur. It is clear that as the outermost loop advances the cache mapping of b[i] moves relative to a[0], so unless m is very small, the prefetch streams will even tually have to interference with each other. Figures 3.2.(a-b) illustrates these two scenarios. The assumptions made in the figure are: a) the cache is direct-mapped and of an unspecified size3; b) cache lines can hold four consecutive elements; and c) the base memory latency (no network, bus, or directory contention) requires sixteen iterations of prefetch distance (four cache lines per reference). In Figure 3.2.(a) we first assume that the cache mapping distance between a[0] and b[i] is much larger than the prefetch distance. Hence, none of the prefetches issued inside the innermost loop are cancelled, consequently most of the latency can be hidden from the processor. In contrast, Figure 3.2.(b) shows what happens when the mapping distance is less than the prefetch distance. On a direct-mapped cache it is not possible to satisfy two prefetches mapping to the same line. Therefore, one of the two prefetches is always cancelled which in turn will cause the processor to suffer stall time when it tries to read the missing data. It is clear that better performance can be obtained by reducing the prefetch distance even when doing this makes it impossible to hide all the latency. As Figure 3.2.(c) shows, if instead the prefetch distance is reduced to only eight iterations (prefetching two cache lines ahead of time), then all the prefetch cancellation is eliminated at the cost of incurring a small amount of stall time. Two observations can be made here. 
First, only the first prefetch in each group of four (two for each array) suffers from uncovered latency. This is because the other prefetches in the group "take advantage" of the stall time incurred by the leading prefetch (the first one in each group) to complete.[4] Second, it should be clear that any adaptive scheme that changes the value of the prefetch distance at run-time so that no prefetch cancellation occurs, while attempting at the same time to maximize the amount of memory latency covered, should outperform any other scheme that uses a fixed prefetch distance.

[Footnote 3] The loop nest presented in Figure 3.1 contains only two array references; therefore it is easy to eliminate the prefetch cancellation by increasing the cache associativity. More complex loop nests, however, may suffer from prefetch cancellation even on caches with higher associativity. Moreover, high performance machines tend to have caches with a low degree of associativity.

[Footnote 4] Here, for the sake of argument, we assume that prefetches arrive in the same order that they were issued. Non-leading references can also induce a small amount of stall time when for some reason their relative distance from the leader increases.

3.3 Adapting the Prefetch Distance

In the previous section we showed that when the cache positions of elements a[0] and b[i] are close enough to each other to fall within their respective windows of interference, the innermost loop suffers a significant amount of prefetch cancellation. From this we can conclude that a simple adaptive scheme, in which the prefetch distance is computed by the software agent at the beginning of the loop using the mappings of a[0] and b[i], can do better than any static scheme. This approach, however, works well only on caches that are accessed using virtual addresses or when the compiler (through the agent) can obtain the actual physical mappings.

When caches are accessed using physical addresses or when arrays are not allocated in a contiguous region of physical memory, it is not always possible for the agent to determine the best prefetch distance that will minimize, at the same time, stall time and prefetch cancellation. The following loop nest illustrates this problem.

    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            a[j] += b[j][i];

Here array b is not traversed sequentially, so it is not possible to determine whether collisions will occur by only inspecting the addresses of a[0] and b[j][0]. Finally, most loop nests contain several references which are traversed using different patterns. It is very difficult, if not impossible, to characterize all the possible patterns that can result in prefetch cancellation when computing the optimal prefetch distance. This means that a general and effective adaptive scheme for the prefetch distance should not rely on always knowing the exact cache mappings of the data.

We now present a general adaptive scheme for software pipelining that minimizes the amount of stall time by dynamically changing the value of the prefetch distance. The main idea here is to change the value of the prefetch distance when performance information indicates that the local "optimal" prefetch distance (as observed from each thread) changes. This can be accomplished by checking the relevant event counters at a certain frequency and using them to quantify how much and in which direction the prefetch distance should be changed.
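One way to picture this check is sketched below. The counter names and the read_counter() routine are hypothetical stand-ins for whatever interface the hardware monitor exposes; the point is only that the agent works with per-interval deltas of the counters, so that the two fractions reflect the behavior of the most recent segment rather than the whole execution.

    /* Sketch: deriving frac_cancel_pref and frac_late_pref from event counters.
       read_counter() and the counter identifiers are assumed, not a real API. */
    enum { CNT_USEFUL_PREFS, CNT_CANCELLED_PREFS, CNT_LATE_PREFS };

    extern long read_counter(int which);    /* hypothetical monitor access */

    void sample_prefetch_stats(double *frac_cancel_pref, double *frac_late_pref)
    {
        static long last_useful = 0, last_cancel = 0, last_late = 0;

        long useful = read_counter(CNT_USEFUL_PREFS);
        long cancel = read_counter(CNT_CANCELLED_PREFS);
        long late   = read_counter(CNT_LATE_PREFS);

        long d_useful = useful - last_useful;   /* events in the last interval */
        long d_cancel = cancel - last_cancel;
        long d_late   = late   - last_late;

        last_useful = useful;
        last_cancel = cancel;
        last_late   = late;

        *frac_cancel_pref = d_useful > 0 ? (double)d_cancel / d_useful : 0.0;
        *frac_late_pref   = d_useful > 0 ? (double)d_late   / d_useful : 0.0;
    }

Here the fraction of late prefetches stands in for an explicit latency estimate, anticipating the argument made in Section 3.4 that stall time is easier to measure than latency itself.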
It is clear that any adaptive scheme has to react quickly when prefetch cancellation occurs or when the magnitude of the memory latency suddenly changes. This implies that the agent may need to modify the prefetch distance several times during the software pipeline. To accomplish this we split the loop containing the software pipeline body (the prologue and epilogue remain the same) into a two-loop nest, in a similar way as it is done in other loop optimizations like tiling (blocking). Figure 3.3.(c) shows how this could be done on the software pipeline given in Figure 3.1.(c).[5] The number of iterations in the controlling loop (the one for index j1) determines how frequently the prefetch distance is adjusted. In Section 3.7.4, we discuss how this frequency can be dynamically changed in order to minimize the overhead generated by executing the agent while attempting to maximize its response to changes in machine conditions. This controlling loop evaluates the effectiveness of the current prefetch distance, computes the new distance for the next subpipeline interval, and increases or decreases the number of outstanding prefetches in order to match the new distance.

[Footnote 5] In order to highlight only the important aspects of the algorithm, Figure 3.3 assumes a cache line size of one element.

3.4 Hardware Support for Adaptive Prefetching

As we argued in the last section, all that is needed to evaluate the effectiveness of a particular prefetch distance is information about how many useful prefetches are cancelled and an estimate of the memory latency. Here we show that keeping track of the former is straightforward, while estimating the latter is possible but requires keeping track of several events (using four counters). Moreover, we also show that we can get away without knowing the actual magnitude of the memory latency by computing the amount of processor stall time induced by late prefetches instead.

    for (i=0; i<m; i++)
        for (j=0; j<n; j++)
            a[j] += b[i+j];

    (a) Original Code

    for (i=0; i<m; i++) {
        /* prologue */
        for (j=0; j<PF_DIST; j++) {
            Prefetch(a[j]);
            Prefetch(b[i+j]);
        }
        /* body */
        for (j=0; j<n-PF_DIST; j++) {
            a[j] += b[i+j];
            Prefetch(a[j+PF_DIST]);
            Prefetch(b[i+j+PF_DIST]);
        }
        /* epilogue */
        for (j=n-PF_DIST; j<n; j++)
            a[j] += b[i+j];
    }

    (b) Static Software Pipelining

    pf_dist = /* first approx. dist. */
    for (i=0; i<m; i++) {
        /* prologue */
        for (j=0; j<pf_dist; j++) {
            Prefetch(a[j]);
            Prefetch(b[i+j]);
        }
        /* body */
        for (j1=0; j1<n-pf_dist; j1+=INC) {
            d_pf_dist = /* change in pf dist */
            if (d_pf_dist < 0) {
                /* decrease pf dist. */
                for (j=j1; j<j1+abs(d_pf_dist); j++)
                    a[j] += b[i+j];
                disp = abs(d_pf_dist);
            }
            if (d_pf_dist >= 0) {
                /* increase pf distance */
                for (j=j1; j<j1+d_pf_dist; j++) {
                    Prefetch(a[j+pf_dist]);
                    Prefetch(b[i+j+pf_dist]);
                }
                disp = 0;
            }
            pf_dist += d_pf_dist;
            ju = min(n-pf_dist, j1+INC);
            for (j=j1+disp; j<ju; j++) {
                a[j] += b[i+j];
                Prefetch(a[j+pf_dist]);
                Prefetch(b[i+j+pf_dist]);
            }
        }
        /* epilogue */
        for (j=n-pf_dist; j<n; j++)
            a[j] += b[i+j];
    }

    (c) Adaptive Software Pipelining

Figure 3.3: Static and adaptive software pipelining. Part (a) shows the original loop nest (Figure 3.1.(a)). Part (b) illustrates the use of static software pipelining to prefetch arrays a and b on all iterations of the outermost loop (Figure 3.1.(c)). Note that the prefetch distance PF_DIST remains constant for the whole duration of the pipeline.
In adaptive software pipelining (Part (c)) the loop implementing the pipeline body is broken into several execution segments (each of length IN C). At the beginning of each segment the effectiveness of the prefetch distance is evaluated and based on this an an increment or a decrement offset to the distance is computed (D_PF_DIST). The first two loops modify the prefetch distance in the direction of the change, while the third loop executes the rest of the computation under a new prefetch distance. For simplicity, we assume a cache line size of one array element. 33 Counting the fraction of cancelled prefetches is easy because the conditions deter mining these events are well defined. Hardware support for data prefetches normally consists of an explicit prefetch machine instruction, a lock-up free cache, and some kind of structure to keep track of outstanding prefetches. Whether a prefetch buffer is used to store the identify of pending prefetches, as it is done in the KSR1 and Cray T3D, or by using the cache itself to accomplish the same goal, it is just a m atter of implementation. Because using a prefetch buffer tend to limit the num ber of outstanding prefetches to a small number, our implementation of prefetching assumes the existence of control bits in the cache tags to indicate that the cache line has been reserved and that no other prefetch mapping to the same line can be issued until the prefetch completes. The data already residing in a reserved cache line is not displaced until the prefetched data arrives, so it continues to be available to the CPU. Hence, counting cancelled prefetches only takes one register which is incremented every time a prefetch instruction is executed whose associated address maps to a cache line previously reserved by an in-flight prefetch. An extra counter keeps track of the total number of useful prefetches executed. A useful prefetch, in contrasts to a redundant one, is one that is necessary in order to cover a pos sible miss, i.e., the associated data is not present in the cache when the prefetch is executed. The software agent controlling the prefetch distance reads these two counters at the end of a software subpipeline and computes the fraction of cancelled prefetches (frac_cancel_pref). Estimating the magnitude of the memory latency is relatively easy when the cache is not lock-up free. In this case there is at most one outstanding prefetch per processing node, so only one register (latency register) is needed to accumulate the individual times required to service each memory requests. This is done by subtracting the value of clock time when the prefetch is issued and then adding it when the data arrives. By taking the ratio between the latency register and a second register that counts the total number of requests it is possible to estimate the remote memory latency. The situation gets complicated with a lock-up free cache because there can be an arbitrary number of in-flight prefetches at any given time. The use of a latency counter, similar to the one proposed above, will contain, at a given time, not only the 34 sum of the latency of all prefetches that have already finished but also the negative sum of issue times of all the in-flight prefetches6. This means that the only time when the latency counter can be used directly to compute the average latency is when the counter is started and stopped at regenera tion points, i.e., zero outstanding prefetches. 
A regeneration point can be created by simply stalling the processor and then waiting for all prefetches to complete. Doing this, however, does not make sense, because it defeats the purpose of prefetching, which is to eliminate CPU stall time. Another alternative is to keep information in the latency counter only for requests that have completed. Unfortunately, this requires using some kind of buffer to keep track of all issue times of in-flight requests. If the number of entries in the buffer is not large enough to keep track of all outstanding prefetches, it is possible to overflow this buffer. Furthermore, a simple FIFO queue is not sufficient for this purpose, because prefetches do not necessarily arrive in the same order in which they were issued.

Although memory latency can be estimated without much hardware support, it is easier, and equally effective, to use instead the amount of stall time to adjust the prefetch distance. In machines that implement data prefetching in a similar way as our architectural model, a processor stalls only when it executes a read that misses in the cache.[7] Because there is at most one outstanding read, it is possible to compute the average stall time per miss by using two registers: one that counts the number of misses, and another that counts the total stall time. Furthermore, by using the in-flight control bits it is possible to distinguish between misses which are not covered by prefetches and misses with residual latencies resulting from delayed prefetches. Therefore, independent sets of counters can be used to keep track of how much stall time is caused by either source. In the next section we show how the fraction of delayed prefetches (frac_late_pref) and the fraction of cancelled prefetches (frac_cancel_pref) are used to adjust the prefetch distance.

[Footnote 6] The situation gets even more complicated when the latency counter is re-initialized in the middle of the software pipeline. In this case the latency counter will also contain the sum of arrival times of all prefetches which were outstanding when the counter was cleared.

[Footnote 7] We assume that writes go directly to a write buffer where they are retired without having to stall the processor.

3.5 Algorithm for Adapting Prefetch Distance

In Section 3.2 and in Figure 3.2.(b) we illustrated why in software prefetching it is more important to decrease the prefetch distance to avoid experiencing prefetch cancellation than it is to increase it to cover more memory latency. Therefore, any effective adaptive algorithm for the prefetch distance should give priority to the former over the latter, as the following function shows.

    int Delta_Pref_Distance (frac_cancel_pref, frac_late_pref)
    double frac_cancel_pref, frac_late_pref;
    {
        if (frac_cancel_pref > alpha_cancel_prefs)
            return (-beta_cancel_prefs * PF_DIST);   /* shrink the distance */
        if (frac_late_pref > alpha_late_prefs)
            return (beta_late_prefs);                /* grow the distance   */
        return (0);                                  /* leave it unchanged  */
    }

This function computes the magnitude of the increment or decrement used to adjust the prefetch distance. The particular values for alpha_cancel_prefs, alpha_late_prefs, beta_cancel_prefs, and beta_late_prefs are statically determined at compile time. In Section 3.7, we show results that indicate that the best strategy to follow is to reduce the prefetch distance by half (beta_cancel_prefs = 1/2) when prefetch cancellation occurs and to increase it by only one iteration (beta_late_prefs = 1) when prefetches arrive late. The values of alpha_cancel_prefs and alpha_late_prefs determine how sensitive the adaptive scheme is.
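As an illustration of how the placeholder "change in pf dist" in Figure 3.3.(c) might be filled in, the sketch below wires the function above to concrete parameter settings. The alpha thresholds are invented for the example (the dissertation only says they are fixed at compile time), the beta values follow the halve-on-cancellation, grow-by-one-iteration policy just described, and sample_prefetch_stats() is the hypothetical sampling routine sketched in Section 3.3.

    /* Assumed compile-time settings for the adaptation parameters.  The alpha
       thresholds are illustrative guesses; the beta values match the policy
       described in the text. */
    double alpha_cancel_prefs = 0.05;   /* react if >5% of useful prefetches are cancelled */
    double alpha_late_prefs   = 0.10;   /* react if >10% of prefetches arrive late         */
    double beta_cancel_prefs  = 0.5;    /* cut the prefetch distance in half               */
    double beta_late_prefs    = 1.0;    /* grow the distance by one iteration              */

    /* Called at the top of each segment of the controlling loop in
       Figure 3.3.(c): a negative result shrinks the software pipeline,
       a positive one issues extra prefetches, and zero leaves it alone. */
    void adjust_distance_for_next_segment(int *d_pf_dist)
    {
        double frac_cancel_pref, frac_late_pref;

        sample_prefetch_stats(&frac_cancel_pref, &frac_late_pref);
        *d_pf_dist = Delta_Pref_Distance(frac_cancel_pref, frac_late_pref);
    }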
The exponential decrease policy for the prefetch distance highlights the importance of eliminating prefetch cancellation even if the amount of covered latency is reduced.

3.6 Simulation Methodology

This section presents the simulation framework used to study the performance of adaptive prefetching. In Sections 3.6.1 and 3.6.2, we present the architectural assumptions and the simulation environment, respectively. In Section 3.6.3 we describe the benchmark programs.

    Machine Parameter    Value          Machine Parameter     Value
    Output buffer        32 elements    Memory access time    35 cycles
    Switch delay         2 cycles       Cache line size       16 bytes
    Wire delay           2 cycles       Cache size            4 Kbytes
    Channel width        2 bytes        Set associativity     1

Table 3.1: Processor, cache, and interconnect parameters representing the baseline architecture

3.6.1 Architectural Assumptions

We simulate the architectural model of a NUMA multiprocessor consisting of 16 processing nodes. Processing nodes are interconnected by a bi-directional mesh. Each node consists of a processor, a cache (direct-mapped), a local portion of shared memory, a directory, and a network interface linked by a bus. The physical memory is equally distributed among the nodes and a relaxed memory consistency model [22] is enforced by hardware. Cache coherence for shared memory is maintained using a distributed, directory-based, invalidation-based protocol.

Because the use of a prefetch buffer limits the maximum number of outstanding prefetches to a small number, we decided to modify the cache in a small way to maintain information about in-flight prefetches. This is done by allocating a cache frame to a prefetch at issue time and setting the status bits in the frame itself to indicate that the prefetch is in-flight (i.e., a transit state). Reserving a cache frame does not invalidate the information residing in the frame; it is still available to the processor until the prefetch completes. If a new prefetch with a different address maps to the same cache frame, the second prefetch is cancelled, and if a load maps to the same frame, the original prefetch is cancelled. Because the number of cache frames can be much larger than the number of entries in a fully associative prefetch buffer, our scheme can support a larger number of outstanding prefetches. This is important in order to evaluate the full potential of static and adaptive prefetch schemes.

Table 3.1 summarizes the default processor, cache, and interconnect parameters that were used in the simulations. With these parameters, the average read latencies without contention are: 1 cycle for a cache hit, 120 cycles for a 2-hop global access, and 140 cycles for a 3-hop global memory access. These parameters represent our baseline architecture. In Section 3.7.3, we investigate the effect of various architectural variations such as larger caches, different associativities, a larger machine (64 processors), a larger network latency, and a sequential consistency model.

    Program   Description                                Input Data Set
    Jacobi    Successive Over-Relaxation                 522x1000
    LU        LU decomposition                           260x260
    MMul      Matrix multiplication                      260x260
    Ocean     Ocean simulation with Multi-grid Solver    258x258

Table 3.2: Application Characteristics

3.6.2 Simulation Environment

Non-memory machine instructions are assumed to take one cycle. Prefetch instructions, which were inserted manually, take 2 cycles to execute. We used an execution-driven simulator called Trojan [66], which is an extended version of MIT Proteus [52].
It allows the detailed modeling of the various hardware components. All components, except the network, are explicitly simulated; the amount of network contention suffered by messages is computed analytically using the model proposed by Agarwal [2], 3.6.3 Applications In order to understand the relative performance of adaptive prefetching, we use two different types of applications. In the first type, we use a synthetic benchmark to investigate in detail the effectiveness and limitations of adaptive prefetching. In these experiments, in particular, we are interested in obtaining good approximations for parameters ficancei-pre/s and Aate_pre/s- In the other experiments we use real applications, some of which are taken from SPLASH2 [60], to quantify the amount of improvement over static approach that our algorithm offers. 38 Table 3.2 presents the programs and their input data sets8. Each program ex hibits different class of computations in a sense that each program suffers from different types of cache interferences. Jacobi program suffers from self-interferences. Mmul and LU programs occasionally suffer from cross-interferences in pairwise ar rays, while Ocean suffers from cross-interferences on several arrays. To avoid patho logical cache conflicts, we manually changed the alignment of some matrices. 3.7 Experim ental R esults and D iscussion In this section we present experimental results that show that an adaptive scheme controlling the prefetch distance is significantly more effective in eliminating stall time than any other approach based on static information. The results we present here were obtained using two different types of experiments as discussed in Section 3.6. Section 3.7.1 presents our experimental results for a synthetic benchmark. In Section 3.7.2, we give experimental results for real applications with default ma chine parameters. In Section 3.7.3 and 3.7.4, we discuss experimental results with architectural variations and adaptive prefetching variations, respectively. 3.7.1 Controlled Experiments The synthetic benchmark partitions the machine nodes in two disjoint set: the control group and interference group. Nodes in the control group (control nodes) execute a loop nest that generates remote requests at a constant rate and use software pipelining, either with a static or adaptive prefetch distance, to hide as much latency as possible from the processors. In addition, as the execution proceeds some of the array references in the loop nest suffer from cache cross-interference at arbitrary array positions. The control nodes cannot remap the arrays to avoid the interference, but can only react to this interference by changing the magnitude of the prefetch distance. 8for Jacobi and MMul, different size of data sets can result in quite different (better or worse for adaptive prefetching) results. However, we chose these sizes to show the extremely good case (Jacobi) and not so good case (MMul). 39 In contrast, nodes in the interference group (interference nodes) work together to generate a certain amount of bisection network traffic by controlling the rate at which they generate remote requests. The traffic is maintained constant for some interval of time and then is changed to a new value, which is unknown to the control nodes, according to a predefined script running on the interference nodes. 
The key observation here is that by changing the amount of traffic in the interconnect, the interference nodes modify the latency of the remote requests issued by the control nodes. Furthermore, the control nodes are unaware of the collective activities of the interference nodes and cannot anticipate their actions.

By partitioning the nodes into two disjoint groups, we not only introduce a degree of freedom into the experiments, but also allow ourselves to explore how adaptive prefetching would behave in a heterogeneous environment. Some people see the future of parallel computing not as programs executing in a single multiprocessor, but as a collection of threads and processes running on a network of machines (uniprocessors and multiprocessors). Although it is unclear today whether peer-to-peer latency and available bandwidth will make this kind of parallel computing cost effective, current and near-future technological advances seem to point in this direction. Adaptive execution represents an attractive approach in an environment like this because it allows threads/processes to adapt their execution based only on the response of the environment, without having to know any details about the machines or their interconnections.

Using the synthetic benchmark we want to answer the following questions: 1) how much stall time can be eliminated using the static and adaptive schemes; 2) what are the best values for the adaptive parameters (alpha_cancel_prefs, alpha_late_prefs, beta_cancel_prefs, and beta_late_prefs); 3) how fast can the adaptive algorithm react to changes in program and machine conditions; and 4) how much the feedback effect exhibited by prefetching affects the convergence and stability of an adaptive scheme.

The last point requires further discussion. First, consider the following situation. Let N1 and N2 (N = N1 + N2) be the number of control and interference nodes, respectively. It is clear that when N1 = 1 and N2 = N - 1, the effect of the interference nodes on the control node (one in this case) is maximized, while the feedback effect coming from the latter is minimized. This allows us to test the convergence of adaptive prefetching independently of stability effects. This is because the actions of the control node have a negligible effect on the operating conditions. As the number of control nodes increases, however, the feedback effect induced by the collective actions of the control nodes increases, while the effect of the interference nodes decreases. Therefore, the ability of the control nodes to converge to the optimal prefetch distance and the variability that these actions have on the optimal point can be measured.

3.7.1.1 Results Using a Single Control Node

The results shown in this section are for a single control node that is under the control of various static and adaptive policies. Here the number of interference nodes is 63. The static policies (denoted in the figures as SP-i, where i is the prefetch distance) use a constant value ranging from 50 to 1000 cycles for the prefetch distance. We also simulate two adaptive policies. The first policy (AP-1) ignores cancelled prefetches and only uses the magnitude of the latency, while the second (AP-2) takes into account both the amount of stall time and prefetch cancellation to adjust the prefetch distance using the algorithm given in the last section. In particular, AP-2 reduces the prefetch distance to one half of its current value (beta_cancel_prefs = 0.50) when a significant number of prefetches were cancelled.
At the same time, if prefetches arrive late and there is no prefetch cancellation, then the prefetch distance is increased by one iteration of the innermost loop (beta_late_prefs = 1). These parameters were the ones that gave the best results.

Figure 3.4 shows the remote latency pattern generated by the interference nodes as a function of the channel utilization. As explained above, the interference nodes cooperate between themselves to maintain a constant channel utilization for some amount of time by controlling their remote request rate. As the figure shows, varying the channel utilization within the range of 0.18 to 0.90 causes the remote latency to vary from 100 to 425 cycles.

[Figure 3.4: Remote latency pattern induced by interference nodes (remote latency in cycles versus channel utilization).]

Figure 3.5 shows the average stall time that the static and adaptive prefetching schemes experience as a function of the interference pattern. The results clearly show that no static prefetch distance can perform well over all possible latency values. The adaptive schemes, however, are capable of adjusting to the changes to keep the stall time close to its minimum. This effect is more evident on AP-2 than it is on AP-1. Because AP-1 ignores prefetch cancellation, it can only perform as well as the best static scheme. Scheme AP-2, on the other hand, is capable of detecting when attempts to increase the prefetch distance result in higher levels of prefetch cancellation. The correct action here is to reduce the distance until no significant amount of cancellation occurs.

[Figure 3.5: Average stall time per prefetch as a function of prefetching policy (SP-100, SP-500, SP-1000, AP-1, AP-2) and channel utilization.]

Furthermore, it is well known that the amount of latency experienced by remote requests is a convex function of the amount of contention in the network.[9] Figure 3.5 shows that the amount of stall time induced by all static prefetching schemes and by AP-1 is also a convex function of network contention. In contrast, the stall time induced by AP-2 is a concave function. This implies that even when the network is close to saturation, the amount of stall time suffered by a node under AP-2 approaches a maximum value.

The important conclusion to draw from Figure 3.5 is not that adaptive schemes are better than static ones just because the former are able to adjust to large variations of the latency. In reality the latency suffered by a program can be either small or large, but in most cases it tends to stay within a narrow range. The real problem is that compilers cannot predict sufficiently well the actual runtime value of the latency. Furthermore, using prefetching only makes things worse. To illustrate this last point, consider using prefetching on a program that spends 80 percent of its time waiting for remote requests to complete. Assume that under these conditions the average channel utilization and remote latency are 0.30 and 110 cycles, respectively. Now, assuming everything else remains equal, if prefetching is 50% effective in reducing stall time, then the channel utilization and remote latency will increase to 0.50 and 135 cycles. But if prefetching is 80% effective, the channel utilization and remote latency will jump to 0.85 and 300 cycles.
It is this uncertainty in the effectiveness of prefetching that makes it impossible for the compiler to select the correct prefetch distance in every case. Adaptive prefetching, on the other hand, is guaranteed to produce the best performance because it adjusts itself to the particular runtime conditions.

Figure 3.6 gives results on the normalized total execution time. Here, the stall time has been broken into two components: delay caused by the residual latency (pf-lissue: stall time induced by late prefetches) and delay caused by cache misses resulting from cancelled prefetches (pf-cancel). In addition, the adaptive schemes include the extra overhead involved in adjusting the prefetch distance, which consists of two parts: reading the counters in the performance monitor, and increasing/decreasing the prefetch distance. The former is strongly dependent on the delay incurred by a thread in reading the performance monitor counters. Here we assume a tightly coupled hardware monitor whose counters can be read directly from a user thread without the intervention of the operating system. The results show that by using a complete adaptive policy (AP-2) we can minimize both components of the stall time. Furthermore, the overhead incurred in running the adaptive algorithm does not overly degrade performance. A partial adaptive scheme (AP-1) is able to eliminate the stall time induced by late prefetches, but not that of cancelled prefetches. The static schemes, on the other hand, can either eliminate the residual delay or the amount of cancelled prefetches, but not both.

[Footnote 9] A function is convex if its second derivative is positive. As the network approaches saturation the resulting latency grows exponentially.

[Figure 3.6: Effect of static and adaptive policies on the execution time (components: apf-overhead, pf-overhead, pf-lissue, pf-cancel).]

The effectiveness of a full adaptive algorithm is strongly dependent on parameters beta_cancel_prefs and beta_late_prefs. Intuitively, reducing the occurrence of cancelled prefetches is more important than trying to cover all the latency. This was illustrated in Figure 3.2.(c), where we showed that when prefetches suffer from residual latency, the overlapping of prefetches amortizes the corresponding stall time of the leading references amongst all outstanding prefetches. This is why the algorithm we presented in Section 3.3 increases the prefetch distance conservatively by a small constant, but decreases it rapidly when a significant amount of prefetch cancellation occurs. We ran experiments varying both beta_cancel_prefs and beta_late_prefs, and the results are shown in Figure 3.7. As expected, the best results were obtained when the prefetch distance is decreased to 50% in the former and increased as little as possible, by only one iteration, in the latter. The results in Figure 3.7 show that small deviations from these values have a small effect on the overall effectiveness of the adaptive scheme.

[Figure 3.7: Effect of beta_cancel_prefs (as a fraction of the prefetch distance) and beta_late_prefs (in iterations) on the stall-time components pf-lissue and pf-cancel.]

[Figure 3.8: Results for multiple control nodes (4 control nodes and 60 interference nodes); execution time broken into apf-overhead, pf-overhead, pf-lissue, pf-cancel, and busy time.]

3.7.1.2 Results Using Multiple Control Nodes

The results of the previous section show that a single control node implementing an adaptive scheme can effectively adjust the prefetch distance to account for changes in the remote memory latency and the amount of prefetch cancellation.
3.7.1.2 Results Using Multiple Control Nodes
The results of the previous section show that a single control node implementing an adaptive scheme can effectively adjust the prefetch distance to account for changes in the remote memory latency and the amount of prefetch cancellation. To investigate whether the collective actions of the control nodes make the system unstable and prevent them from converging to their local optimal prefetch distance, we ran experiments in which we increased the number of control nodes from one to 32 (on a 64-node multiprocessor). The results are summarized in Figures 3.8-3.11. The figures show that increasing the number of control nodes does not reduce in any significant way the amount of stall time that AP-2 is capable of eliminating. In fact, static prefetching schemes suffer more from the collective feedback effect than the adaptive schemes. This is because adaptive schemes are quite effective in reacting to changes in the memory latency independently of the source of the change; the feedback effect is just one extra factor.

Figure 3.8: Results for multiple control nodes (4 control nodes and 60 interference nodes)

Figure 3.9: Results for multiple control nodes (8 control nodes and 56 interference nodes)

Figure 3.10: Results for multiple control nodes (16 control nodes and 48 interference nodes)

Figure 3.11: Results for multiple control nodes (32 control nodes and 32 interference nodes)

3.7.2 Effectiveness of Adaptive Prefetching in Complete Applications
In this section, we evaluate the effectiveness of adaptive prefetching (APF) for real benchmarks by comparing their performance with that of static prefetching (SPF) with various prefetch distances. We start by presenting in Table 3.3 some relevant cache and network statistics on the behavior of prefetching. The table shows miss rates, prefetch cancellation rates, average miss penalties, and speedups. The latter are reported relative to policy SP-200. The miss rates account for all load references, independently of whether or not a prefetch was issued in an attempt to eliminate a miss. Prefetch cancellation (pf-cancel) corresponds to the fraction of all prefetches issued that are ineffective in covering latency due to cache conflicts. The table clearly shows that adaptive prefetching minimizes not only the cache miss rates, but also the prefetch cancellation rates. Consequently, adaptive prefetching manages to improve performance relative to the static policies from 10% (MMul) to more than 60% (Jacobi).
Figure 3.12: Overall Performance of Adaptive Prefetching

Benchmark   Policy    Miss rate (%)   Pf-cancel rate (%)   Avg. miss penalty (cycles)   Speedup
Jacobi      SP-100    8.3             17.3                 219                          1.14
            SP-200    11.9            66.0                 179                          1.00
            SP-300    11.9            66.0                 165                          1.08
            AP-2      6.2             5.8                  192                          1.63
LU          SP-100    10.8            2.1                  67                           0.95
            SP-200    6.7             3.5                  96                           1.00
            SP-300    5.1             5.1                  131                          0.99
            AP-2      3.6             1.3                  131                          1.16
MMul        SP-100    6.3             2.8                  132                          0.95
            SP-200    3.6             4.5                  211                          1.00
            SP-300    3.0             5.6                  256                          1.00
            AP-2      2.8             2.6                  229                          1.09
Ocean       SP-100    13.6            15.5                 154                          1.00
            SP-200    12.3            16.7                 169                          1.00
            SP-300    12.7            19.2                 172                          0.97
            AP-2      11.3            7.0                  146                          1.18

Table 3.3: Performance comparison between static and adaptive prefetching

Figure 3.13: The effectiveness of adaptive prefetching for Jacobi

Figure 3.14: The effectiveness of adaptive prefetching for LU

Figure 3.15: The effectiveness of adaptive prefetching for MMul

Figure 3.16: The effectiveness of adaptive prefetching for Ocean

Measuring the effectiveness of adaptive prefetching only by looking at overall speedups is somewhat misleading, because the maximum amount of improvement that prefetching can produce depends on many factors not affected by whether prefetching is static or adaptive. For example, if the original program does not suffer from high miss ratios, or if not enough prefetches are inserted to cover misses, adaptive prefetching cannot overcome this deficiency. Figure 3.12 tries to factor this out by focusing on the fraction of stall time not covered by static prefetching that can be successfully eliminated by using an adaptive scheme instead. The stall time induced by prefetching has two sources: 1) cache misses occurring as a result of cancelled prefetches (pf-cancel) and 2) prefetches that complete but arrive late (pf-lissue). In Figure 3.12 we see that, relative to SP-200, AP-2 can eliminate between 20% and 45% of the stall time suffered by static prefetching. Moreover, adaptive prefetching reduces not only the stall time coming from cancelled prefetches, but also that induced by late prefetches as well.

Figures 3.13-3.16 show the relative execution times for all the prefetching schemes normalized with respect to no prefetching (no-pf). The normalized execution time is broken down as follows: (1) busy time spent executing instructions (busy), (2) stall time, and (3) the overheads of prefetching and adaptive prefetching. The stall time is further decomposed into the following components: (2.1) stall time caused by not issuing a prefetch (no-pf), (2.2) time stalled due to prefetch cancellations (pf-cancel), (2.3) stall time because of late prefetch issue (pf-lissue), (2.4) stall time for synchronization operations such as barriers and locks (sync), and (2.5) stall time due to the output buffer being full (obuffer-full). As shown in the figures, compared with SP-200, the speedup of adaptive prefetching ranges from 9% (MMul) to 63% (Jacobi). Table 3.3 indicates that this is accomplished by reducing both the cache miss rate and the prefetch cancellation rate.
Moreover, although in most cases static prefetching can hide the long memory latency, it induces significant stall time resulting from prefetch cancellation. Furthermore, as the figures show, the effectiveness of prefetching depends on the prefetch distance, and this changes from program to program. For example, the best static prefetch distance for Jacobi is 100 cycles, whereas for LU and MMul it is 200 cycles.

Having analyzed the benefits of adaptive prefetching, we now focus on the costs. The overhead of adaptive prefetching can be decomposed into the following: (1) the overhead of reading the performance monitor registers; (2) the overhead of adjusting the prefetch distance; and (3) the overhead of executing the software agent. As mentioned in Section 2.2.3, the actual cost of adaptive prefetching depends primarily on how the software agent and the performance monitor interact, i.e., polling, interrupt, or active message, and on the frequency of interaction. There is a tradeoff between the measuring period and the overhead: a longer (alternatively, shorter) period means less (more) adaptation with less (more) overhead. In our experiments, we used a static polling scheme for checking the performance monitor with an overhead of 30 cycles. Our simulation results show that this scheme incurs an overhead of no more than 5% of the total execution time. This is quite low considering that adaptivity improves performance over the static schemes by a much larger amount (10% to 40%).

The good news is that this overhead can be completely or partially overlapped with the stall time components. If the processor is blocked as a result of a cache miss or a synchronization operation, it is simply stalled because we use static polling.10 Even so, the stall time due to late prefetch issue can help in reducing the overhead to some extent. For example, if the processor has to stall 20 cycles because of a late prefetch issue, the overhead is overlapped by the stall time and only 10 cycles are added to the adaptive prefetching overhead, since after 30 cycles the processor no longer stalls. The stall time caused by a prefetch cancellation does not help in reducing the overhead because, after finishing the extra work, the processor still has to issue the cancelled memory access. In our experiments, around 30% of the overhead is overlapped by pf-lissue.

In summary, our results show: (1) if the prefetch distance is too large (alternatively, too short), the number of cancelled prefetches increases (decreases), while the number of late prefetches decreases (increases); (2) the amount of prefetch cancellation has a significant effect on performance; (3) the best static prefetch distance is application dependent and is affected by several machine parameters; (4) the overhead of adaptive prefetching is small; and (5) adaptive prefetching works well when it is difficult to predict the best static prefetch distance.

10 The processor is not utilized for reading the performance monitor and adjusting the prefetch distance.

Figure 3.17: Cache Size Variation

3.7.3 Effects of Architectural Variations
In this section, we report on the impact of several architectural variations on the performance of adaptive prefetching. We focus on the effects of changing the cache size, set associativity, machine size, network latency, and memory consistency model.
Here we compare the performance of adaptive prefetching relative to that of static prefetching directly (i.e., the execution time of adaptive prefetching is normalized to that of static prefetching) to see more clearly how much performance improvement is achieved by adaptive prefetching. For static prefetching, because a prefetch distance of 200 was best in most cases, as shown in the previous subsection, this distance is used except for the variations of network latency and machine size, where 300 is used to cover the increased latency.

3.7.3.1 Cache Size
In our baseline machine the cache size was set to 4 Kbytes. In order to test whether adaptive prefetching continues to work well on larger caches, we ran additional experiments in which the cache size was increased to 16 Kbytes. For the larger cache size, we used the same data sets because we want to see how performance changes with a larger cache. The results of our experiments are presented in Figure 3.17. The results show that for Jacobi, AP-2 reduces the execution time by 39% on a 4-Kbyte cache and that this number drops to 26% on a 16-Kbyte cache. This reduction is the result of a corresponding reduction in the amount of self-interference11 in the program, which in turn reduces the number of cancelled prefetches for both static and adaptive prefetching. Because static prefetching suffers more from self-interference, it benefits more from a larger cache. On Ocean, the relative performance does not change significantly when the cache size increases. The reason is that the pf-cancel component caused by cross-interferences is not sensitive to the size of the cache, because the relative positions of the arrays in memory remain the same.

11 Note that, as discussed in Section 3.6, Jacobi suffers from self-interferences while Ocean suffers from cross-interferences.

3.7.3.2 Cache Associativity
The next variation is changing the cache associativity from direct-mapped to 2-way set associative with a random replacement policy. Figure 3.18 shows the experimental results. Because increasing the set associativity decreases the probability of cache conflicts, the pf-cancel component goes down in all programs, resulting in less performance improvement for AP-2. For Jacobi, however, there is still a lot of prefetch cancellation under static prefetching, caused by cache conflicts among more than two array references. Further increases in the associativity may eliminate these conflicts for Jacobi, but at the cost of a corresponding increase in hardware complexity and a performance sacrifice for other applications [69].12

12 Wilton and Jouppi have reported that increasing the associativity from direct-mapped to 2-way and 4-way increases the access time of a 16-Kbyte cache by 24% and 30%, and for a 64-Kbyte cache the corresponding increases are 19% and 23%, respectively [69].

3.7.3.3 Machine Size
Increasing the machine from 16 to 64 nodes has a direct impact on latency by increasing both the average distance between nodes and the bisection bandwidth. We used larger data sets in these experiments to make sure that every processor is assigned a similar amount of work.
Figure 3.18: Cache Associativity Variation

Figure 3.19: Number of Processors Variation

For LU and Ocean, 460 by 460 and 514 by 514 matrices are used instead of the 260 by 260 and 258 by 258 matrices used for 16 processors, respectively. As the number of processors increases, the variance of the network latency becomes larger since the average hop count goes up. Thus, it is interesting to see how the effectiveness of APF is affected by the variation in the number of processors. Figure 3.19 shows the results. We see that the pf-cancel component increases under static prefetching (from 31% to 44% for LU and from 47% to 49% for Ocean) with 64 processors, which means that prefetch cancellation is more critical for a larger number of processors. In contrast, the pf-lissue component increases for AP-2 (from 21% to 27% for LU and from 13% to 15% for Ocean) with 64 processors. This results from the fact that AP-2 decreases the prefetch distance so that prefetch cancellation does not occur, which can result in increasing pf-lissue. Due to the longer network latency, this strategy affects pf-lissue more with 64 processors than with 16 processors. Overall, even if pf-lissue increases on a larger machine, AP-2 is more effective as the number of processors increases.

3.7.3.4 Network Latency
Another machine variation that also affects latency is changing the characteristics of the interconnect. Here, instead of assuming a switch and wire delay of 2 cycles each, we double the delay to 4 cycles. Doing this increases a no-contention 3-hop remote memory access from 140 to 230 cycles. Figure 3.20 shows the effect of this change on the execution time for Ocean and LU. We see that AP-2 works better under longer network latency, as was the case with 64 processors. As the network latency increases, as a result of a larger hop count or increased wire (or switch) delays, the pf-cancel component becomes larger and thus there is more room for improvement.

3.7.3.5 Memory Consistency
Finally, we study the execution time effects under sequential consistency. The main difference between sequential and release consistency is that, under release consistency, write stall time can be overlapped with computation by using a write buffer. Because the processor is stalled on a write miss under sequential consistency, the stall time components (pf-lissue and no-pf) should increase.

Figure 3.20: Network Latency Variation (N1 = default, N2 = doubled switch and wire delay)

Figure 3.21: Memory Consistency Variation

Figure 3.21 presents the resulting execution times for Jacobi and LU. These show that, for Jacobi, there is little difference between release and sequential consistency.
On this program prefetches are issued for all memory references and most of the stall time is caused by prefetch cancellation, which is not affected by the memory consistency model. For LU, however, the performance improvement of AP-2 under sequential consistency is not as good as it is under release consistency. The reason is that because pf-cancel decreases13 but no-pf and pf-lissue increase, there is not much room left for AP-2 to improve performance. If we increase the prefetch distance to cover pf-lissue, performance gets worse because of the increased pf-cancel.

13 The absolute amount of pf-cancel is the same, but its portion decreases because of the increase of the other components.

In summary, the effectiveness of AP-2 is affected by the architectural variations. Changes to the cache size and set associativity reduce the need for adaptivity because there are fewer cache conflicts. On the other hand, increasing the network latency, either by increasing the machine size or by slowing the network, improves the effectiveness of adaptivity.

3.7.4 Fixed and Adaptive Polling
So far we have assumed a fixed frequency of polling in all our experiments. The main advantage of this static polling is its simplicity. The disadvantages of the scheme are: (1) overhead is incurred even when machine conditions remain constant, and (2) the response to sudden changes in machine conditions is delayed. To solve these problems, we can consider either interrupt-driven or adaptive polling. As discussed in Section 2.2.3, interrupts increase the hardware complexity significantly and make it difficult to implement the software agent using interrupt handlers. Here we discuss a semi-adaptive polling scheme in which the frequency of polling is dynamically changed.

Adaptive polling has two goals: to minimize the overhead and to adapt its response to sudden changes in machine conditions. One way of achieving these goals is to check the performance monitor more frequently when there are changes in machine conditions, and to reduce the checking frequency as long as conditions remain constant. Specifically, every time we detect an increase in the number of late or cancelled prefetches we double the polling frequency, and if conditions are constant we halve it. To avoid thrashing, i.e., excessive polling, we set a limit on the maximum polling frequency. Likewise, we maintain a minimum polling frequency even when machine conditions remain constant for large intervals. We also use a heuristic designed to help detect and react to prefetch cancellation early in the execution of the innermost loop.

Figure 3.22: Effects of Adaptive Polling

The comparison between static and adaptive polling is shown in Figure 3.22. The performance of adaptive polling is very similar to that of static polling. On LU and MMul, adaptive polling decreased the amount of pf-cancel by 0.5% and 0.2%, and it increased pf-lissue by 0.4% and 0.3%, respectively. The reduction in overhead resulting from using the adaptive scheme was only 1.0% and 1.7%. We believe the main reason why adaptive polling was not more effective has to do with the behavior of our benchmarks. These programs tend not to change rapidly from one operating regime to another and stay in steady state for a sufficient amount of time. Consequently, a fixed-polling scheme has sufficient time to react without incurring too much overhead.
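As a concrete illustration of the interval-adjustment heuristic just described, the following minimal C sketch doubles the polling rate when the late or cancelled prefetch counts grow and halves it otherwise, clamped between two bounds. The bound values and all names are assumptions made for illustration; they are not the values used in the simulations.

/* Semi-adaptive polling: shorten the interval (poll more often) when
 * conditions change, lengthen it while they stay constant.  The clamps
 * prevent thrashing and guarantee a minimum polling rate. */
#define MIN_POLL_INTERVAL   100      /* cycles: caps the maximum polling rate */
#define MAX_POLL_INTERVAL 10000      /* cycles: keeps a minimum polling rate  */

static unsigned next_poll_interval(unsigned interval,
                                   unsigned late_prefs, unsigned cancelled_prefs,
                                   unsigned prev_late, unsigned prev_cancelled)
{
    if (late_prefs > prev_late || cancelled_prefs > prev_cancelled) {
        /* Machine conditions changed: poll twice as often. */
        interval /= 2;
        if (interval < MIN_POLL_INTERVAL)
            interval = MIN_POLL_INTERVAL;
    } else {
        /* Steady state: back off to reduce overhead. */
        interval *= 2;
        if (interval > MAX_POLL_INTERVAL)
            interval = MAX_POLL_INTERVAL;
    }
    return interval;
}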
3.8 Chapter Summary
In this chapter we have studied how adaptive execution can be applied to software prefetching algorithms to make them more effective in covering latency and reducing cache pollution. We have presented a practical adaptive prefetching algorithm capable of changing the prefetch distance of individual prefetch instructions relative to their corresponding loads by combining information collected about the number of late prefetches and the number of cancelled prefetches. We have shown that in order to minimize stall time it is imperative to give priority to the latter over the former. Our simulation results clearly show that adaptive prefetching can improve the performance of static schemes by a significant amount, ranging from 10% to 40%. Even more important is the 20% to 45% reduction in stall time that adaptive prefetching provides over the static schemes.

Chapter 4
Adaptive Granularity

4.1 Problem Statement

4.1.1 Fixed Granularity
The shared memory paradigm provides a simple communication abstraction and thus greatly simplifies parallel programming, particularly for problems that exhibit dynamic communication patterns or fine-grain sharing. Furthermore, most hardware distributed shared memory (DSM) machines [3, 18] achieve high performance by allowing the caching of shared writable data as well as read-only data, thus exploiting more locality. One disadvantage of this kind of cache-coherent machine (e.g., Stanford DASH [18], MIT Alewife [3]) is that it is restricted to using only a fixed fine granularity (i.e., a cache line for loads and stores) as the unit of communication. While this works well for fine-grain data, bulk transfer of data can be more effective for some applications. Bulk transfer has several advantages over fine-grain communication: fast pipelined data transfer, overlap of communication with computation, and replication of data in local memory [71].

4.1.2 Arbitrarily Variable Granularity
To exploit the advantages of both fine-grain and coarse-grain communication, more recent shared memory machines such as Stanford FLASH and Wisconsin Typhoon have begun to integrate both models within a single architecture and to implement the coherence protocol in software rather than in hardware. In order to use the bulk transfer facility on the machine, several approaches such as explicit messages [30, 71] and a new programming model [33] have been proposed.

With the explicit message approach, message passing communication primitives such as send-receive or memory-copy are used selectively to communicate coarse-grain data, and load-store communication is used for fine-grain data in an application [71]. In other words, two communication paradigms coexist in the program and it is the user's responsibility to select the appropriate model in each case. Another approach to exploit the advantages of bulk transfer is to use a new programming model. One example of this approach is the Hybrid protocol [33], in which programmer-supplied annotations are used to support a variable size granularity. The Hybrid protocol consists of a standard hardware DSM protocol that is used for fine-grain data and a software (region-based) DSM protocol that is used for coarse-grain data. Though both approaches support an arbitrarily variable granularity and thus may potentially lead to large performance gains, they present several problems such as programmability and increased hardware complexity.
As we can see in the above examples, a protocol that attempts to support arbitrarily variable grains either needs a new programming model (e.g., the Hybrid protocol [33], object-based [6, 24] and region-based [56] software DSM protocols) or requires the user to use an explicit message-passing paradigm. In other words, there is a tradeoff between the support of an arbitrarily variable granularity and programmability. The main reason for this tradeoff is that supporting an arbitrarily variable granularity makes an efficient implementation difficult without some information on the granularity from the user.

4.2 Overview of Adaptive Granularity
In this chapter, we present a new adaptive scheme, called adaptive granularity (AG), and evaluate its performance in the context of a hardware-software DSM.1 Adaptive granularity is a communication scheme that effectively and transparently integrates bulk transfer into the shared memory paradigm through a variable granularity and memory replication. Adaptive granularity solves the tradeoff problem (programmability versus the support of an arbitrarily variable granularity) by providing a limited number of granularities (i.e., 2^n cache lines up to a page) that is nevertheless sufficient to achieve gains from bulk transfer, without sacrificing any of the programmability of the shared memory paradigm and without requiring any additional hardware. For memory replication, AG assigns part of each node's local memory for replicating remote data. In effect, AG uses this local memory as a large, fully associative data cache, which eliminates much of the network traffic caused by conflict and capacity misses in the smaller hardware caches.

1 A hardware-software DSM (HS-DSM) refers to a hardware DSM that implements the coherence protocol on a separate node controller in software using handlers (e.g., Stanford FLASH [26], Wisconsin Typhoon [63]).

An adaptive granularity protocol consists of two protocols: one for fine-grain data and the other for bulk data. For scalar data and array data whose size is less than some threshold, the standard hardware DSM protocol is used and the granularity is fixed to a cache line. For large array data, the protocol for bulk data is used and the granularity varies depending on the sharing behavior at runtime. When the node controller receives a command (e.g., a load or store) from the local cache, it determines the type of communication (bulk transfer or standard load-store communication) depending on the data type2 and sends a request to the home node without specifying any size, even for a bulk data request. When the home node receives a bulk request, it determines the granularity depending on the sharing pattern and sends the data back to the requesting node. To reduce false sharing, the home node splits a memory block into two when an ownership change occurs. To exploit more spatial locality, two adjacent blocks are merged back into a larger block when both blocks are referenced by the same single node. When the data arrives, the node controller puts only the requested cache line into the cache and the rest of the bulk data goes to the local memory.

In summary, for coarse-grain sharing, AG exploits the advantages of bulk transfer by supporting a spectrum of granularities. For fine-grain sharing, it exploits the advantages of standard load-store communication by using cache line transfers.
Simulation results show that AG changes most remote requests into local requests, which in turn reduces the amount of network traffic significantly and improves performance by up to 43% over a hardware implementation of DSM (e.g., DASH). Compared with an equivalent architecture that supports fine-grain memory replication at the fixed granularity of a cache line (e.g., Typhoon), AG reduces execution time by up to 35%.

2 The type is saved on the node controller when the page mapping is made to replicate the data.

Type      Connection             Example          Communication       Granularity
MP        LC                     CM-5, Paragon    send-receive        -
HW-DSM    TC                     DASH             load-store          cache line
SW-DSM    LC                     IVY, Munin       load-store          page
                                 Amber            new prog. model     object
                                 Midway           new prog. model     region
                                 Shared Regions   new prog. model     region
IT-DSM    TC                     FLASH, Typhoon   load-store          cache line
                                                  send-receive        variable
                                                  memory-copy         variable
                                 Hybrid           new prog. model     region
DSSMP     intra: TC, inter: LC   MGS              load-store          cache line, page

Table 4.1: Communication model and granularity for each parallel system. MP and IT-DSM denote message passing and integrated DSM, respectively; DSSMP refers to Distributed Scalable Shared-memory Multiprocessors; LC and TC denote loosely coupled and tightly coupled, respectively.

The rest of this chapter is organized as follows. Section 4.3 presents the communication alternatives and the granularity for representative parallel systems and discusses the problems of protocols that maintain coherence at a fixed or an arbitrarily variable granularity, mainly focusing on integrated DSMs. Section 4.4 describes AG and presents its memory replication mechanisms and protocols. Section 4.5 describes our simulation methodology for evaluating AG and Section 4.6 presents the experimental results and discusses the performance. Section 4.7 discusses the effect of architectural variations. Related work is given in Section 4.8 and Section 4.9 concludes with a summary.

4.3 Communication Alternatives and Granularity
A communication model can be presented to the user in different ways depending on the system by which it is supported. These alternatives have different performance, cost, and design characteristics. This section briefly reviews the communication alternatives and granularity for the message passing, hardware DSM, software DSM, and integrated DSM models, and discusses the advantages and disadvantages of each approach. Table 4.1 summarizes the communication models and granularity for the different types of parallel machines considered here. As the table shows, most parallel machines support only one mode of communication, except the integrated DSM. In this chapter, a communication model refers to the high-level model visible to the user, not the low-level implementation, and the granularity refers to the unit of sharing. Load-store refers to the standard shared memory communication, and "new programming model" indicates that programs should be written or annotated using the model that the system supports. A hardware DSM (HW-DSM) refers to a machine that implements the cache coherence protocol completely in hardware (e.g., Stanford DASH, MIT Alewife). A software DSM (SW-DSM) is a shared memory abstraction implemented on top of loosely coupled multicomputers such as workstations connected by a local area network (e.g., TreadMarks [65], Midway [42]). The local node refers to the node that originates a given request and the home node is the node that contains the main memory and directory for the given physical address.
A remote node is any other node.

4.3.1 Message Passing Machines and Hardware DSM
Message passing and hardware DSM machines are two well-known representatives of parallel machines, and send-receive and load-store are usually used for each type, respectively. Although a message-passing machine usually provides more scalability than a hardware DSM, it has a disadvantage in programmability since the programmer must explicitly manage the communication for the data in an application. In contrast, a hardware DSM has the advantage of programmability and allows for efficient communication for fine-grain sharing at the fixed granularity of a cache line. For coarse-grain communication, however, the hardware DSM may not exploit system resources efficiently.

4.3.2 Software DSM
A software DSM provides a shared memory paradigm through a software coherence protocol, and it can be divided into two categories depending on whether the load-store model or a new programming model is used for communication. A software DSM that uses the load-store model (i.e., the standard shared memory programming model) typically maintains coherence at the granularity of physical memory pages (referred to as page-based). The fixed coarse granularity, however, can degrade performance severely in the presence of fine-grain sharing, though some mechanisms such as lazy release consistency [40] can alleviate the false sharing problem to some extent. IVY [34] and Munin [47] fall into this category.

To alleviate the mismatch of the page-based software DSM, several new programming models3 such as object-based and region-based software DSMs, in which a variable granularity is used in the protocol, have been proposed [6, 42, 56]. In an object-based approach a granularity is associated with each data object [6, 24], while in a region-based approach it can be defined at the level of arbitrary programmer-defined regions. In summary, a software DSM with a fixed granularity (page-based) has the advantage of ease of programming but the disadvantage of a poor match with fine-grain sharing, while a software DSM with a variable granularity (object-based and region-based) exhibits the opposite characteristics.

3 Note that they do not necessarily represent a new model, but a model with at least program annotations.

4.3.3 Integrated DSM
For integrated DSM machines (IT-DSM) that support a bulk transfer facility in a cache-coherent shared address space, a programmer is presented with three mechanisms for communicating data: (i) standard load-store, (ii) explicit messages, and (iii) a new programming model with user annotations. As Table 4.1 shows, an integrated DSM with explicit messages provides several communication models so that the user can select the appropriate model for a communication: load-store, send-receive, and memory-copy. Load-store and send-receive (or memory-copy) are usually used for fine-grain and coarse-grain communication, respectively. In the less well-known memory-copy model, a node sends bulk data from a source address directly into a destination address, rather than performing a send to a processor that requires a matching receive [71]. When send-receive or memory-copy is used as the means of communication for bulk transfer on cache-coherent shared memory machines, the granularity is variable and not necessarily cache-line aligned, since the models usually support data transfer of any size.
Another approach to exploit the bulk transfer facility is to use a new programming model so that the user's burden, such as selecting an appropriate communication model, is alleviated [33]. With this approach, a programmer provides information on the granularity, and the compiler and runtime system use this information to select the best matching communication model. One example of this approach is the Hybrid protocol. With the Hybrid protocol, programmer-supplied annotations are used to identify shared data regions for which a specific granularity is used, using software DSM techniques.

Although both approaches achieve the goal of providing a variable size granularity to the user, they present several problems. First, programs must be changed to effectively use the message passing paradigm [71], or a new programming model is needed for such integrated machines [33]. This kind of programming imposes a big burden on the user because the programmer must know both the shared memory and message passing paradigms (or the new programming model), defeating the intended benefits of using shared memory machines. Second, because bulk transfer data can be cached, a coherence problem arises between the processors that cached the data. Implementing global coherence between arbitrarily sized bulk data and standard load-store data substantially increases the hardware complexity and/or software overhead [30]. Third, a data alignment problem arises because a bulk transfer might include only a portion of a cache line rather than a full line. Supporting data alignment increases hardware and software costs [30]. Fourth, there is the question of whether more performance gains can actually be achieved through the message passing paradigm on shared memory machines. Several researchers [11, 71] have shown that bulk transfer using the message passing paradigm may not help in improving the performance of shared memory applications. The main reason for this result is the message passing overhead, even though this overhead is greatly reduced by allowing user-level access.

Figure 4.1: Overview of adaptive granularity (the local node determines the transaction type and the home node determines the grain size: if no ownership change occurs, the whole block is returned; if an ownership change occurs, the block is split into two and the half block is returned)

4.3.4 Very Large Scalable DSM
To provide more scalability and offer better performance/cost, a new shared memory multiprocessor called DSSMP (Distributed Scalable Shared-memory Multiprocessors) has been proposed recently at MIT [41]. DSSMPs have two types of hierarchical networks: a tightly-coupled internal network that connects the processors within each node of a DSSMP, and a loosely-coupled external network that connects the nodes of a DSSMP. DSSMP uses two different protocols (called MGS) for the two types of network communication. For internal network communication, a hardware DSM protocol that provides a fixed granularity of a cache line is used. For external network communication, a page-based software DSM protocol is used to amortize the high cost of external communication. Because MGS transparently provides only two fixed granularities (cache line and page), it has advantages (i.e., programmability) and disadvantages similar to those of HW-DSM and page-based SW-DSM.

4.4 Adaptive Granularity
In this section we describe adaptive granularity in detail: overview, memory replication, directory handling, and protocol.
The granularity of sharing is one of the key components that affect performance in shared-memory machines. Theoretically, the granularity can have any value: a word, a cache line, a page, a complex data structure, etc. There are two main parameters in deciding the optimal granularity: false sharing and spatial locality. False sharing occurs when different processors access different locations within the same granularity. This causes the data to bounce back and forth between the caches as if it were truly being shared. As the granularity increases, the probability of false sharing goes up while more spatial locality can be exploited [64]. Since the sharing pattern can vary widely among programs, and even within one program as it executes, we may not obtain the best performance with a fixed granularity.

To solve this problem, adaptive granularity (AG) dynamically adjusts the granularity based on observed reference behavior, as Figure 4.1 shows. When false sharing (i.e., an ownership change) is expected, the granularity is decreased at runtime. When more spatial locality is expected, the granularity is increased by merging two adjacent blocks. As the figure shows, the local node determines the transaction type (normal or bulk) and the home node determines the granularity.

AG has three main goals in exploiting both fine-grain and coarse-grain communication in one machine: programmability, high performance, and no additional hardware complexity. AG achieves programmability by allowing the user to write programs using the standard shared-memory paradigm (even for bulk transfer); AG does not require the user to use message passing commands such as send and receive to exploit bulk transfer. AG achieves high performance by using a different granularity depending on the data type and runtime conditions, which exploits the advantages of each model. AG provides these features without requiring any additional hardware.

4.4.1 Design Choices
When a variable size granularity is provided to exploit different communication patterns, several issues arise, such as the destination of bulk data and the decision on the transaction type and granularity. We discuss these issues in this section.

4.4.1.1 Static or adaptive granularity
If the compiler could analyze the communication pattern accurately for a given application, a static multi-granularity (i.e., one determined at compile time) might be more efficient than a runtime decision because the runtime overhead could be reduced. However, the problem of predicting a good value of the granularity at compile time is inherently difficult for several reasons. First, some critical information may not be available at compile time; moreover, this information usually depends on the input data. Second, the dynamic behavior of the program makes compile-time prediction difficult. Third, some aspects of parallel machines such as network contention, synchronization, cache coherence, and communication delays make it almost impossible for the compiler to predict the best value at compile time. Finally, the optimal value of the granularity tends to vary as the program executes, and this variance can be quite large on some parallel machines. In fact, the integrated DSMs that support a message-passing paradigm with global coherence pass this difficult task to the user. In AG, the determination of the granularity is delayed until runtime. At runtime, the granularity is decided at the home node, based on the sharing pattern when the reference is made.

4.4.1.2 Destination of bulk data
When bulk data is transferred, the final target can be the cache or local memory. Transferring bulk data into the cache can exploit more spatial locality without incurring a cache miss, but the bulk data may displace other useful data. Transferring it into local memory can exploit the advantage of memory replication more easily, though subsequent accesses are satisfied from the local memory instead of the cache. In AG, only the requested cache line is transferred to the cache and the rest of the bulk data goes to the local memory. This is easily done since AG transfers bulk data as a sequence of cache lines.

4.4.1.3 Determination of transaction type and granularity
Given that we have a variable size granularity and two types of transactions (e.g., standard read and bulk read), one of the issues is who determines the transaction type and granularity: the local or the home node. For the transaction type, either the local or the home node can determine it if the decision is based on just the data type, such as array or scalar. If the home node determines it, this can be implemented as follows. When memory is allocated on the home node, a flag is set in the directory to indicate a bulk type if the data size is greater than some threshold value. A local node sends a standard request even for bulk transfer because the node does not have the information. When the home node receives the normal request, it sends bulk data, assuming that the local node needs bulk data. With this approach, the implementation is simple because only the home node contains the information. This approach, however, has the disadvantage that a local node cannot request fine-grain data even when it needs only a small amount of data.

To allow a local node to choose an appropriate transaction type, it must have the information on the data type. This scheme can be implemented as follows. When a local node first accesses a shared page on a remote node, a page fault occurs and appropriate actions are taken to map the page. At this time, a flag is set on the local node to indicate the bulk transaction if the home node says it is array data. When a command is given to the node controller, it checks the flag and sends a bulk request if the page is set to bulk mode. If the local node does not want a bulk transfer (i.e., it needs a small amount of data), it resets the flag and then sends a normal request to the home node. Because the local node knows the transaction type better, this approach appears more reasonable, and we chose this scheme to give the user more flexibility.4

4 This approach has the same effect as the first one in our current implementation because the compiler does not give any information to the node controller.

For the granularity, we do not have such a choice as for the transaction type. Even though the local node (i.e., the consumer) may know better how much data is needed for a bulk transfer, it cannot know the sharing pattern at the home node. Thus, in AG, the home node determines the granularity. Another alternative is for the local node to specify just the maximum size needed and for the home node to determine the final granularity to send. To implement this scheme, the compiler gives the node controller the information on the size needed for bulk transfer (for example, before the beginning of a loop) and the node controller uses it to request a bulk transfer.

4.4.2 Variable Grain Memory Replication
Memory replication is one of the techniques used to tolerate long remote memory latency. The main purpose of memory replication is to change a remote access caused by conflict or capacity misses into a local access by mapping the remote virtual address to a local physical page and by copying the remote data into local memory so that subsequent accesses are satisfied locally. To support memory replication with a variable granularity (i.e., 2^n cache lines up to a page), AG maps a virtual address of shared data to local physical memory at the page granularity, but maintains coherence at the cache line level.

Figure 4.2: Memory replication (one virtual page replicated on two nodes with different physical mappings and granularities; the line access states are INVALID, READ_ONLY, and READ_WRITE)

To maintain consistency between the local cache and local memory, the node controller allocates a data structure called the NCPT (Node Controller Page Table) for line access rights. The NCPT has one entry for each page of physical memory that is mapped to the local node on behalf of the remote home node. It contains information about how the page is mapped. Each page table entry specifies the home node and includes various fields to enforce the access rights and to indicate how transactions are requested from the home node (e.g., bulk read or normal read). When a cache line is displaced out of the cache and cached in again, the table provides the access right.

Figure 4.2 illustrates how one virtual page is replicated on two different nodes with a different physical mapping on each node and various granularities. In this example, a granularity of two cache lines is replicated on the two nodes with the access right READ_ONLY, and a granularity of one cache line is replicated on one node with the access right READ_WRITE. When a node first accesses a shared page on a remote node, the operating system maps the virtual address to a local physical page and gives the node controller information such as the home node ID and the page type (bulk or non-bulk data). This information was stored in the page table when the memory was allocated on the home node. When the node controller receives a transaction on the page mapping from the operating system, it stores the home node ID and page type in the NCPT and initializes the line access tags to Invalid. When an access violates the access right of a cache line, the appropriate action is taken. (Access violations will be discussed in Section 4.4.4.) To distinguish bulk data from non-bulk data, we disallow both kinds of data from residing on the same virtual page.

/* Given page p and block offset i, return the offset d of the
   representative directory entry covering block (p,i).
   N is the number of cache lines per page; D[(p,d)].g is the grain size. */
search_dir(p, i)
{
    size = N;                /* LINES_PER_PAGE */
    d = 0;
    g = D[(p,d)].g;          /* grain size of the current entry */
    while (g != size) {
        size = size / 2;     /* descend one level of the buddy hierarchy */
        if (i >= d + size) {
            d += size;       /* requested block lies in the upper half */
            g = D[(p,d)].g;
        }
    }
    return d;
}

Figure 4.3: Algorithm for directory search

4.4.3 Directory and NCPT Handling
Given that we have a variable granularity (2^n cache lines) and fine-grain memory replication, one of the issues related to the overhead of AG is how to handle the directory and the NCPT efficiently. Because several entries in both tables can be involved in one transaction (e.g., 4 cache lines for a granularity of 4), we have to make either an entry for each minimum block, or a representative entry that covers the entire range of blocks. There is a tradeoff between the two approaches: under the first approach, search can be performed more efficiently, at the cost of more overhead for making the entries. With the second approach (representative entry), the overhead of making the entries is small, but search takes more time because the handler must first find the representative entry. For the NCPT, we adopted the first scheme, for easy search, because the NCPT is referenced frequently. Thus, NCPT[(p,i)] contains the access right information for the ith block of page p. For the directory, however, efficient updates such as split and merge are more important than search, and the directory is not frequently accessed because in AG many cache misses are satisfied from the local node. Thus we adopted the second scheme of making a representative entry for the directory. The representative entry for a memory block (p,i) is the directory entry at the first block of the grain that covers it. The algorithm for searching for the representative directory entry is shown in Figure 4.3.

4.4.4 Adaptive Granularity Protocol
An AG protocol consists of a hardware DSM protocol (HDP) and a bulk data protocol (BDP). For small data, whose reference behavior is always fine-grain, the standard invalidation-based hardware DSM protocol is used and the granularity is fixed to a cache line. For large data, whose expected reference behavior is difficult to predict, BDP is used and the granularity varies depending on the sharing behavior at runtime. BDP is similar to HDP in structure, such as the number of states, except that BDP is implemented using the buddy system [50] to support a variable granularity efficiently. The buddy system was adopted originally in a memory management algorithm to speed up the merging of adjacent holes when a memory block is returned, by exploiting the fact that computers use binary numbers for addressing.

Local, Remote Node => Home Node:
  RREQ     Read Data Request
  WREQ     Write Data Request
  ACK      Acknowledge Invalidate
  WACK     Acknowledge Write Invalidate and Return Data
Home Node => Local, Remote Node:
  RDATA    Read Data
  WDATA    Write Data
  RDATAM   Rest of Bulk Read Data
  WDATAM   Rest of Bulk Write Data
  WSHIP    Ownership
  BUSY     Transit State
  INV      Invalidate Request
Local Cache => Local Node Controller:
  REPR     Replaced Read Data
  REPW     Replaced Write Data

Table 4.2: Message types used to communicate between the local, home, and remote nodes

Table 4.3: State transition table for the adaptive granularity protocol (one row per transition arc of Figure 4.4, giving the incoming message type, condition, actions, and outgoing messages). <message> => <pid> indicates that <message> is sent to <pid>. INV: INVALID; RO: READ_ONLY; RW: READ_WRITE; RT: READ_TRANSIT; WT: WRITE_TRANSIT; EMP: EMPTY; s-e denotes the range from the starting to the ending block of a bulk transfer; "ln" and "hn" denote the local node and the home node, respectively. NCPT[].s: state; NCPT[].f: cache present bit; N: number of cache lines per page; D[].s: state; D[].g: grain size; D[].c: sharer count; D[].m: member; D[].a: ack count.

Figure 4.4: State transition diagram for the adaptive granularity protocol: (a) local and remote node; (b) home node

Table 4.2 shows the message types used by AG to communicate between the local, home, and remote nodes. Most of them are similar to those of the standard hardware DSM protocol, except that several extra message types (e.g., RDATAM) are introduced to support bulk transfer. Figure 4.4 shows the state transition diagrams for the local (and remote) node and the home node assuming sequential memory consistency, mainly focusing on BDP. Table 4.3 is a specification of the AG protocol (again mainly focusing on BDP) and describes the actions to be taken and the outgoing messages for the transition arcs in Figure 4.4. Most of the message types and states of HDP have a corresponding message type or state in BDP; the BDP message types and states are distinguished from their HDP equivalents by the prefix BK_. For example, RREQ and BK_RREQ are used for a read data request in HDP and BDP, respectively. In Tables 4.2 and 4.3 and in Figure 4.4, the message types and states are not distinguished, to simplify the figure and tables. Messages for bulk transfer (except for the memory request messages such as BK_RREQ, for which the local node cannot know the range of the bulk transfer) contain the starting address and size so that the handlers can deal with them efficiently.
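For reference in the node-by-node walkthrough that follows, the sketch below collects the per-block directory fields named in the legend of Table 4.3 and shows, in simplified form, how a representative entry is split into buddies and merged back. The struct layout and the helper functions are illustrative assumptions only; they are not the protocol handlers themselves, which in addition exchange the invalidation and acknowledgment messages described below.

/* Per-block directory bookkeeping, using the field names from Table 4.3's
 * legend.  Offsets are in cache lines within one page. */
typedef struct dir_entry {
    int      s;   /* state: EMPTY, READ_ONLY, READ_WRITE, ...          */
    unsigned g;   /* grain size in cache lines (a power of two)        */
    unsigned c;   /* sharer count                                      */
    unsigned m;   /* member set of sharing nodes (bit vector here)     */
    unsigned a;   /* outstanding invalidation acknowledgements         */
} dir_entry;

/* Split the block whose representative entry is at offset d into two
 * buddies of half the size (assumes D[d].g > 1); the entry at d + g/2
 * becomes a new representative entry. */
static void split_dir(dir_entry *D, unsigned d)
{
    unsigned half = D[d].g / 2;
    D[d + half]    = D[d];      /* upper buddy inherits state and sharers */
    D[d].g         = half;
    D[d + half].g  = half;
}

/* Merge the buddies at d and d + g back into one block when both have the
 * same state and the same single sharing node (the single-sharer test is
 * simplified here; alignment of d to a 2g boundary is also assumed). */
static void merge_dir(dir_entry *D, unsigned d)
{
    unsigned g = D[d].g;
    if (D[d + g].g == g && D[d + g].s == D[d].s && D[d + g].m == D[d].m)
        D[d].g = 2 * g;         /* the entry at d now represents the pair */
}

Because only the two representative entries of a buddy pair are touched, a split or merge costs a constant number of directory updates regardless of how many cache lines the block spans, which is the property the home-node handlers rely on below.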
The message types and state are distinguished from their equivalent message types and states on HDP by the prefix B K - For example, RREQ and BK-RREQ are used for a read data request for HDP and BDP, respectively. In Table 4.2 and 4.3 and Figure 4.4, the message types and state are not distinguished to simplify the figure and tables. Messages for bulk transfer (except for the memory request messages such as BKJIREQ, for which the local node cannot know the range of bulk transfer) contain the starting address and size so that the handlers can deal with them efficiently. 77 Local Node A local node maintains consistency on NCPT and a state transition diagram for the local node is shown in Figure 4.4.(a). After the mapping of a virtual page to a local physical memory is made (discussed in Section 4.4.1), when an application attem pts to access the block, the line-access-fault handler is invoked because all line access tags for the page were initialized to Invalid. The handler retrieves the home node’s ID from the local table and sends a request (bulk or non bulk depending on the page type) to the home node. This situation is represented by arcs 1 and 2 in Figure 4.4.(a). At the home node, the message invokes the handler that performs the appropriate coherence actions and returns the data (sends the requested memory block first and then the rest of them as a sequence of a cache line to utilize the memory and network bandwidth if the request was bulk transfer). When the requested block arrives from the home node, the message handler writes the data into the cache and changes the block’s access state. If the arrival data is the requested block (e.g., BK-RDATA, not BK_RDATAM), the handler also changes the state for the rest of the bulk data to a transit state so that a request to the transit data is suspended until it arrives without sending a request to the home node. This situation is represented by arcs 10, 11, 13 and 14 in Figure 4.4.(a). When the rest of the bulk data (e.g., BK.RDATAM) arrives (arcs 12 and 15 in Figure 4.4.(a)), the message handler writes the data into local memory and changes the line’s access state. Home Node When the home node receives a bulk request from the local node, it takes similar actions to the standard hardware DSM protocol, as shown in Figure 4.4. (b). The main difference between HDP and BDP is in the directory handling. For BDP there is one more field (grain size) in the directory to support the variable granularity. Initially, the grain size of a memory block is N (the number of cache lines per page) and the representative directory for any address of the page is D[(p,0)], regardless of the requested memory block. When the home node receives BKJIREQ or BK.WRITE with the address (p,i), the directory handler on the home node first searches the representative directory from D[(p,0)], as shown in Figure 4.3. With this scheme, the search can cost up to log2N, but directory operations such as split and merge are efficient because only two representative directories are accessed to split or merge. Furthermore, only two 78 message types (BK-READ and BK_WRITE) are needed to search the representative directory. For the other messages such as acknowledgment, the address for the representative directory is given with the size and thus the search is not needed. After finding out the representative directory for the address, the handler takes some actions to send bulk data. 
Initially, the state for D[(p,0)] is E M P TY and thus the handler sends to the local node a whole page (a requested block is sent first and then the rest of the page is sent as a sequence of a cache line), changes the state and sets the grain size field to N (number of cache lines per page). This situation is represented by arcs 31 and 32 in Figure 4.4.(b). When a bulk request requires some ownership change (e.g., READ-ONLY => READ.WRITE), the memory block is split into two sub-blocks (called buddies) with the same size if the granularity for the block is greater than 1. The handler sends B K J N V to the remote nodes to invalidate the half of the original block containing the requested offset within a block and updates the representative directories for the two sub-blocks. The other half of the original block is untouched so that the remote nodes continue to access the sub-block. This situation is illustrated by arcs 34, 35, and 36 in Figure 4.4.(b). When all acknowledgments for the invalidation requests arrive from the remote nodes (arcs 40 and 41 in Figure 4.4.(b)), the handler sends the sub-block to the local node. At this time, if the grain size for the block is 1 (minimum granularity), the handler checks the directories of the two adjacent blocks to see whether merges are possible. The handler starts and continues to merge the two blocks as long as both blocks have the same state and only a single same node as a member for the blocks. As an example assuming a page size of 8 cache lines, in order to migrate a page from one node to another, the page is split three times (into 4, 2, and 1 blocks, respectively) and merged back into the original page when the last split is made. In summary, a memory block is split into two sub-blocks when a bulk data request arrives with an ownership change and the directories are checked for a merge at the time the ownership change completes with granularity 1. 79 C om m on Memory consistency sequential Memory access latency 35 cycles CPU cache direct-map Cache line size 16 bytes Cache size 4 Kbytes Page size 4 Kbytes TLB miss 25 cycles Switch delay 2 cycles Wire delay 2 cycles Channel width 2 bytes F G Only (H an d ler O verhead) Memory request 14 cycles Memory reply 30 cycles Data arrival 20 cycles A G Only (H an d ler O verhead) Memory request 14 cycles Memory reply 40 cycles Data arrival 20 cycles Table 4.4: Simulation parameters 4.5 Sim ulation M ethodology This section presents the simulation framework used to study the performance of adaptive granularity. In Section 4.5.1 and 4.5.2, we present the architectural as sumptions and simulation environment, respectively. In Section 4.5.3 we describe the benchmark programs. 4.5.1 Architectural Assumptions We simulate a system of 16 processing nodes, which are interconnected by a bi-directional mesh network. Each node consists of a processor, a cache, a portion of the shared memory, and a node controller. The node controller contains a pro grammable processor which processes all messages from the processor and network using corresponding handlers with active messages [1]. The pages are allocated to the memory modules in a round-robin way. The physical memory is equally distributed among the nodes and a sequential memory consistency model [22] is enforced. Cache coherence for shared memory is maintained using a distributed directory-based pro tocol. It is similar to Stanford FLASH and Wisconsin Typhoon except that the bulk transfer facility for a message passing paradigm is not needed for AG. 
Table 4.4 summarizes the default machine parameters and values used in our simulations. With these values, the machine has the following read latencies without contention: 1 cycle for a cache hit, an average of 120 cycles for a 2-hop global access, and an average of 140 cycles for a 3-hop global memory access. In Section 4.7, we will consider various architectural variations such as a longer line size, a larger cache size, 64 processors, and increased network latency.

4.5.2 Simulation Environment

The machine is simulated using an execution-driven simulator called Trojan [66]. Trojan is an extended version of MIT Proteus [52] and supports virtual memory simulation. It also allows both process-based (e.g., SPLASH [59]) and thread-based (e.g., SPLASH2 [60]) applications to be simulated without requiring the program to be modified, and it provides a detailed model of the various hardware components. All hardware contention in the machine is simulated, including the network. For the network, the model proposed by Agarwal [2] is used. Instruction references are assumed to take one cycle and virtual memory is enabled in all simulations.

4.5.3 Benchmark Applications

In order to compare the performance of adaptive granularity with other approaches, we use four scientific applications that have different communication patterns. The applications we study are shown in Table 4.5. Cholesky is taken from the SPLASH [59] suite and the rest (FFT, LU and Radix) are taken from the SPLASH2 benchmarks [60].

  Program    Description                                  Input Data Set
  Cholesky   Cholesky factorization of sparse matrices    bcsstk14
  FFT        Fast Fourier transformation                  64K complex points
  LU         Blocked dense LU factorization               260x260
  Radix      Integer radix sort                           256K keys, radix=1024

Table 4.5: Application characteristics

4.6 Experimental Results and Discussion

AG attempts to improve performance through memory replication and bulk transfer. In order to quantify the performance gains from each scheme, we simulate three systems: a hardware DSM (HW), a hardware-software DSM that provides fine-grain replication with the fixed granularity of a cache line (FG), and a hardware-software DSM that supports Adaptive Granularity (AG). By comparing the performance of HW and FG (FG and AG), we can see the performance improvement (or degradation) that is achieved via memory replication (bulk transfer). Because we want to see the performance gains achieved mainly from the bulk transfer of AG (not from memory replication), we normalized the execution times of HW and AG to that of FG.

  Benchmark  Type  Overall Miss Ratio (%)  Local (R, W) (%)    Remote (R, W) (%)   Private RW (%)
  Cholesky   HW    9.6                     4.2 (3.3, 0.9)      85.0 (71.1, 13.9)   10.8
             FG    9.6                     74.8 (73.1, 1.7)    15.2 (7.9, 7.3)     10.0
             AG    9.6                     76.3 (73.3, 3.0)    16.2 (8.2, 8.0)     7.5
  FFT        HW    18.6                    5.1 (3.1, 2.0)      82.2 (49.3, 32.9)   12.7
             FG    18.6                    51.8 (28.7, 23.1)   35.5 (23.6, 11.9)   12.7
             AG    18.6                    85.5 (51.2, 34.3)   1.8 (1.2, 0.6)      12.7
  LU         HW    11.6                    1.1 (1.1, 0.0)      96.7 (63.8, 32.9)   2.2
             FG    11.6                    90.8 (58.9, 31.9)   7.0 (6.0, 1.0)      2.2
             AG    11.6                    97.6 (64.7, 32.9)   0.2 (0.2, 0.0)      2.2
  Radix      HW    8.8                     3.5 (1.4, 2.1)      92.9 (41.5, 51.4)   3.6
             FG    8.8                     77.4 (35.4, 42.0)   18.6 (7.4, 11.2)    4.0
             AG    8.8                     88.1 (38.9, 49.2)   7.9 (3.9, 4.0)      4.0

Table 4.6: Decomposition of cache misses. R: read, W: write

[Figure 4.5: Simulation results for default machine parameters -- execution time for HW, FG, and AG on each benchmark, normalized to FG and broken down into busy time, read/write stall (local, remote, private), and synchronization components.]
Figure 4.5 shows the relative execution times of three types (HW, FG, AG). The execution time of an application is broken down as follows: (1) busy time spent executing instructions (busy), (2) read stall time, (3) write stall time, and (4) stall time due to synchronization (sync). The stall times for read and write are further decomposed into local, remote, and private. The local stall time represents the stall time satisfied from the local node for shared memory accesses. Thus, a cache miss that is satisfied from the replicated local memory is classified as a local cache miss. Table 4.6 shows the cache miss ratio for each benchmark and decomposes the cache misses into local, remote, and private misses. Note that each component (Local, Remote, and Private) does not represent a cache miss ratio, but the percentage over the total number of cache misses. Because FG and AG do not change the cache miss ratio itself directly (it can be changed slightly because of changes in cache mapping), we do not show it for each component in Table 4.6. When we compare overall performance of FG with that of HW, FG outperforms HW for three applications (Cholesky, LU, Radix). For FFT, however, FG is slightly outperformed by HW, mainly due to the handler overhead. As the breakdowns for the execution times and cache misses show, the gains from FG over HW are mainly due to the reduction in remote memory latency that arise from memory replications. They show that FG changes the cache misses caused by conflict or capacity misses to the local access in most cases by supporting fine-grain memory replication. However, FG has a limitation in reducing remote latency, especially the one caused by the migrating data type as shown in a FFT case. For FFT, though FG changes remote references from 82.2% to 35.5% in cache misses, the execution time is slightly increased for FG. This leads to the argument th a t supporting only fine grain memory replication is not enough in changing remote accesses to local accesses. 83 For the overall gains from AG over F'G, Figure 4.5 shows that AG reduces the execution times of the programs over FG by 7.7%, 35.4%, 12.2%, and 11.5%, re spectively. FFT showed the greatest improvement because it generates the most migrating memory references of the programs simulated. The gains are mainly due to the remote latency reduction that arises from bulk transfer. In order to migrate one page (N cache lines) fom one node to another, FG communicates 3N times (INV, WACK, WDATA for each cache line) while AG performs the migration with only 31og2N times communications. Compared with HW (i.e., normalized with HW), AG reduces the execution times of the programs by 18.1%, 35.1%, 42.7%, and 29.3%, respectively (this is not shown in Figure 4.5). Another observation from the results in Figure 4.5 is that processor utilizations (i.e., the percentage of busy time over the total execution time) appears to be low: 19%, 16%, 26%, and 29%, respectively. The main reason for the low utilizations is that the applications exhibit high cache miss rates (see Table 4.6), mainly caused by a small cache (4K). Furthermore, the latency due to write misses is not eliminated because sequential consistency is enforced. If we increase the cache size, cache misses caused by conflict and capacity misses can be reduced, which thus increases processor utilizations. However, the increased channel utilization does not necessarily decrease the relative effectiveness of the adaptive granularity. 
The effect of the cache size on the performance of AG is discussed in the next section.

4.7 Effect of Architectural Variations

In this section, we evaluate the impact of several architectural variations on the performance of adaptive granularity. The variations we study include changes in the line size, cache size, number of processors, and network latency.

4.7.1 Cache Size

For our experiments so far, the cache size has been set to 4 Kbytes. In order to see whether AG also works well for larger caches, we ran additional experiments with the cache size set to 16 Kbytes. For the larger cache size, we used the same data set sizes because we want to see the performance changes due to the larger caches alone. The results of our experiments are presented in Figure 4.6.

[Figure 4.6: Effect of cache size variation -- execution time breakdown for FG and AG on FFT and LU with 4K and 16K caches.]

We see that for FFT, AG has a 39% shorter execution time than FG with the larger caches, while it has a 35% shorter execution time with 4-Kbyte caches, which runs contrary to intuition. The explanation for this difference is that the increase in cache size may reduce conflict misses, but it cannot reduce true sharing misses. Because FG also provides memory replication, the effect of the reduced conflict misses is small. Thus increasing the cache size cuts the execution time, which results in an increase in the remote stall component (not in absolute value, but as a percentage of the stall time).

4.7.2 Line Size

The next variation we study is how the effectiveness of AG changes when the line size is increased from 16 bytes to 64 bytes with the channel width fixed. Increasing the line size has the advantage of providing better spatial locality and the disadvantages of increasing false sharing and making remote accesses take longer, since the channel width is fixed. Thus it is interesting to see how performance is affected by the line size change. Figure 4.7 presents the resulting execution times for FFT and LU.

[Figure 4.7: Effect of cache line size variation -- execution times of FG and AG for FFT and LU with 16-byte and 64-byte lines.]

As the figure shows, the effectiveness of AG is decreased. The main reason for this is that the remote-stall component is decreased (from 58.4% to 48.7% for FFT and from 13.3% to 8.6% for LU) due to the increase in busy time resulting from the increased spatial locality. This means that the benefits achieved by exploiting more spatial locality more than offset the overhead caused by false sharing, and therefore there is less room for performance improvement by AG.

4.7.3 Number of Processors

Next we study the performance effect on AG as the number of processors increases from 16 to 64. For this experiment, larger data sets are used. For FFT, 256K complex points are used instead of the 64K used for 16 processors. For LU, 460x460 matrices are used instead of 260x260. As the number of processors increases, the variance of the network latency becomes larger, since the average hop count goes up. In Figure 4.8, we show the experimental results.

[Figure 4.8: Effect of the number of processors -- execution time breakdown for FG and AG on FFT and LU with P=16 and P=64.]

We see that increasing the number of processors reduces the effectiveness of AG for FFT and does not change it much for LU.
The main reason for the FFT case is that the cache miss ratio increased substantially (from 19% to 41%) because a larger data set is used with the same cache size. Because the additional cache misses are caused by cache conflicts (not true sharing misses), they are satisfied from the local node. As the figure shows, the local stall component increases from 20.5% to 38.9%, and AG cannot cover this increase. For LU, the cache miss ratio did not change much (from 11.6% to 11.9%), even though a larger data set is used with the fixed cache size. Even with this small change in cache miss ratio and a longer network latency, the effectiveness of AG is reduced slightly with 64 processors. The explanation is that the data set size was not increased as much as the number of processors, due to simulation limitations of time and space. This resulted in less work (and/or data) being allocated to each processor.

4.7.4 Network Latency

The final variation we study is how the effectiveness of AG is changed by a variation in the network latency, assuming the same number of processors. Until now, we have assumed that the switch and wire delays are each 2 cycles, which results in around 140 cycles for a 3-hop remote memory access without contention. Figure 4.9 shows the resulting execution times for FFT and LU, assuming that the delay is doubled (i.e., 4 cycles each). With this assumption, a 3-hop remote memory access takes an average of 230 cycles without contention.

[Figure 4.9: Effect of network latency variation -- execution times of FG and AG for FFT and LU under the default (N1) and doubled (N2) network delays.]

We observe in Figure 4.9 that AG works better under a longer network latency. As the network latency increases as a result of increased wire (or switch) delays, the remote-read and remote-write components become larger and thus there is more room for improvement.

4.8 Related Work

Adaptive granularity improves performance by reducing memory latency through memory replication and variable granularity. Memory replication has been used differently depending on the system by which it is supported. Cache-only memory architectures (COMA) [20, 36] turn all of the local memory into a cache and replicate all shared data in it. Software DSMs perform memory replication at page granularity in the operating system, or at the granularity of user-defined objects or regions in the runtime system. Page-based memory replication, however, is a poor match for fine-grain applications and thus experiences severe performance degradation. To solve this problem, Typhoon allows memory replication to be done at a much finer granularity (i.e., a cache line) through a mechanism called Stache [63]. Stache consists of user-level handlers such as a page-fault handler, a message handler, and line-access-fault handlers. Our particular implementation of memory replication borrows from that of the Typhoon system [63]. The main difference is that Typhoon uses tagged memory with a fixed cache-line granularity, while AG uses general memory with a variable granularity. Several other approaches have been proposed to support multiple granularities in the context of shared-memory machines. Dubnicki and LeBlanc [21] describe a hardware cache-coherent system that dynamically adjusts the cache block size based on the reference behavior. Cache blocks are split when false sharing occurs and merged back into a larger cache line when a single processor owns both of the sub-blocks.
Compared to adaptive granularity, it cannot not achieve a large granularity such as a page nor provide flexibility since a cache line is split and it is implemented in hardware. Galactica Net [68] and MGS [41] support two fixed size granularities of a cache line and a page. Both schemes select between cache line grain and page grain blocks, depending on shared memory reference behavior. Another alternative approach for achieving benefits from bulk transfer is to sac rifice the programmability of a shared memory paradigm to some extent. With this approach, application annotations or explicit messages are used in an application to communicate bulk data. Hybrid protocol [33] and Shared Regions [56] allow co herence to occur at any granularity with user annotations to identify the regions in 89 which a specific granularity is applied. With explicit message passing communica tion, a programmer is presented with two communication paradigms (load-store and message passing) that he has to to select an appropriate model for a communication [71]. Even though application annotations and explicit messages may potentially exploit a variable granularity better than adaptive granularity does, they presents several problems such as programmability and increased hardware complexity. 4.9 Chapter Sum m ary Several studies have shown that the performance in multiprocessor systems is sensi tive to the granularity supported by the system and the sharing patterns exhibited by application programs. For fine-grain applications, a traditional hardware DSM provides an efficient communication through the fixed granularity of a cache line. For coarse-grain applications, however, this fine-grain communication can be less ef ficient than bulk transfer. The main advantage of bulk transfer is that we can reduce communication costs through fast data transfer and replications. A mismatch be tween the granularity and communication behavior can result in serious performance degradations. In this chapter we proposed a new communication scheme called Adaptive Gran ularity (AG) th at achieves high performance without sacrificing the programmability of the shared memory paradigm, through the support of memory replication and use of different protocols depending on the data type. For small size data, a standard hardware DSM protocol is used so that fine grain communications are optimized through the fixed size granularity of a cache line. For large size array data in which it is difficult to find the best match between the granularity and sharing patterns, the protocol that dynamically adjusts the granularity according to the reference behav ior is applied. A memory block is split when false sharing occurs, and merged back into a larger memory block when some conditions are met to exploit more spatial locality. Simulation results show that adaptive granularity eliminates a substantial fraction of remote cache misses through the memory replication and more spatial locality gained from bulk transfer. AG improves performance up to 35% over the equivalent system with the fixed size granularity and up to 43% over the hardware DSM. 90 Chapter 5 Trojan Sim ulator The last two chapters discussed adaptive prefetching and adaptive granularity in detail. To evaluate their relative performance on parallel shared-memory machines, we developed an execution-driven simulator called Trojan. This chapter discusses the simulator in detail. 
5.1 Lim itations o f Current Sim ulators Although execution-driven multiprocessor simulators have been shown to be effec tive in simulating large programs running on complex machines, sometimes these simulators tend to offer a limited set of features which can potentially restrict their applicability. The three most important features to consider are: 1) the specific shared-memory execution model supported; 2) the ability to simulate virtual mem ory; and 3) whether or not applications need to be modified in order to conform to the simulator requirements. With respect to the execution model, simulators tend to support one of two mod els: process-based (e.g., SPLASH [59]) or thread-based (e.g., SPLASH2 [60]). The difference between them is that, in the process model, heavyweight processes created with the Unix “fork” receive their own private copies of all global variables (private global and not shared global variables), whereas in the thread model, lightweight processes share the same virtual address space, and hence all global data. Most sim ulators use the same model as the applications they simulate (i.e., Unix processes under the process-model) and thus support only a single set of semantics. This, how ever, presents two problems: first, a separate simulator is needed for each model, 91 which is inconvenient and expensive, and second, simulating a process model using Unix processes is very expensive mainly due to the high cost of context switching. For example, the Stanford Tango simulator [61] supports only the process model and can only simulate applications sharing these semantics. In contrast, Stanford’ s Tango-Lite [62] (a successor to Tango) and MIT’s Proteus [52] use threads to simu late only thread-based applications. In order to simulate process-based applications in these simulators, all private global variables (variables visible by a single pro cess/thread) which can be modified during the parallel stage have to be replicated (manually modified) [62], This imposes a significant burden on the user. Another important limitation of many simulators is their lack of support of vir tual memory execution. While most Massively Parallel Processing research machines (MPPs) do not support virtual memory, recent commercial MPPs such as KSR and Intel Paragon do [36, 51]. Furthermore, as networks of workstations [49] (NOWs) emerge as an alternative to dedicated MPPs, the performance impact of virtual memory features such as TLB misses, page faults, and working sets has become more significant. Thus, we believe that in order to reflect more realistic behavior, future simulators will need to offer some support for it. This, however, has been ignored in most state-of-the-art simulators. A third important limitation of many simulators is that they sometimes require that applications be annotated in order to simulate them correctly. For example, Proteus uses an extra symbol to represent accesses to shared memory. Though this simplifies writing the simulator, it imposes two problems on users: first, application programs must be modified to meet the simulator’ s requirements, which can be quite burdensome on a complex application, and second, the annotations can affect the code produced by compilers, such as register allocation, stack handling and timing. This, in turn, impacts cache behavior and reduces overall accuracy. 
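As an illustration of the first limitation, the fragment below shows the kind of hand transformation a process-model program needs before it can run on a thread-based simulator: every private global variable that is written during the parallel phase must be replicated per thread and every reference rewritten. The variable and function names are invented for the example and are not taken from any particular benchmark.

/* Original process-model code: 'partial_sum' is a private global, so each
 * forked process automatically gets its own copy. */
int partial_sum;

void worker(void)
{
    partial_sum = 0;
    /* ... accumulate into partial_sum ... */
}

/* Hand-modified version required by a thread-based simulator: the private
 * global is replicated per thread and indexed by the thread's id. */
#define MAX_THREADS 64

int partial_sum_rep[MAX_THREADS];

void worker_thread(int my_id)
{
    partial_sum_rep[my_id] = 0;
    /* ... every other reference to partial_sum must be rewritten too ... */
}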
92 5.2 Overview o f Trojan Sim ulator To solve efficiently some of the limitations of current simulators, we developed an execution-driven simulator called Trojan, which is an extended version of MIT Pro teus. Its main features are: 1) it supports the efficient execution of both process- model based and thread-model based applications. In particular, it uses a “copy-on- write” mechanism to simulate efficiently process-based applications using a threads package; 2) it provides functionality for virtual memory simulation, making Trojan the first execution-driven simulator to do this; and 3) Trojan does not require mod ifying an application, which increases accuracy and usability. In addition to these features, Trojan also offers support for several other functions such as prefetching and relaxed memory consistency [18]. The rest of this chapter is organized as follows. Section 5.3 presents background material by briefly reviewing the class of machines we are able to simulate and dis cussing the main principles behind execution driven simulation. Section 5.4 presents two examples of an execution-driven simulator: Stanford Tango and MIT Proteus, while Section 5.5 discusses the two execution models for shared-memory applications and the scheme Trojan uses to efficiently simulate each model. Section 5.6 discusses the relevant issues behind virtual memory simulation. The implications of annotat ing applications in order to simulate them and other useful features of Trojan are discussed in Section 5.7. The overall performance of Trojan is given in Section 5.8. Finally, Section 5.9 concludes with a summary and a brief discussion of future work. 5.3 Background Trojan is an execution-driven simulator with a threads package, and simulates MIMD multiprocessors in which processing nodes are connected via a direct network such as a mesh or a hypercube. In this section we present the general machine model that Trojan is able to simulate and we then briefly review the main ideas behind execution-driven simulation. 93 To or from Nodei Cache Processor Directory Local Memory Network Interface Interconnection Network (eg. fc-ary n-cube network) Bus Network Figure 5.1: NUMA Architecture 5.3.1 Target Machines Figure 5.1 shows the model of target machine Trojan simulates. It consists of N pro cessing nodes, which are interconnected by a direct network like a k-ary n-cube. Each node consists of a processor, a cache, a portion of the shared memory, a directory, and a network interface, all of them linked by a bus. Processors are connected di rectly to their respective caches. The physical memory is equally distributed among the nodes and a particular memory consistency model (sequential, relaxed, etc) [22] is enforced by the hardware. Cache coherence for shared memory is maintained using a distributed directory-based protocol. We believe that this is a reasonable abstraction for many large scalable shared multiprocessors (e.g., Stanford DASH [18], Stanford FLASH [26], and MIT Alewife [3]), since it captures the essence of the memory hierarchy of such machines. Trojan applications are written in C using the ANL macros [5]. These macros provide a variety of abstractions such as locks, barriers, and process (or thread) management. 5.3.2 Execution-Driven Simulation There are the two well-known methods for parallel computer simulation: trace- driven and execution-driven simulation. Trace-driven simulation decomposes the system into two components: a trace generator and a memory/network simulator. 
The simulator emulates the execution of the target machine based on the addresses 94 the trace generator produces. Though trace-driven simulators are relatively easy to build, their accuracy is limited by the timing assumption the address generator makes [61]. Execution-driven simulation, on the other hand, interleaves the execu tion of the application program with the simulation of the target system, without having to generate an intermediate address trace. This interleaving (unlike timing as sumptions in trace-driven simulation) allows more accurate simulation of contention and interactions amongst the nodes [61]. Execution-driven simulation with a threads package allows one process to be multiplexed amongst the various activities in the target machine. Sequences of instructions are directly executed until the program performs globally visible oper ations such as shared memory references. When this happens, the simulated time is updated and control is returned to the simulator engine for the scheduling of future events. The simulator is the main thread; it maintains a queue of requests, sorted by timestamp. While the queue is not empty, the simulator repeatedly removes the earliest request and calls its associated handler. When the queue becomes empty, the simulation is done. As shown in Fig. 5.2, the compilation steps involved in executing an execution- driven simulator are: macro expansion, compilation into assembly code, augmenta tion of the assembly code, assembly, and linkage. The first step is to expand the macros which provide various abstractions such as locks and barriers. The output of the macro expansion is a C source file which is then compiled into assembly code. This assembly code is augmented to insert simulation calls when appropriate. In structions are added to each basic block so that when that block is executed, the extra instructions increment a time counter with an amount corresponding to the number of cycles required by the processor to execute the original basic block. At special events, such as shared memory references, code is inserted to transfer control to the simulator in order to process the event and reschedule future activities. An example of code augmentation is shown in Fig. 5.6 (Section 5.7). The augmented assembly code is assembled and linked with the simulator to obtain the executable code. (Augmentation is done on object code in other simulators such as FAST [25] and MINT [43].) 95 Target Sourci Pgm Assembly Code Augmented Assembly Code Executable C Source File Macro Expansion Linkage Compile Into Assembly' Code Augmentatior Simulator Network Module Cache Module Augmented C Lib Augmented Math Lib Figure 5.2: Compilation Steps of Execution-Driven Simulator 5.4 R elated Work In this section we present two examples of execution-driven simulators, Stanford Tango and MIT Proteus, and briefly discuss their advantages and disadvantages rel ative to Trojan.1 Both simulators are based on the execution-driven model, which greatly reduces simulation time by directly executing each target instruction when ever possible without costly instruction interpretation. Tango supports trace-driven simulation as well as execution-driven simulation. In an execution-driven mode, the Tango simulator uses the Unix “fork” to create child processes. Though this scheme makes simulating process-based applications relatively easy, the overhead of context switching tends to dominate the simulation time because executing one tends to consume thousands of cycles. 
Frequent con text switching between processes is necessary in order to accurately interleave the execution of events. To reduce this overhead, Tango allows the user to select the regions of memory to simulate in detail: all memory, only shared memory, or only synchronization operations. 'W e chose Tango as an example rather than more recent Tango-Lite though the latter is more efficient, since Tango supports process-based applications while Tango-Lite is basically the same as Proteus in a sense that both of them use a threads package. 96 Process l ’s Process n ’s address space address space A single address space area reserved for shared memory Thread 1 int a; private globalx global variabh ini a; variables int a; Thread n int b; • • • (a) Process model (b) Thread model Figure 5.3: Two models of shared memory applications Whereas Tango runs in multi address spaces, Proteus, on the other hand, runs in a single address space, making it two orders of magnitude faster than Tango. Moreover, Proteus can be easily configured to simulate a wide range of architectures such as a bus-based or k-ary n-cube networks. Proteus also provides detailed hop- by-hop network simulation as well as analytical modeling using Agarwal’s network model [2]. Two limitations of Proteus are that process-based applications cannot be simulated and that applications have to be changed to represent shared memory references. Compared with Tango and Proteus, Trojan has some advantages: first, it sup ports efficiently the two basic execution models with a threads package. Second, Trojan provides functionality for virtual memory simulation whereas Tango and Proteus do not. A disadvantage of Trojan relative to Tango is that Trojan does not support trace-driven simulation. Compared with Proteus, Trojan has the advantage of not requiring any modification to applications. 5.5 Sim ulation of Two Shared M em ory M odels The semantics of parallel shared-memory programs are represented by two dif ferent execution models and three different scoping levels. This section explains in detail the scoping levels and discusses the semantics of the two execution models. We also show how Trojan executes efficiently these two models. 97 5.5.1 Process Model and Thread Model Figure 5.3 shows the two models of shared memory programs. In the process model (Fig. 3.a) each process has its own address space and part of it is reserved for shared memory. Only the reserved part of the address space can be used for shared memory, while the rest is considered private to the process. The main reason for having distinct processes is to provide better protection and a larger address space to each child. The SPLASH [59] benchmark suite is an example of applications written using the process model. As the amount of data sharing and the number of cooperating parallel activities increases, the overhead of creating child processes becomes too time consuming. To deal with this situation, a lightweight process called a thread is used instead. Like processes, threads provide a separate control path. Unlike processes, creating a new thread incurs much less overhead than creating a new process. The SPLASH2 [60] benchmarks were written using the thread-model. While conventional programs have only two kinds of scoping levels (local and global), parallel shared-memory programs have three scoping levels: local, global, and private global. 
Local variables are visible only to their own functions, while global variables are visible to the entire program (i.e., to all processes/threads). Private global variables are visible only to a single process/thread, but not to the others. For example, in Fig. 5.3(a), which represents the process model, variable a is a private global variable that is allocated to a different physical location in each process, even though all processes refer to it using the same virtual address. Hence, modifications to a made by one process are not reflected in the other processes' address spaces. Because of this extra scoping level supported by parallel programs, a new mechanism is needed to represent variables in this class. Since most programming languages do not have a way of expressing the semantics of this new scope, ad hoc methods have been used in the major benchmark suites such as SPLASH and SPLASH2.

The main differences between the process and thread models in scoping visibility are summarized by the following table, which also shows the natural implementation of these scopes in a uniprocessor execution model.

  Uniprocessor    Process model     Thread model
  Local           Local             Local
  Global          Private-global    Global
  --              Global            Private-global

As the table shows, global variables in conventional programming play the role of private global variables in the process model and of global variables in the thread model. Thus, in order to accurately simulate parallel programs in a uniprocessor environment, we need an extra scope as a way of representing global variables in the process model and private global variables in the thread model. Global variables (i.e., shared memory) in the process model are represented by global_malloc() instead of malloc(), and are allocated only from the reserved part of the address space. For private global variables in the thread model there is no easy way to express this, and several awkward methods have been used in the past. One way is to allocate a chunk of memory for these variables and pass it to each function in the thread as an extra parameter [50]. Thus, although the thread model has the advantage of making communication simple and efficient, it has the disadvantage that it tends to overly complicate writing a program.

  Simulator       Supporting applications    Simulation         Comments
  Tango           Process model              Unix processes     High context-switch overhead
  Proteus, FAST   Thread model               Threads package    No support for process model
  Trojan          Process & thread model     Threads package    Copy-on-write

Table 5.1: Examples of execution-driven simulators and programming models

5.5.2 Simulation of the Two Models

Most simulators are developed with a specific execution model in mind and thus support only a single programming model. This approach, however, has two problems. First, when we need to evaluate the performance of benchmarks written in the two models (as each model has its own advantages and disadvantages), we cannot use a single simulator for both models. The fact that a simulator is needed for each model causes many inconveniences and is expensive.

[Figure 5.4: Simulation of the process model by a threads package using the copy-on-write scheme -- children share the simulator's read-only pages and receive a private copy of a page (here, page 2 of child B) only when they write to it.]

The second problem is the high overhead we have to pay when we simulate processes/threads using Unix processes. When a child process is created, the whole memory image of the parent process is copied into the child's address space.
A naive implementation of this copy ing scheme requires huge amount of extra memory and consumes simulation time. To this we have to add the high overhead of context switching which is incurred in interleaving the execution of the simulated parallel processes. Context switching requires thousands of cycles, making this overhead the dominant factor of simulation time. Table 5.1 summarizes the programming model that some of execution-driven simulators support. To deal with this problem, Trojan uses a threads package to simulate programs representing both the process and the thread model. The fact th at all target pro cesses run in a single address space solves the problem of heavy context switching, but creates a new problem: how to enforce the semantics of the process model. That is, how to make updating private global variables visible only to a single process. One solution is that each thread (simulating a target process) makes a whole copy of the parent process. However, this scheme requires huge amount of memory and copying time. To get around this problem, Trojan uses a copy-on-write scheme [50]. Under a copy-on-write policy a page is copied only when the child process tries to write into 100 it for the first time. As long as the child makes only read references to the page, the operation is satisfied using the original page. However, if the child attem pts to write into the page, the simulator creates an independent copy. Subsequent references to this page are done on the copied page. Figure 5.4 illustrates how Trojan’ s thread package simulates a process model using copy-on-write. In Fig. 5.4, child A makes read references to page 0, 1, and 2, while child B makes read references to page 0, no reference to page 1, and read-write references to page 2. In this example, no independent copies of the pages exist for child A, but when child B attempts to write into page 2, Trojan copies this page into a new page (represented by page 9). Copy-on-write has several advantages over copying the whole process. First, some pages are read-only, so there is no need to copy them. Second, some pages may never be referenced, so they do not have to be copied. In this way, only those pages that the child process actually writes on have to be copied, resulting in less overhead. 5.6 V irtual M em ory Sim ulation Given that recent commercial MPPs such as KSR and Intel Paragon use virtual memory and NOWs (networks of workstations) are now becoming a credible alter native to MPPs, assessing the impact of virtual memory as represented by TLB misses and page faults in the overall performance is critical. Thus, in order to ac count of these important effects, simulators need to offer some support for virtual memory simulation. For example, without virtual memory, cache simulation is done using the same address generated by the compiler. In many real machines, cache addressing is done using the physical address (unless the cache is virtual). Hence to improve overall simulation accuracy, virtual memory simulation is needed. In this section, we discuss the assumptions we make about the simulated environment in order to support virtual memory and then show how Trojan is able to do this efficiently. 
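Before turning to the virtual-memory mechanisms, the copy-on-write behavior described in Section 5.5.2 can be made concrete with a small sketch. The data structures and names below (page_map_t, cow_write_fault, N_PRIVATE_PAGES) are hypothetical; the sketch only mirrors the policy just described: reads keep using the parent's page, and a page is copied the first time a simulated child writes to it.

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE        4096
#define N_PRIVATE_PAGES  1024          /* illustrative table size */

/* One entry per (child, private-global page). */
typedef struct {
    void *frame;        /* page this child currently reads and writes      */
    int   copied;       /* 1 once the child owns an independent copy       */
} page_map_t;

/* Invoked by the threads package on the first write by child 'c' to
 * private-global page 'p'; 'parent_page' is the original shared frame. */
void *cow_write_fault(page_map_t map[][N_PRIVATE_PAGES],
                      int c, int p, void *parent_page)
{
    page_map_t *e = &map[c][p];
    if (!e->copied) {
        void *copy = malloc(PAGE_SIZE);
        memcpy(copy, parent_page, PAGE_SIZE);  /* copy only on first write */
        e->frame  = copy;
        e->copied = 1;
    }
    return e->frame;    /* later references go to the private copy */
}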
[Figure 5.5: Virtual memory simulation -- (a) the shared memory assumption: a reserved region of each address space is used for shared memory; (b) addressing in the memory hierarchy for virtual memory simulation (CPU, TLB, cache, memory module, and directory).]

5.6.1 Virtual Memory Environment

Virtual memory can be implemented in several ways: paging, segmentation, and segmentation with paging. For generality and simplicity, Trojan uses paging as implemented by the Mach operating system [50]. As Fig. 5.5(a) shows, we assume that part of the address space is reserved for shared memory and that only this area is used for that purpose, even though in the thread model the whole address space can be shared. The main reason for doing this is to simplify the allocation of the cache coherency directory space. Trojan simulates a NUMA machine in which physical shared memory is distributed among the nodes and a distributed directory is used to maintain cache coherency. If the whole address space were allocated for shared memory, directory handling would be much more complicated and expensive, because it is very difficult to know whether a memory reference targets shared memory or private memory. We would therefore have to allocate a directory large enough to cover all possible memory references, and a huge amount of memory would be needed for the directory itself, even though no directory is needed for private variables.

5.6.2 Virtual Memory Simulation

With virtual memory, the CPU produces virtual addresses that are translated by the TLB into physical addresses to access main memory. The main difficulty in simulating virtual memory is how to deal with the physical addresses and the corresponding need for huge amounts of memory. Shared memory simulation without virtual memory does not cause any special problems, since the address comes from the reserved, contiguous area for which there exists a corresponding directory. When simulating virtual memory, however, the address given to the memory module is a physical address, and since the physical address can in principle be any address, the directory has to be large enough to cover the whole physical space.

One solution is to limit the virtual-to-physical mapping to a specific contiguous address range. This method, however, does not reflect real virtual-to-physical mappings. In addition, directory handling is still expensive, since there is no one-to-one correspondence between pages and frames; that is, a directory is allocated for a frame, not for a page, and a frame can be used for several different pages. Trojan deals with this problem by using different addressing modes at different levels of the memory hierarchy, as shown in Fig. 5.5(b). Trojan addresses the TLB, the page tables, and the cache in the same way as is done in real machines. In the memory module, however, the virtual address is used to access a memory location and its corresponding directory. Although this scheme simulates virtual memory only in a partial way, it achieves the same effect as real virtual memory execution without requiring additional memory, because all the performance data for virtual memory can be obtained from the behavior of the TLB, the page tables, and the cache.

Most of the large memory requirement for virtual memory has to do with the amount of space needed for the page tables; e.g., a 32-bit address space with a 4K page size needs 1 million page table entries for each process.
This memory requirement is a significant limitation in the simulation of virtual memory. Trojan addresses this problem by splitting the page table into two parts: one for shared memory and another for private memory. For shared memory, a page table is statically allocated and shared among the processes/threads. Since a fixed part of the address space is allocated for shared memory, this page table can be easily calculated and allocated; for 16M of shared memory with a 4K page size, only 4,096 page table entries are needed. For private memory, a page table is allocated as needed using a simple open hash function. With B buckets, we use a hash function h such that for each page p, h(p) is the bucket to which p belongs. The elements on the i-th list are the page table entries that belong to bucket i, and they are allocated dynamically at simulation time. Using this scheme, the additional memory requirement for the page table was found to account for less than 5 percent of the total memory requirement.

Currently, Trojan assigns shared pages to nodes in a round-robin fashion, and private pages reside on their corresponding nodes. On a page fault, Trojan selects a victim page using the Clock algorithm [50]. This round-robin page allocation and Clock page replacement can easily be replaced with other policies, since Trojan is designed with a modular structure. The virtual memory simulation has been used for the simulation of networks of workstations (NOWs) working as a parallel machine.

5.7 Additional Functionality in Trojan

So far we have discussed the issues of simulating the two different execution models of shared memory programs using a threads package, and the simulation of virtual memory. Another relevant issue is the requirement that applications be annotated in order to simulate them correctly. Ideally, when simulating a parallel program, the source code should be compiled and optimized in its original form, as it would be on a shared memory multiprocessor. However, some simulators, like Proteus, require making changes to the source code. These changes can be burdensome for a complex application and can cause accuracy problems. This section discusses the issues involved in annotating applications and in the simulation of private memory. We also discuss other useful features Trojan provides to the user.

5.7.1 Application Annotations

Proteus is a good example of a simulator that requires adding annotations to the source code in order to simulate it. It requires the use of special operators when accessing shared memory variables. These new operators are replaced by the preprocessor with specific simulation calls. This modified code is then compiled into assembly code and augmented for timing purposes at the assembly level. Because each shared reference is replaced with a procedure call, the object code resulting from compiler optimizations is substantially changed from its original form. This can cause a substantial accuracy problem in the instruction timing and cache behavior.

Original assembly code:
    add   %l3,%l4,%o1
    ld    [%i2+%lo(_A)],%o3

After augmentation:
    add   %l3,%l4,%o1
    ! augmentation of ``ld [%i2+%lo(_A)],%o3''
    add   %i2,%lo(_A),%l0
    sethi %hi(_mem_address_),%l1
    st    %l0,[%l1 + %lo(_mem_address_)]
    mov   2,%l0
    sethi %hi(_mem_type_),%l1
    st    %l0,[%l1 + %lo(_mem_type_)]
    call  _mem_issue,0
    nop
    sethi %hi(_mem_return_addr_),%l0
    ld    [%l0 + %lo(_mem_return_addr_)],%l0
    ld    [%l0],%o3

Figure 5.6: An example of code augmentation
To deal with this problem, Trojan augments memory reference instructions with a special simulation call at the assembly-code level. After compiling the original application into the assembly code, the Trojan’s augmentation program inserts calls for memory-related instructions 2. Proteus uses the low level augmentation just for timing, but Trojan uses the augmentation also for memory references as well as for timing. Figure 5.6 shows an example of the augmentation for a memory reference, as it is done on a Sparc architecture platform. In this example, the first instruction (add) does not access the memory, so it is directly executed on the host machine without 2This augmentation causes code size to increase by about 3 to 10 times, but this code size increase does not affect the simulation time since most of the time is spent on the simulator’s modules such as the cache and network. 105 having to call the simulator. Since the second instruction (Id) is a memory load, Trojan inserts extra instructions, which ends with an explicit call to the simulator (memJssue). After receiving control, the simulator does some housekeeping work such as scheduling the next event; eventually, the control will return to the next instruction in the original program (nop). Some simulators do not provide a way to simulate private memory references. That is, a private memory reference is assumed to always hit in the cache. Although the effects of these references are not explicitly visible to the other nodes, they indirectly affect the behavior of other nodes through the cache coherency protocol. For example, a private reference can displace a cache line used by a previous shared memory reference. If private variables are not simulated, there is no way to assess the effects of conflicting cache misses between private and shared memory references. In Trojan, all memory references are augmented and this enables Trojan to simulate private memory references as well as shared memory, resulting in more accurate simulation. Finally, if the user wants, there is an option in Trojan to ignore the simulation of private variables at simulation time. 5.7.2 Extra Functionality Trojan is based on MIT Proteus, exploits its good features and increases its usability, accuracy, and flexibility by solving some of its original limitations. In addition to the functionality described above, Trojan: • Supports relaxed memory consistency and sequential memory consistency. Proteus only supports sequential consistency. • Improves simulation time by using a sleep-wakeup mechanism on cache misses and page faults. • Supports spin and queue-based locks. On synchronization operations such as locks and barriers, Proteus supports only spin locks. Trojan adds flexibility by supporting also queue-based operations like those supported by DASH [18]. • Supports a full map (i.e., DirnN B ) [18] and a LimitLESS cache coherence protocol [38]. Proteus only supports the LimitLESS protocol. 106 Program D escription Input D ata Set Size (lines) Cholesky Sparse matrix cholesky factorization bcsstkl4 1888 FFT ID fast Fourier transformation 2**18 data points 1060 LU LU decomposition 256 x 256 600 Table 5.2: Application Characteristics • Simulates the effect of latency tolerating techniques such as prefetching. 5.8 Perform ance In this section, we discuss the performance of Trojan with and without virtual memory. Table 5.2 shows the programs and input data sets used in the experiments. 
The programs were taken from the Stanford SPLASH (Cholesky) and SPLASH2 (LU and FFT) suites. All memory references (i.e., private memory references as well as shared memory references) were simulated, and queue-based operations were used for synchronization. To calculate network delays, Agarwal's network model [2] was used instead of simulating contention at each hop; the arrival time at the target node is computed using the analytical model. Doing this significantly decreases simulation time without impacting accuracy.

5.8.1 Performance Without Virtual Memory

Trojan is based on MIT Proteus, which is two orders of magnitude faster than Tango. Thus the performance of Trojan is basically the same as that of Proteus, except for the extra features we added. Our original intent was to compare the performance of Trojan with that of Proteus. Unfortunately, this turned out to be very difficult, since Proteus requires applications to be changed, as discussed in Section 5.4. Hence we only show Trojan's performance.

Table 5.3 shows the performance results for the benchmark applications without virtual memory support. The table shows the execution time (of the simulated machine), the simulation time, the number of target cycles simulated per second (rate), and the slowdown (the number of simulation cycles needed to execute a single simulated cycle of a single process/thread). Slowdown is calculated as simulation time divided by execution time (both in seconds, not cycles), assuming a 33 MHz clock. The number of processors varies from 1 to 64. The execution time is given to allow comparison with the results for virtual memory.

  Program    Procs  Exec. time (1000 cyc)  Sim. time (sec)  Rate (cyc/sec)  Slowdown (host/target)
  Cholesky   1      58,843                 765              76,880          439
  Cholesky   4      22,325                 860              103,806         321
  Cholesky   16     10,673                 1,444            118,229         282
  Cholesky   64     9,115                  1,406            414,781         80
  LU         1      172,082                1,682            102,270         326
  LU         4      91,680                 1,667            219,951         152
  LU         16     34,389                 1,696            324,415         103
  LU         64     17,062                 1,911            571,184         58
  FFT        1      61,587                 825              74,571          451
  FFT        4      21,112                 844              99,944          335
  FFT        16     6,870                  954              115,118         331
  FFT        64     2,992                  1,066            179,552         277

Table 5.3: Performance of Trojan without virtual memory

Several interesting observations can be made from Table 5.3. First, the slowdown factor varies from one program to another (although the variation tends to be small), from a factor of 50 to 450. This difference is a direct result of the different frequencies at which the applications interact with the simulator. Although in Trojan most instructions are directly executed and context switching is fast (10 to 20 cycles), the slowdowns in most cases are larger than 200, resulting from several extra overheads: the scheduling mechanism within the simulator, the simulation of the network, cache and memory modules, and the gathering of statistics.

Another interesting observation is that as the number of processors is increased, the slowdown decreases. This is due to the fact that the same input data is used for all simulations of each benchmark, so as the number of processors is increased, a smaller piece of the data set is allocated to each processor, which increases data locality. Increased data locality results in less need for network and memory simulation.
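The queue-based synchronization mentioned above (and discussed further in the next paragraph) avoids spinning by descheduling a waiting processor until the lock is released. The sketch below is only illustrative: the waiter queue is simplified, the simulator services sleep_processor() and wakeup_processor() are assumed to exist, and each operation is assumed to run atomically within a single simulator event.

#include <stdlib.h>

extern void sleep_processor(int proc);    /* assumed simulator services */
extern void wakeup_processor(int proc);

typedef struct waiter {
    int            proc;
    struct waiter *next;
} waiter_t;

typedef struct {
    int       held;
    waiter_t *head, *tail;
} qlock_t;

/* Runs as a single simulator event, so no additional locking is needed. */
void qlock_acquire(qlock_t *l, int proc)
{
    if (l->held) {
        waiter_t *w = malloc(sizeof(waiter_t));
        w->proc = proc;
        w->next = 0;
        if (l->tail) l->tail->next = w; else l->head = w;
        l->tail = w;
        sleep_processor(proc);            /* no cycles are spent spinning */
    } else {
        l->held = 1;
    }
}

void qlock_release(qlock_t *l)
{
    waiter_t *w = l->head;
    if (w) {                              /* hand the lock to the next waiter */
        l->head = w->next;
        if (!l->head) l->tail = 0;
        wakeup_processor(w->proc);
        free(w);
    } else {
        l->held = 0;
    }
}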
That is, when a processor issues a lock that another processor al ready holds, the processor is put to sleep and is waken up when it the lock becomes available. Thus the simulation cost for synchronization is amortized over a large number of processors which gives lower slowdown factors. Finally, the results show that the performance for process-based benchmarks is not much different from that for thread-based. The average slowdowns for Cholesky (process-based), LU (thread-based) and FFT (thread-based) are 280, 160 and 349, respectively. If the Cholesky program would had been simulated using processes in host, the slowdown would have been much worse mainly due to the high overhead of context switching. This shows that a copy-on-write scheme used to emulate process- based applications performs very well. To summarize, first, the average slowdown for the three benchmarks is around 250, which allows Trojan to simulate large data sets. Second, as the number of processors is increased, the slowdown goes down, which is good for simulating larger number of processors. Third, Trojan can simulate the two execution models effi ciently without any additional performance loss. 5.8.2 Performance with Virtual Memory Table 5.4 summarizes the performance of Trojan with virtual memory enabled. The table confirms most of the observations discussed above when virtual memory was ignored. However, the results show certain patterns that affect only virtual memory. First, the average slowdown without virtual memory for the three bench marks was 262. W ith virtual memory the average slowdown is 271. As without virtual memory, this slowdown decreases as the number of processors increases. Second, in most cases, the programs took longer (in execution time) with virtual memory, mainly due to TLB misses. In a couple of cases, however, programs took less time with virtual memory. For example, LU program for 64 processors took 17,062K cycles without virtual memory. With virtual memory, it took 11,782K 109 P ro g ra m P ro cs Exec, tim e (1000 eye) Sim . tim e (sec) R a te (cyc/sec) Slow dow n (host/target) Cholesky 1 59,047 859 68,711 485 Cholesky 4 37,957 1,428 106,269 321 Cholesky 16 21,847 2,618 133,519 259 Cholesky 64 10,633 2,183 311,700 113 LU 1 172,576 2,240 77,021 434 LU 4 89,247 1,869 190,949 175 LU 16 29,169 1,713 272,290 123 LU 64 11,782 1,787 421,813 84 FFT 1 66,933 917 72,923 463 FFT 4 24,328 967 100,537 336 FFT 16 9,592 1,099 139,645 254 FFT 64 3,761 1,216 197,895 211 Table 5.4: Performance of Trojan with Virtual Memory cycles, a 30% reduction. The main reason for this is due to the physical addressing of the cache. With physical addressing, cache behavior can be different from that of virtual addressing. This shows that virtual memory simulation is important when evaluating cache behavior. Additional memory requirement for virtual memory was less than 5% of the to tal memory required. Specifically, without virtual memory, the amount of memory consumed by Cholesky using 64 processors was 44,876K bytes. When virtual mem ory was enabled, the total memory increased to 45,092K bytes. This additional 216K bytes were used for page table and TLB management, plus some other data structures. 5.9 C hapter Summary Trojan is based on MIT Proteus and exploits its advantages: it delivers high accuracy through execution-driven simulation and high speed by running in a single address space and by directly executing each target instruction whenever possible. 
Trojan increases usability and accuracy by a3ding extra functionality. It supports process- model based benchmarks using a threads package and does not require applications to be modified in order to be simulated. In addition, Trojan simulates all memory 110 references and exhibits good speed by using sleep-wakeup mechanism for cache misses (and page faults). Trojan extends the flexibility of Proteus by allowing users to select various options such as relaxed memory consistency and sequential consistency. Finally, as virtual memory becomes more commonplace in parallel machines, Trojan supports it as an option with an additional 10% time and 5% memory overhead. We have used Trojan to conduct many experiments on cache behavior, network traffic patterns, and application performance. Trojan has also been used for the validation of analytical machine models and to assess the performance of novel ar chitectural organizations such as the use of networks of workstations working as a parallel machine. Specifically, without virtual memory, it would not be possible to simulate networks of workstations (NOW) accurately. There is a room for future improvement within Trojan. Though Trojan provides virtual memory simulation, it does not support multiprogramming (and scheduling). To reflect realistic behavior of applications in NOWs, the simulation of multipro gramming is needed. Future version of Trojan will provide some form of multipro gramming. Ill C hapter 6 C onclusions 6.1 Sum m ary Recent advances in a computer architecture continue to add pressure to program mers and compilers to find new ways of transforming and optimizing programs in order to make efficient use of hardware resources. Achieving this goal on parallel machines, however, is substantially more difficult than it was on scalar or vector uniprocessors. Some reasons for this are: 1) the performance space exhibited by multiprocessors is orders of magnitude larger than that exhibited by uniprocessors and vector machines; 2) the existence of flexible hardware mechanisms for the sup port of parallel programming, like shared-memory, message-passing, relaxed memory models [22], etc, require that programs and compilers make correct decisions about when and how to use these mechanisms; and 3) the interactions between nodes, either in the form of node to node communication or global synchronization, makes it almost impossible to characterize performance solely as a function of the local activity of a node. To be short, it is difficult for compilers to predict the optimal values of per formance parameters at compile time for parallel machines. Furthermore, these optimal values may change as the program executes. We have addressed this ques tion by proposing adaptive execution and applying it to software prefetching (for adaptive program execution) and the granularity of sharing (for adaptive control execution). AE solves the problem by making the program execution adapt in re sponse to changes in machine conditions. The two main ideas behind adaptive program execution which set it apart from static and dynamic schemes are: (1) that 112 the determination at runtime of parameters having an effect in performance should be made based not only on program information, but on machine performance in formation as well; (2) that in order to maintain a high level of optimality, some of these parameters have to be re-evaluated periodically based on measurements taken by a performance monitor. 
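A minimal sketch of how these two ideas combine in the prefetching case is given below: a monitoring handler runs every SAMPLE_PERIOD iterations, reads the measured remote latency, and recomputes the prefetch distance. The counter interface, the distance rule (latency divided by the work per iteration), and all names are illustrative assumptions rather than the exact algorithm evaluated earlier in the dissertation.

#define SAMPLE_PERIOD 1024        /* iterations between re-evaluations */

extern long read_avg_remote_latency(void);  /* assumed performance-monitor hook */
extern void prefetch(void *addr);           /* assumed non-binding prefetch      */

static int prefetch_distance = 4;           /* iterations ahead to prefetch      */

/* Re-derive the prefetch distance from the latency observed at run time:
 * far enough ahead that data arrives before it is used, but no farther.
 * 'cycles_per_iteration' (> 0) is the estimated work per loop iteration. */
void adapt_prefetch_distance(long cycles_per_iteration)
{
    long lat = read_avg_remote_latency();
    prefetch_distance = (int)((lat + cycles_per_iteration - 1) / cycles_per_iteration);
    if (prefetch_distance < 1)
        prefetch_distance = 1;
}

void scaled_copy(double *a, double *b, int n, long cycles_per_iteration)
{
    int i;
    for (i = 0; i < n; i++) {
        if (i % SAMPLE_PERIOD == 0)
            adapt_prefetch_distance(cycles_per_iteration);  /* periodic re-evaluation */
        if (i + prefetch_distance < n)
            prefetch(&b[i + prefetch_distance]);
        a[i] = 2.0 * b[i];                                  /* the actual work */
    }
}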
We have argued that the efficiency of certain optimizations, especially those dealing with memory latency, network contention, and remote communication, can benefit from such adaptive schemes.

The effectiveness of software prefetching for tolerating latency depends mainly on the ability of programmers and/or compilers to: (1) predict in advance the magnitude of the runtime remote memory latency, and (2) insert prefetches at a distance that minimizes stall time without causing cache pollution. Scalable heterogeneous multiprocessors, such as networks of workstations (NOWs), present special challenges to static software prefetching algorithms because on these systems the network topology and node configuration are not completely determined at compile time. Furthermore, individual nodes are expected to experience radically different remote delays, so a single prefetching scheme cannot be expected to perform well on all of them. This dissertation presents an adaptive scheme for software prefetching that makes it possible for individual nodes to dynamically change not only the amount of prefetching, but the prefetch distance as well. Doing this makes it possible to tailor the execution to the individual conditions affecting each node. We show how simple performance data collected by hardware monitors makes it possible for programs to monitor and modify their prefetching policies. Simulation results show that on most programs adaptive prefetching is capable of improving performance over static prefetching by 10% to 45%. More importantly, as machine conditions become less predictable, the benefits of adaptive prefetching over static policies increase.

To improve performance on the node controller of parallel machines, another adaptive scheme, called adaptive granularity (AG), is presented in this dissertation. Providing only one or two fixed-size granularities to the user may not use resources efficiently, mainly due to false sharing (for a coarse granularity) or reduced spatial locality (for a fine granularity). Providing an arbitrarily variable-size granularity increases hardware and/or software overheads and requires sacrificing the programmability of the shared-memory model. That is, there is a tradeoff between the support of a variable-size granularity and programmability. The main reason for this tradeoff is that supporting an arbitrarily variable-size granularity makes an efficient implementation difficult without some information about the granularity from the user. Adaptive granularity resolves this tradeoff by supporting a limited number of granularities, which is nevertheless sufficient to achieve the performance gains of bulk transfer without sacrificing programmability or requiring any extra hardware. The performance parameter (i.e., the granularity in this problem) is adjusted at runtime depending on the reference behavior of the application. Simulation results show that AG changes most remote requests into local requests, which in turn reduces the amount of network traffic significantly and improves performance by up to 43% over a hardware implementation of DSM (e.g., DASH). Compared with equivalent architectures with fixed granularity (e.g., Typhoon), AG reduces execution time by up to 35%.
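To make the idea of runtime granularity adjustment concrete, the sketch below shows one way a home node might pick the transfer size for a block from its observed reference behavior. This is a deliberately simplified C example, not the protocol evaluated in this dissertation: the granularity bounds, the streak and ping-pong thresholds, and the doubling/halving rule are all assumptions chosen only to illustrate growing the grain when one node keeps touching a region and falling back toward a single cache line when requesters alternate (a symptom of false sharing).

```c
#include <stdio.h>

/* Illustrative granularity bounds, in cache lines (assumed values).   */
enum { GRAN_MIN_LINES = 1, GRAN_MAX_LINES = 16 };

/* Per-block bookkeeping a home node might keep.                        */
typedef struct {
    int      granularity;      /* current grain, in cache lines         */
    int      last_requester;   /* node id of the previous requester     */
    unsigned same_node_streak; /* consecutive requests from that node   */
    unsigned pingpongs;        /* requester changes since last shrink   */
} block_state_t;

/* Pick the granularity for the next transfer of this block: grow it
 * when the same node keeps requesting the region, shrink it back when
 * requesters alternate and suggest false sharing at coarse grain.      */
static int choose_granularity(block_state_t *b, int requester)
{
    if (requester == b->last_requester) {
        if (++b->same_node_streak >= 4 && b->granularity < GRAN_MAX_LINES) {
            b->granularity *= 2;              /* reward spatial locality */
            b->same_node_streak = 0;
        }
    } else {
        b->same_node_streak = 0;
        if (++b->pingpongs >= 2 && b->granularity > GRAN_MIN_LINES) {
            b->granularity /= 2;              /* back off on sharing     */
            b->pingpongs = 0;
        }
    }
    b->last_requester = requester;
    return b->granularity;
}

int main(void)
{
    block_state_t b = { GRAN_MIN_LINES, -1, 0, 0 };
    int stream[] = { 3, 3, 3, 3, 3, 3, 3, 3, 7, 3, 7, 3 };
    for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("request from node %d -> transfer %d line(s)\n",
               stream[i], choose_granularity(&b, stream[i]));
    return 0;
}
```

Running the example shows the grain growing while node 3 has the block to itself and collapsing back to one cache line once nodes 3 and 7 start alternating, which is the qualitative behavior the adaptive granularity scheme relies on.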
6.2 Future Work

The goal of adaptive execution is to guarantee performance in the environment of parallel machines. In this section we briefly discuss how our research can be extended to achieve this goal.

• Implement the adaptive prefetching algorithm in the compiler. In our adaptive prefetching simulations, we inserted prefetch instructions into the source code manually and transformed static prefetching into adaptive prefetching by hand. To measure more detailed performance effects of the program transformation, such as register spilling, we need to implement the adaptive prefetching algorithm in the compiler.

• Evaluate the combined effect of AG and prefetching. The replication and bulk transfer achieved by adaptive granularity reduce memory latency, but do not decrease the cache miss ratio. To reduce the cache miss ratio, prefetching can be used to bring data into the processor cache. Thus, it is interesting to study combinations of the two approaches, such as bulk transfer together with prefetching to hide the local miss latency.

• Improve the performance of adaptive granularity with compiler support. The current implementation of adaptive granularity does not require any support from the compiler, except that the data type (scalar or array) is needed when memory is allocated. With compiler support for adaptive granularity, performance can be improved further. In the current implementation, for example, merging can start only when the granularity reaches one cache line (the minimum granularity). With compiler support, when the local node no longer needs the data (e.g., at the end of a loop), the compiler can insert code for check-out operations so that the local node returns the data to the home node. With this check-out scheme, spatial locality and latency can be improved.

• Develop an adaptive granularity protocol for release consistency. In adaptive granularity we assumed that sequential memory consistency is enforced in hardware. To hide the latency of write misses, a more relaxed memory consistency model is more effective. Our adaptive granularity protocol needs to be extended to support such a memory consistency model.

• Simulate handler overheads in detail. In the simulation of adaptive granularity, all hardware contention is simulated. For the handler overheads, however, we used fixed cycle counts, such as 14 cycles for a remote memory request on the local node. As part of our future research, we hope to quantify the accurate overheads by implementing the handlers in the Trojan simulator.

Reference List

[1] T. Eicken, D. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, 256-266, May 1992.

[2] A. Agarwal. Limits on Interconnection Network Performance. IEEE Transactions on Parallel and Distributed Systems, 2(4):398-412, October 1991.

[3] D. Kranz, K. Johnson, A. Agarwal, J. Kubiatowicz, and B. Lim. Integrating Message-Passing and Shared-Memory: Early Experience. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, 54-63, May 1993.

[4] C. Anderson and A. Karlin. Two Adaptive Hybrid Cache Coherency Protocols. In Proceedings of the 2nd International Symposium on High-Performance Computer Architecture, February 1996.

[5] Lusk, Overbeek, et al. Portable Programs for Parallel Processors. Holt, Rinehart and Winston, Inc., 1987.

[6] J. Chase, F. Amador, E. Lazowska, H. Levy, and L. Littlefield. The Amber System: Parallel Programming on a Network of Multiprocessors. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, 147-168, December 1989.
[7] J. Bennett, J. Carter, and W. Zwaenepoel. Adaptive Software Cache Management for Distributed Shared Memory Architectures. In Proceedings of the 17th Annual International Symposium on Computer Architecture, 125-134, May 1992.

[8] B. Bershad, D. Lee, T. Romer, and J. Chen. Avoiding Conflict Misses Dynamically on Large Direct-Mapped Caches. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 158-170, November 1994.

[9] W. Bolosky and M. Scott. NUMA Policies and Their Relation to Memory Architecture. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 212-221, 1991.

[10] D. Barrett and B. Zorn. Using Lifetime Predictors to Improve Memory Allocation Performance. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, 187-196, 1993.

[11] S. Chandra, J. Larus, and A. Rogers. Where is Time Spent in Message-Passing and Shared-Memory Programs? In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 61-73, November 1994.

[12] M. Hill, J. Larus, S. Reinhardt, and D. Wood. Cooperative Shared Memory: Software and Hardware for Scalable Multiprocessors. ACM SIGPLAN Notices, 27(9):262-273, September 1992.

[13] A. Cox and R. Fowler. Adaptive Cache Coherence for Detecting Migratory Shared Data. In Proceedings of the 20th Annual International Symposium on Computer Architecture, 98-108, May 1993.

[14] A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, and W. Zwaenepoel. Software Versus Hardware Shared-Memory Implementation: A Case Study. In Proceedings of the 21st Annual International Symposium on Computer Architecture, 106-117, May 1994.

[15] R. Chandra, A. Gupta, and J. Hennessy. Data Locality and Load Balancing in COOL. In Proceedings of the ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, 249-259, May 1993.

[16] CRAY T3D System Architecture Overview. Cray Research, Inc., September 1993.

[17] K. Johnson, M. Kaashoek, and D. Wallach. CRL: High-Performance All-Software Distributed Shared Memory. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, December 1995.

[18] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63-79, March 1992.

[19] D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy. The DASH Prototype: Logic Overhead and Performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41-61, January 1993.

[20] E. Hagersten, A. Landin, and S. Haridi. DDM - A Cache-Only Memory Architecture. IEEE Computer, 25(9):44-54, 1992.

[21] C. Dubnicki and T. LeBlanc. Adjustable Block Size Coherent Caches. In Proceedings of the 19th Annual International Symposium on Computer Architecture, 170-180, May 1992.

[22] M. Dubois, C. Scheurich, and F. Briggs. Synchronization, Coherence, and Event Ordering in Multiprocessors. IEEE Computer, 21(2):9-21, 1988.

[23] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1988.

[24] E. Jul, H. Levy, and N. Hutchinson. Fine-Grained Mobility in the Emerald System. ACM Transactions on Computer Systems, 6(1):109-133, February 1988.

[25] B. Boothe. Fast Accurate Simulation of Large Shared Memory Multiprocessors. Technical Report UCB/CSD-93-752, UC Berkeley, June 1993.
[26] J. Kuskin et al. The Stanford FLASH Multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, 302-311, 1994.

[27] S. Graham, P. Kessler, and M. McKusick. An Execution Profiler for Modular Programs. Software Practice and Experience, 13:671-685, October 1992.

[28] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry, and W. Weber. Comparative Evaluation of Latency Reducing and Tolerating Techniques. In Proceedings of the 18th Annual International Symposium on Computer Architecture, 43-63, May 1991.

[29] A. Gupta and W. Weber. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers, 41(7):794-810, July 1992.

[30] J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 38-50, November 1994.

[31] J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta. Integrating Multiple Communication Paradigms in High Performance Multiprocessors. Technical Report CSL-TR-94-604, Stanford University, 1994.

[32] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 1996.

[33] R. Chandra, K. Gharachorloo, V. Soundararajan, and A. Gupta. Performance Evaluation of Hybrid Hardware and Software Distributed Shared Memory Protocols. In Proceedings of the Eighth ACM International Conference on Supercomputing, 247-288, July 1994.

[34] K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.

[35] K. Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, 1993.

[36] Kendall Square Research Corporation. KSR Parallel Programming. KSR1 Documentation, February 1992.

[37] M. Lam, E. Rothberg, and M. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 63-74, April 1991.

[38] D. Chaiken, J. Kubiatowicz, and A. Agarwal. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 224-234, April 1991.

[39] B. Nitzberg and V. Lo. Distributed Shared Memory: A Survey of Issues and Algorithms. IEEE Computer, 52-60, August 1991.

[40] P. Keleher, A. Cox, and W. Zwaenepoel. Lazy Release Consistency for Software Distributed Shared Memory. In Proceedings of the 19th Annual International Symposium on Computer Architecture, 13-21, May 1992.

[41] D. Yeung, J. Kubiatowicz, and A. Agarwal. MGS: A Multigrain Shared Memory System. To appear in Proceedings of the 23rd Annual International Symposium on Computer Architecture, May 1996.

[42] B. Bershad, M. Zekauskas, and W. Sawdon. The Midway Distributed Shared Memory System. In Proceedings of the 1993 IEEE CompCon Conference, 528-537, February 1993.

[43] J. Veenstra and R. Fowler. MINT: A Front End for Efficient Simulation of Shared-Memory Multiprocessors. In Proceedings of the Second International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 201-207, January 1994.

[44] T. Mowry and A. Gupta. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors. Journal of Parallel and Distributed Computing, 12:87-106, 1991.
[45] T. Mowry, M. Lam, and A. Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 1993.

[46] T. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. Ph.D. Thesis, Stanford University, 1994.

[47] J. Bennett, J. Carter, and W. Zwaenepoel. Munin: Distributed Shared Memory Based on Type-Specific Memory Coherence. In Proceedings of the 2nd ACM Symposium on Principles and Practice of Parallel Programming, SIGPLAN Notices, 25(3):168-176, March 1990.

[48] N. Boden et al. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 29-36, February 1995.

[49] T. Anderson, D. Culler, and D. Patterson. A Case for NOW (Networks of Workstations). IEEE Micro, 54-64, February 1995.

[50] A. Tanenbaum. Modern Operating Systems. Prentice Hall, 1992.

[51] Intel SSD. Intel Paragon Supercomputers. Technical Summary, November 1994.

[52] E. Brewer, C. Dellarocas, A. Colbrook, and W. Weihl. PROTEUS: A High-Performance Parallel-Architecture Simulator. Technical Report MIT/LCS/TR-516, MIT, 1991.

[53] R. Saavedra, D. Culler, and T. Eicken. Analysis of Multithreaded Architectures for Parallel Computing. In Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, 169-177, 1990.

[54] R. Saavedra, W. Mao, and K. Hwang. Performance and Optimization of Data Prefetching Strategies in Scalable Multiprocessors. Journal of Parallel and Distributed Computing, 1994.

[55] J. Saltz, R. Mirchandaney, and K. Crowley. Run-Time Parallelization and Scheduling of Loops. IEEE Transactions on Computers, 40(5):603-612, May 1991.

[56] H. Sandhu, B. Gamsa, and S. Zhou. The Shared Regions Approach to Software Cache Coherence on Multiprocessors. In Principles and Practices of Parallel Programming, 229-238, 1993.

[57] M. Blumrich, K. Li, R. Alpert, C. Dubnicki, and E. Felten. Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer. In Proceedings of the 21st Annual International Symposium on Computer Architecture, 142-153, April 1994.

[58] A. Singhal and A. Goldberg. Architectural Support for Performance Tuning: A Case Study on the SPARCcenter 2000. In Proceedings of the 21st Annual International Symposium on Computer Architecture, 48-59, 1994.

[59] J. Singh, W. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared Memory. Technical Report CSL-TR-91-469, Stanford University, April 1991.

[60] S. Woo and J. Singh. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 43-63, June 1995.

[61] H. Davis, S. Goldschmidt, and J. Hennessy. Multiprocessor Simulation and Tracing Using Tango. In Proceedings of the International Conference on Parallel Processing, volume 2, 99-107, 1991.

[62] S. Goldschmidt. Simulation of Multiprocessors: Accuracy and Performance. Ph.D. Thesis, Stanford University, June 1993.

[63] S. Reinhardt, J. Larus, and D. Wood. Tempest and Typhoon: User-Level Shared Memory. In Proceedings of the 21st Annual International Symposium on Computer Architecture, 325-336, 1994.

[64] J. Torrellas, M. Lam, and J. Hennessy. False Sharing and Spatial Locality in Multiprocessor Caches. IEEE Transactions on Computers, 43(6):651-663, June 1994.

[65] P. Keleher, A. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. USENIX, 115-131, January 1994.
[66] D. Park and R. Saavedra. Trojan: A High-Performance Simulator for Shared-Memory Architectures. In Proceedings of the 29th Annual Simulation Symposium, April 1996.

[67] K. Wang. Precise Compile-Time Performance Prediction for Superscalar-Based Computers. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 73-84, 1994.

[68] A. Wilson Jr. and R. LaRowe Jr. Hiding Shared Memory Reference Latency on the Galactica Net Distributed Shared Memory Architecture. Journal of Parallel and Distributed Computing, 15(4):351-367, 1992.

[69] S. Wilton and N. Jouppi. An Enhanced Access and Cycle Time Model for On-Chip Caches. WRL Research Report 93/5, July 1994.

[70] M. Wolf and M. Lam. A Data Locality Optimizing Algorithm. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 30-44, 1991.

[71] S. Woo, J. Singh, and J. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 219-229, November 1994.