COST-SENSITIVE CACHE REPLACEMENT ALGORITHMS

by

Jaeheon Jeong

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2002

Copyright 2002 Jaeheon Jeong

Dedication

To my ancestors and family.

Acknowledgments

I can finally relax and remember all the wonderful people who have shared many academic years with me. Foremost, I deeply thank my advisor, Dr. Michel Dubois, for giving me invaluable opportunities and for sharing his passion and vision. Dr. Timothy Pinkston, Dr. Cyrus Shahabi, Dr. Ahmed Helmy and Dr. Monte Ung have been very generous to serve on my program committee, and I am grateful to them.

I have enjoyed the friendship and cooperation of many colleagues, especially during the RPM project. Thanks are due to Luiz Barroso, Jianwei Chen, Adrian Moga, Koray Oner, Fong Pong, Xiaogang Qiu, and Yong Ho Song. Lucille Stivers and Brendan Char were always helpful and made my school life joyful.
I also thank Dr. Russell Clapp and Dr. Ashwini Nanda for their support in continuing my thesis work and extending my industrial experience.

Needless to say, this would have been impossible without the support of my family. My parents always taught me integrity and diligence and also gave me much encouragement and their belief in me. I thank my parents-in-law and my siblings, who made my student life much easier and more fruitful. Lastly, I thank my wife, Jihyun, for her single-minded determination throughout difficult years, for patiently waiting for this moment, and for parenting our two children. I could rely on her, and she always recharged me when I was weary and disheartened. We indeed did it "together". I hope we can enjoy and share our achievement with our family, especially Julia and Seho.

Table of Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract

Chapter 1 INTRODUCTION
 1.1 Research Contributions
 1.2 Thesis Organization

Chapter 2 BACKGROUND
 2.1 Classic Cache Replacement Algorithms
  2.1.1 Random Algorithm
  2.1.2 LRU (Least Recently Used) Algorithm
  2.1.3 PLRU (Partial LRU) Algorithm
  2.1.4 Stack Algorithm
 2.2 Examples of Non-uniform Miss Costs
 2.3 Cache Organization in Modern Processors
 2.4 Miss Cost Prediction
 2.5 Evaluation Methodology

Chapter 3 COST-SENSITIVE OPTIMAL REPLACEMENT ALGORITHMS
 3.1 Optimal Replacement Algorithms w.r.t. Miss Count
  3.1.1 OPT in Uniprocessors
  3.1.2 OPT in Multiprocessors
 3.2 Cost-sensitive OPT (CSOPT) with Multiple Miss Costs
  3.2.1 Basic Implementation of CSOPT
  3.2.2 Exploiting OPT
  3.2.3 Illustration
  3.2.4 Further Pruning of the Search Tree
  3.2.5 Implementation of CSOPT
  3.2.6 Complexity Analysis
 3.3 Evaluation Methodology
  3.3.1 Trace Generation
  3.3.2 Baseline Architecture and Evaluation Approach
 3.4 Performance Evaluation
  3.4.1 Random Cost Mapping
  3.4.2 First-Touch Cost Mapping
  3.4.3 Distribution Across Cache Sets
  3.4.4 Effect of Cache Associativity
 3.5 Towards On-line Cost-sensitive Replacement Algorithms
 3.6 Summary

Chapter 4 ON-LINE COST-SENSITIVE REPLACEMENT ALGORITHMS
 4.1 Integrating Locality and Cost
 4.2 GreedyDual (GD)
 4.3 Design Issues in LRU-based Cost-sensitive Replacement Algorithms
 4.4 LRU-based Cost-sensitive Replacement Algorithms
  4.4.1 Basic Cost-sensitive LRU Algorithm (BCL)
  4.4.2 Dynamic Cost-sensitive LRU Algorithm (DCL)
  4.4.3 Adaptive Cost-sensitive LRU Algorithm (ACL)
 4.5 Implementation Considerations
 4.6 Summary

Chapter 5 IMPROVING MEMORY PERFORMANCE OF MULTIPROCESSORS
 5.1 Static Case with Two Costs
  5.1.1 Evaluation Methodology
  5.1.2 Random Cost Mapping
  5.1.3 First-Touch Cost Mapping
 5.2 Dynamic Case with Multiple Latencies
  5.2.1 Miss Cost Prediction in CC-NUMA Multiprocessors
  5.2.2 Evaluation Approach and Setup
  5.2.3 Execution Times
  5.2.4 Implementation Considerations
 5.3 Summary

Chapter 6 IMPROVING MEMORY PERFORMANCE OF ILP PROCESSORS
 6.1 Targeting Miss Penalty in ILP Processors
 6.2 Baseline Architecture
 6.3 Evaluation Methodology
 6.4 Perfect Access Type Prediction
 6.5 Instruction-based Access Type Prediction
 6.6 Static Prediction
 6.7 Dynamic Prediction
  6.7.1 Prediction Updates and History Updates
  6.7.2 Prediction Accuracy (Infinite Hardware)
  6.7.3 Dynamic-MRU ATPs
  6.7.4 One-level Dynamic-MRU ATPs (Finite Hardware)
 6.8 Injecting Finite Cost Ratios
 6.9 Summary

Chapter 7 RELATED WORK
 7.1 Targeting Miss Count
 7.2 Targeting Miss Cost
 7.3 Trace Sampling and Cache Evaluation Techniques
 7.4 Prediction Schemes

Chapter 8 CONCLUSIONS

Bibliography

Appendix A ALGORITHM OF CSOPT

List of Figures

Figure 2.1. Partial LRU in four-way set-associative cache
Figure 2.2. Examples of non-uniform miss costs
Figure 3.1. Illustration of search tree
Figure 3.2. Example of reservation with a single high-cost block
Figure 3.3. Pruning of the search tree
Figure 3.4. Probability of a new reservation (s = 8)
Figure 3.5. Relative cost savings with random cost mapping
Figure 3.6. Replacements, RV ratio, RVS ratio and average cost savings per RVS
Figure 3.7. Cache set distribution with random and first-touch cost mapping (r = 8)
Figure 3.8. Relative cost savings by various cache associativities
Figure 4.1. Algorithm for Basic Cost-sensitive LRU (BCL)
Figure 4.2. ACL automaton in each set
Figure 5.1. Relative cost savings by GD, BCL and DCL with random cost mapping
Figure 5.2. Relative cost savings by DCL and ACL with random cost mapping (%)
Figure 5.3. Reservation behavior in DCL with random cost mapping
Figure 5.4. Relative miss rate increase in DCL with random cost mapping
Figure 5.5. Relative cost savings in DCL with different cache associativities
Figure 6.1. Block diagram of the baseline architecture
Figure 6.2. Load misses in DCL with perfect access type prediction (r = inf)
Figure 6.3. Examples of access type sequences
Figure 6.4. Load miss improvement with static prediction (T = 0.99)
Figure 6.5. General structures of dynamic ATPs
Figure 6.6. Timing of access type transitions
Figure 6.7. Misprediction rate by dynamic-MRU ATPs with different numbers of sets
Figure 6.8. Load miss improvement with dynamic-MRU ATPs (infinite hardware)
Figure 6.9. Misprediction rate by one-level dynamic ATP with and without EPD
Figure 6.10. Load miss improvement by one-level dynamic ATP with 8K-entry PHT
Figure 6.11. Relative miss rate changes by DCL and ACL (16-Kbyte 4-way cache)

List of Tables

Table 3.1. The characteristics of the benchmarks
Table 3.2. Relative cost savings with first-touch and random cost mapping (%)
Table 5.1. Relative cost savings with HAF = 0.2 and HAF = 0.6
Table 5.2. Relative cost savings with first-touch data placement (%)
Table 5.3. Relative miss rate increase over LRU with first-touch data placement (%)
Table 5.4. Latency variation in protocol without replacement hints
Table 5.5. Baseline system configuration
Table 5.6. Reduction of execution time by cost-sensitive algorithms over LRU (%)
Table 5.7. Relative miss rate increase over LRU (%)
Table 6.1. The characteristics of the benchmarks
Table 6.2. Average load miss improvement rate (%) by DCL
Table 6.3. Coverage by static prediction
Table 6.4. Misprediction rate by dynamic-ALL ATPs (infinite hardware)
Table 6.5. Misprediction rate by dynamic-MRU ATPs (infinite hardware)
Table 6.6. RV rate and RVS rate by DCL and ACL (r = 2)

Abstract

Cache replacement algorithms, originally developed in the context of simple uniprocessor systems, are aimed at reducing the aggregate miss count. In modern systems, however, all cache misses are not created equal, and some are more expensive than others. The cost may be due to latency, penalty, power consumption, bandwidth consumption, or any other ad-hoc property attached to a miss. The goal is then to minimize the miss cost rather than the miss count. Since average memory access latency and penalty are more directly related to execution time, memory performance can be improved by minimizing these metrics instead of the miss count. In this context, the class of replacement algorithms that minimize a non-uniform miss cost function is called cost-sensitive replacement algorithms.

This work first presents a cost-sensitive optimal replacement algorithm (CSOPT) in the face of multiple non-uniform miss costs. CSOPT typically has more misses than OPT, trading off several "cheap" misses for one "expensive" miss, but its overall cost is lower. The various cases of non-uniform miss cost functions are explored to analyze when CSOPT is effective in reducing the aggregate miss cost. CSOPT demonstrates significant improvements of the cost function across various applications and cache configurations.

Several extensions of LRU to account for non-uniform miss costs are then proposed, based on two key ideas: blockframe reservation and cost depreciation. These LRU extensions have simple implementations, yet they are very effective in various situations. The extended algorithms are applied to two important cases where miss costs must be predicted. Using the miss latency as the cost function and a simple latency prediction scheme in the L2 cache of a multiprocessor with ILP processors, the execution times of parallel applications are improved significantly. This improvement is possible because total miss latency is more critical to performance than total miss count. In the case of targeting the penalty difference between load and store misses, the extended algorithms yield marginal but reliable cost savings.

Keywords: Cache Replacement Algorithm, Optimal Replacement Algorithm, Cache Memory, Memory Access Latency, Memory Access Penalty, Distributed Shared Memory, NUMA, Performance Evaluation
Chapter 1 INTRODUCTION

The cache replacement algorithms widely used in modern systems are aimed at reducing the aggregate miss count, and thus they implicitly assume that miss costs are uniform. However, as the memory hierarchies of modern systems have become more complex, and as other factors besides performance have become critical in recent years, this uniform cost assumption has lost its validity, especially in the context of multiprocessors and ILP (Instruction-Level Parallelism) processors. For instance, the cost of a miss mapping to a remote memory is generally higher in terms of latency, bandwidth consumption and power consumption than the cost of a miss mapping to a local memory in a multiprocessor system. Similarly, a non-critical load miss or a store miss is not as taxing on performance as a critical load miss in ILP processors. Since average memory access latency and penalty [43] are more directly related to execution time, we can expect better memory performance by minimizing these metrics instead of the miss count, although reducing the miss count is still advantageous. Generally speaking, the cost can be latency, penalty, power consumption, bandwidth consumption, or any other ad-hoc property attached to misses. The goal of replacement algorithms thus should be to minimize the aggregate target miss cost, not the miss count.

In some cases, when misses to particular regions of memory are very costly to handle, the problem can be solved by pinning blocks from these regions in the cache, which may cause deadlocks and underutilization of the cache. Replacement algorithms that account for non-uniform miss costs can elegantly avoid misses to these regions by charging a reasonably large cost to them. Thus, in many cases, it is desirable to inject the actual cost of a miss into the replacement policy, and reliable algorithms that integrate cost and locality information to make better cache replacement decisions are needed. In general, we call the class of replacement algorithms aimed at reducing the aggregate miss cost in the face of multiple miss costs cost-sensitive replacement algorithms; depending on the target cost, these may translate into latency-, penalty-, bandwidth-, or power-sensitive replacement algorithms.

In this dissertation, we develop cost-sensitive cache replacement algorithms in the context of general multiple non-uniform miss costs and analyze their performance in various cases of target cost functions.

In doing so, we first revisit the problem of designing optimum cache replacement algorithms and present a Cost-Sensitive OPTimal replacement algorithm (CSOPT) to obtain the optimal aggregate miss cost. CSOPT is an extension of OPT, the classic replacement algorithm minimizing miss count, since CSOPT and OPT are identical under a uniform miss cost function. With multiple non-uniform miss costs, however, CSOPT does not always replace the block selected by OPT if its miss cost is greater than the miss cost of other cached blocks. Rather, CSOPT considers the option of keeping the block victimized by OPT in cache until the next reference is made to it. We call this option a (block or blockframe) reservation.
While pursuing a blockframe reservation, the miss cost may temporarily increase over OPT because the number of non-reserved blockframes is temporarily reduced and the blocks replaced instead of the currently reserved block are accessed again and miss in the cache before the reserved block is accessed. However, when the reserved block is eventually accessed again, the aggregate miss cost may drop as compared to OPT by saving a large miss cost. CSOPT may explore many sequences of replacements through the reservations of different high-cost blocks until it reaches a point where it can decide on the cheaper sequence. Fortunately, the complexity of this search is low because the search tree can be pruned very quickly.

CSOPT is evaluated using a trace-driven simulator and compared to OPT. We explore the various cases of two static miss costs as well as a realistic miss cost function in a CC-NUMA multiprocessor to analyze when CSOPT is effective in reducing the aggregate miss cost. Although CSOPT (as well as OPT) requires advance knowledge of future memory accesses and is unrealizable in practical systems, it is useful as a lower bound on the achievable cost of realistic cost-sensitive replacement algorithms. It also gives guidelines for improving existing on-line cache replacement algorithms.

Using the guidelines discovered during the design of CSOPT and its experiments with various cases of miss costs, we develop realistic on-line cost-sensitive replacement algorithms by extending the LRU replacement algorithm to include cost in the replacement decisions. The concept of blockframe reservation in CSOPT is applied to explore the option of keeping the block victimized by LRU in cache. In extending LRU by pursuing blockframe reservations for high-cost blocks, however, we face the two following fundamental issues.

First, whether a reservation will eventually lead to a reduced miss cost is unknown at the time of replacement. Moreover, if the reserved block is not accessed for a long period of time, or is never accessed again, the aggregate miss cost may be greatly increased. Since a blockframe cannot be reserved forever for a block with a high-cost miss, a mechanism must exist to relinquish the reservation after some time. This mechanism is achieved by depreciating the cost of the reserved block whenever a low-cost block(1) is sacrificed in its place (a brief sketch of this idea follows below). Every potential cost savings opportunity thus needs to be pursued through aggressive reservations of high-cost blocks, but these reservations must terminate as early as possible if they are fruitless, to minimize negative cost savings, especially when the cost differential between blocks is low.

Second, in many situations, the future miss costs of cached blocks are not known at the time of replacement, and, in such cases, they must be predicted. Cost prediction is a procedure that varies on a case-by-case basis and is rather independent of the first issue.

(1) Since the cost is always associated with misses, we refer to "blocks with low (high)-cost misses" simply as "low (high)-cost blocks" throughout the rest of this dissertation.

In this dissertation, we first focus on the first fundamental issue, i.e., the effective reservation of high-cost blocks, and explore several replacement algorithms that integrate cost with locality.
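To make the reservation-with-depreciation idea concrete before the algorithms are named, here is a minimal, hypothetical sketch for a single cache set; the field names and the exact depreciation rule are illustrative assumptions, and the concrete algorithms (GreedyDual and the LRU extensions BCL, DCL and ACL) are developed in Chapter 4.

```python
# Hypothetical sketch of blockframe reservation with cost depreciation.
class Block:
    def __init__(self, tag, cost):
        self.tag = tag            # block address tag
        self.cost = cost          # predicted cost of the next miss on this block
        self.remaining = cost     # depreciated cost while the block is reserved

def pick_victim(set_blocks):
    """set_blocks is ordered from MRU to LRU for one cache set."""
    lru = set_blocks[-1]          # the block plain LRU would evict
    cheaper = [b for b in set_blocks[:-1] if b.cost < lru.remaining]
    if not cheaper:
        lru.remaining = lru.cost  # any reservation ends; evict the LRU block
        return lru
    victim = min(cheaper, key=lambda b: b.cost)   # sacrifice a low-cost block instead
    lru.remaining -= victim.cost  # depreciate the reserved block by the sacrifice
    return victim                 # once 'remaining' reaches zero, LRU takes over again
```

The depreciation guarantees that a fruitless reservation is relinquished after at most a few sacrifices, which is exactly the early-termination property argued for above.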
The first algorithm is GreedyDual (GD), a well-known existing algorithm, which is cost-centric in the sense that the cost function dominates the replacement decisions. Three new algorithms, which extend LRU with blockframe reservations and are more locality-centric, are then introduced. The first, called BCL (Basic Cost-sensitive LRU), uses a crude method to depreciate the cost of reserved blocks. The second, called DCL (Dynamic Cost-sensitive LRU), can detect the actual situations in which the reservation of a high-cost block has caused other blocks to miss, and thus can depreciate the cost of a reserved block with better accuracy. The third, called ACL (Adaptive Cost-sensitive LRU), is an extension of DCL that can turn itself on and off across time and cache sets.

We evaluate and compare these four algorithms in the simplest case, the case of two static costs, using a wide range of cost values. Then we explore the second issue in two special but important cases, multiprocessor systems and ILP processors, where the cost functions are the latencies of misses and the penalties charged by load and store misses, respectively. In the case of multiprocessor systems, the latencies can take many different values, and the cost associated with each miss of a given block varies with time and must be predicted. Similarly, predicting the penalties of memory accesses in ILP processors is equivalent to predicting the next access type (load or store) to all cache blocks in a set.

We show, in the case of multiprocessor systems, that a very simple miss latency prediction scheme is successful and that, coupled with our simple latency-sensitive replacement algorithms, it can reliably improve execution times by significant amounts. On the other hand, in the case of ILP processors targeting the penalty difference between loads and stores, it achieves very marginal improvements, mainly due to the domination of load misses and low access type prediction accuracy, even though the cost difference is embarrassingly large.

1.1 Research Contributions

This dissertation first proposes the integration of miss costs and access locality into cache replacement algorithms to reduce not the total miss count but the total miss cost, and thereby presents the design, systematic analysis and important applications of cost-sensitive cache replacement algorithms with various theoretical and practical aspects and merits. The main contributions of this research are as follows:

- A feasible optimal replacement algorithm yielding the minimum aggregate cost is developed by introducing the concept of blockframe reservation, together with schemes to reduce the branching factor and to prune hopeless paths as early as possible in the search for an optimal replacement schedule in a search tree.

- On-line cost-sensitive replacement algorithms are developed to integrate locality and cost. On top of simple implementations of blockframe reservation and cost depreciation, several incremental features utilizing dynamic feedback are proposed to further enhance their performance and reliability.

- A set of systematic evaluation approaches is proposed to broaden the design and evaluation space and to characterize the behavior of the proposed algorithms. In addition, a set of performance metrics is introduced to analyze the behavior of blockframe reservations.
- Performance evaluations are undertaken to uncover when the proposed algorithms are effective in reducing the miss cost with respect to important parameters.

- A set of practical applications is investigated, together with the development of miss cost prediction schemes.

1.2 Thesis Organization

Chapter 2 provides background material for this research, including classic replacement algorithms and their extensions, examples of non-uniform miss costs in modern systems, and architectural issues related to cache replacement algorithms.

Chapter 3 presents the design and evaluation of CSOPT. We first review OPT and its extension dealing with coherence invalidations in multiprocessors, followed by the design of CSOPT with multiple non-uniform miss costs. The evaluation methodology based on trace-driven simulations is discussed, and simulation results from SPLASH-2 traces are presented. Last, we consider the design issues of on-line cost-sensitive replacement algorithms based on our observations.

Chapter 4 presents the design of cost-sensitive replacement algorithms extended from LRU. We first consider the design issues of LRU-based cost-sensitive replacement algorithms and then describe four replacement algorithms.

Chapter 5 presents the performance evaluation of LRU-based cost-sensitive algorithms in the context of multiprocessor systems. We first explore these algorithms in the simple case of two static costs using trace-driven simulations. Then we apply our algorithms to multiprocessor systems with ILP processors to reduce the aggregate miss latency based on a simple latency prediction scheme. Execution-time results using execution-driven simulations are then presented. We also evaluate the hardware complexity of the schemes.

Chapter 6 explores the application of our algorithms in the context of uniprocessor systems made of ILP processors. We apply our algorithms to ILP processors to reduce the aggregate penalty of load and store misses, assuming that the penalty of stores is much less than the penalty of loads, and we present various schemes to predict the next access type to cache blocks.

Chapter 7 presents related work on replacement algorithms and prediction schemes. Chapter 8 presents the conclusions of this dissertation.

Chapter 2 BACKGROUND

To sustain the performance growth of modern processors, the hit rate, access time and bandwidth of caches must be improved. The common approach to improving the hit rate is to increase cache size and associativity, at the price of an increase in access time. To improve the trade-off between hit rate and access time, caches are often organized in hierarchies. The cache at each level is usually much smaller than the cache at the next level or than main memory. Because of this size difference and for fast access, congruence address mapping is used to access caches, and each congruence class or cache set consists of a number of blocks ranging practically from one to eight. When a cache set is full, one of the previously cached blocks needs to be replaced to bring a new block into the cache set. A cache replacement algorithm selects which block to replace from a cache set.
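For reference, the congruence mapping just described can be sketched as follows; the parameters, helper names and the random default policy are illustrative assumptions made here, not taken from the dissertation.

```python
import random

# Illustrative congruence (set-index) mapping: the replacement algorithm is
# consulted only when the selected cache set is full on a miss.
def access(cache_sets, addr, block_size=64, assoc=4, pick_victim=random.choice):
    num_sets = len(cache_sets)
    block_addr = addr // block_size            # strip the byte offset within the block
    index = block_addr % num_sets              # congruence class (cache set)
    tag = block_addr // num_sets
    cache_set = cache_sets[index]              # tags currently cached in this set
    if tag in cache_set:
        return "hit"
    if len(cache_set) < assoc:                 # a free blockframe is available
        cache_set.append(tag)
    else:                                      # set is full: replacement decision needed
        cache_set[cache_set.index(pick_victim(cache_set))] = tag
    return "miss"

# Example: a 128-set, 4-way cache with 64-byte blocks and random replacement.
sets = [[] for _ in range(128)]
print(access(sets, 0x1F2C0))                   # -> "miss" on a cold cache
```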
2.1 Classic Cache Replacement Algorithms

With the goal of reducing the miss count, classic cache replacement algorithms need to follow the simple principle of the OPTimal (OPT) algorithm [6][31]. OPT yields the minimum aggregate miss count by replacing the block whose next reference is farthest away among all cached blocks. Effective on-line cache replacement algorithms thus require some form of prediction of future access distance. In fact, references with short distances are relatively easy to predict, but they are mostly redundant to the victim selection, since such references exhibit a large degree of access locality. Because of the difficulty and complexity of the underlying predictions, classic cache replacement algorithms use simple heuristics to capture general memory reference behavior. Here we briefly review the widely used classic replacement algorithms for caches in modern processors.

2.1.1 Random Algorithm

The Random policy [49] selects a victim randomly. If there are invalid blocks, it first selects one of them. In an s-way set-associative cache, the probability that Random picks a correct victim each time is 1/s. It thus performs better with small associativities. The main advantage of Random is that it does not rely on the past history of memory references. Consequently, it does not require any additional storage or look-up, and the victim selection is cheap and fast.

2.1.2 LRU (Least Recently Used) Algorithm

LRU [49] statically bets that the cached blocks will be accessed in the order of their latest accesses. It maintains fairness in the sense that unused blocks cannot stay unduly long in a cache. It can perform very poorly, however, if a working set is larger than the cache size and the accesses are positioned rather evenly apart [56]. For instance, suppose blocks A, B and C are accessed in a row many times in the same set of a two-way associative cache. In this scenario, LRU performs even worse than Random. Nevertheless, LRU is the most accredited algorithm because of its reliable performance across various cache organizations and applications.

LRU must maintain the recent access order of every cached block. In an s-way set-associative cache, this information can be encoded using ceil(log2(s!)) bits.

2.1.3 PLRU (Partial LRU) Algorithm

Implementing pure LRU in highly associative caches can be extremely costly. For example, in an eight-way associative cache, 16 bits are required to encode the access order. PLRU [51][35] approximates LRU by representing the access order in a binary tree using s-1 bits. Figure 2.1 illustrates the structure of PLRU in a four-way associative cache. The access order among four blocks (L0-L3) is partially maintained in a binary tree using three bits (B0-B2). In this case, B1 and B2 tell the order between L0 and L1, and between L2 and L3, respectively. B0 tells the order of those two groups of blocks. Note that LRU and PLRU are identical in two-way associative caches.

[Figure 2.1. Partial LRU in a four-way set-associative cache: a three-bit binary tree (B0-B2) over the four blocks L0-L3.]
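As a concrete illustration of the tree-bit scheme just described, the following sketch encodes one plausible PLRU policy for a four-way set with bits B0-B2; the bit-polarity convention is an assumption made for this illustration rather than something taken from Figure 2.1.

```python
# Illustrative PLRU for one four-way set. bits[0] (B0) selects the pair that
# holds the pseudo-LRU block, bits[1] (B1) orders L0/L1, bits[2] (B2) orders
# L2/L3. Convention assumed here: a bit value of 0 means "the left side is older".
def plru_victim(bits):
    if bits[0] == 0:                     # left pair (L0, L1) holds the pseudo-LRU block
        return 0 if bits[1] == 0 else 1
    return 2 if bits[2] == 0 else 3      # otherwise it is in the right pair (L2, L3)

def plru_touch(bits, way):
    """Update the three tree bits so that 'way' is marked most recently used."""
    if way in (0, 1):
        bits[0] = 1                      # the pseudo-LRU pair is now the right one
        bits[1] = 1 if way == 0 else 0   # within the pair, point away from 'way'
    else:
        bits[0] = 0
        bits[2] = 1 if way == 2 else 0
```

Because the s-1 bits only record which half of each subtree was touched more recently, the order recovered by plru_victim is the partial order described above, not the exact LRU order.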
2.1.4 Stack Algorithm

A replacement algorithm is a stack algorithm [31] if it can be correctly evaluated with the following method, which relies on a stack property. Upon every memory reference, the order of memory references is maintained in a stack, and the position of a block in the stack is called its stack distance. To evaluate the miss count, the stack distance of the referenced block is first calculated on each reference and then compared to the cache associativity being evaluated. If the stack distance of the current memory reference is greater than the associativity, the miss count is increased by one. A single scan of a trace can thus efficiently evaluate miss counts for various associativities. LRU and OPT are stack algorithms.
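The one-pass evaluation can be sketched as follows for LRU; the helper names and the tiny synthetic trace are illustrative. In a set-associative cache, the same scan is applied to the sub-trace of references that map to each cache set.

```python
# One-pass LRU stack simulation: the stack distance of every reference yields
# miss counts for all associativities in a single scan of the trace.
def stack_distances(trace):
    stack = []                              # most recently used block at index 0
    for block in trace:
        if block in stack:
            dist = stack.index(block) + 1   # stack distance (1 = most recent)
            stack.remove(block)
        else:
            dist = float("inf")             # first reference: misses at any associativity
        stack.insert(0, block)
        yield dist

def miss_counts(trace, assocs):
    counts = {s: 0 for s in assocs}
    for dist in stack_distances(trace):
        for s in assocs:
            if dist > s:                    # a miss whenever the distance exceeds s
                counts[s] += 1
    return counts

print(miss_counts(list("abcabc"), assocs=[2, 3]))   # -> {2: 6, 3: 3}
```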
2.2 Examples of Non-uniform Miss Costs

In distributed shared-memory systems, as shown in Figure 2.2(a), the cost of a miss mapping to a remote memory is generally higher in terms of latency, bandwidth consumption or power consumption than the cost of a miss mapping to a local memory. Latency, or the Average Memory Access Time (AMAT), is an important performance metric that is often dominant because the speed of processors grows much faster with time than the speed of memory. In systems where cache coherence is maintained in hardware, the unloaded remote-to-local memory latency ratio typically ranges from 2 to 14, depending on the status of the memory block [13][19][26][30][59]. This ratio would be larger in systems with large-scale networks connecting a large number of processors, slower networks, or software cache coherence. Besides latency, bandwidth is also an important performance issue [8]. A remote miss always consumes interconnect bandwidth, whereas a local miss can be satisfied locally most of the time; in this case, we can state that the cost ratio is infinite. Misses that can be completed locally on chip, or even on board, also consume much less power than misses that must be serviced across an interconnection network.

[Figure 2.2. Examples of non-uniform miss costs: (a) a multiprocessor system with local memory and remote memory reached across an interconnection network; (b) an ILP processor with a split I/D cache, a store buffer, a system bus and main memory.]

Another example, shown in Figure 2.2(b), is a system with an ILP processor [50]. It consists of a processor with on-chip split instruction and data caches, a store buffer in parallel with the data cache, and a main memory connected through a system bus. The role of the store buffer is to eliminate the penalty of stores. When a store misses in the data cache, the store immediately writes its data into the store buffer, if an entry is available, and the store instruction can retire. Stores incur a penalty only when the store buffer is full. However, this penalty can be minimized with a proper retirement policy or a large store buffer [22][48]. The penalty of load misses is therefore much larger than the penalty of store misses, and the miss costs are non-uniform.

2.3 Cache Organization in Modern Processors

The selection of a cache replacement algorithm is closely related to the structure of caches. The associativity of L1 caches tends to increase with each processor generation. However, the size of the L1 cache increases rather slowly, to meet ever increasing processor clock speeds. A notable exception is the PA-RISC family, which implements a large L1 cache. The structure of lower-level caches can be classified into two kinds, depending on the physical location of the tag directory. If the tag array is external [63], the caches can be very large but the associativities are relatively low, since the tags must be checked sequentially due to chip pin constraints. If the tag array is internal [4][18][35], larger associativities are possible since the tags can be checked in parallel, but the size of the external cache is limited by the number of on-chip tag entries. The trend is that L2 caches are partly or entirely integrated to allow a larger associativity.

2.4 Miss Cost Prediction

To determine a victim, a cost-sensitive cache replacement algorithm must know the next miss costs of all blocks presently in a cache set at the time of replacement. The next miss cost is the miss cost of a block assuming that it is replaced now and hence will miss at the next access to it. Miss cost prediction is heavily dependent on the target cost and architecture. In the case of penalty in ILP processors, predicting the penalties of memory accesses is equivalent to predicting the next access type (load or store) to all cache blocks in a set. In the case of non-uniform latency in multiprocessor systems, the prediction of latency involves many subsequent predictions, such as the block status in memory and coherence activities.

As with other prediction schemes, miss cost prediction schemes can be static, dynamic or hybrid. Static prediction schemes are usually based on profiling or compiler analysis, and their hardware implementation is relatively cheap. However, they are less applicable in real systems, since the behavior of applications is often dynamic and data-dependent, especially in integer programs. Also, the information available at compile time is very limited, especially when the prediction must be made per cache block. Dynamic prediction schemes attempt to capture run-time memory reference behavior using past history, which can be recorded in hardware tables.

2.5 Evaluation Methodology

Trace-driven simulation is our primary method to evaluate cache replacement algorithms. It has several advantages compared to execution-driven simulation. First, given that traces remain unchanged once they are collected, future references are known in advance; using traces is thus the only practical way to evaluate off-line optimal replacement algorithms. Second, trace-driven simulations are considerably faster than execution-driven simulations. Consequently, they readily enable us to perform a variety of experiments within a reasonable time and thus help characterize and understand the behavior of cost-sensitive replacement algorithms. Third, the results of trace-driven simulations are concise and static, so that we can easily verify design ideas while avoiding the dynamic variations of complex components in modern systems. With this evaluation approach, the primary performance metric is the aggregate miss cost, calculated as the product of the total miss count and the average static miss cost.

Although trace-driven simulations satisfy most of our design objectives and needs, we additionally use execution-driven simulations to understand how cost-sensitive replacement algorithms affect the memory performance of modern systems. The execution-driven simulators widely used for the study of ILP processors and memory systems are SimpleScalar [7], RSIM [41][42], and SimOS [16], in increasing order of complexity and execution time. SimpleScalar is used mostly to study processor architecture. It models a uniprocessor system made of a simplified memory system and an ILP processor. RSIM is used to study CC-NUMA multiprocessors. It models ILP processors, memory hierarchies and an interconnection network in detail. SimOS provides the simulation of complete systems, including operating systems. Because of their complexity and slow simulation speed, the simulated architectures and applications are often scaled down. These simulators are also used to generate traces.

Chapter 3 COST-SENSITIVE OPTIMAL REPLACEMENT ALGORITHMS

3.1 Optimal Replacement Algorithms w.r.t. Miss Count

In this section we first briefly describe an OPT replacement algorithm similar to the one by Mattson et al. [31]. Then we extend it to handle cache invalidations in multiprocessors.

3.1.1 OPT in Uniprocessors

The goal of OPT is to minimize the number of blocks brought into the cache in uniprocessor systems. Since OPT requires knowledge of future block references, it can be applied off-line to a trace of block addresses X = x_1, x_2, ..., x_L, with length |X| = L. The victim block selected by OPT at the time of a miss in the cache is the block whose next reference is farthest away in the future among all blocks presently in cache.

OPT can be implemented with a priority list P. The priority list P_t at time t contains the identities of the blocks currently in cache, ordered by forward distance, right before the reference x_t is performed. The forward distance w_t(a) to a block a in P_t at time t is defined as the position(1) t' in the trace x_{t+1}, ..., x_{t'}, where x_{t'} is the first reference to block a after time t. If a block a is never referenced again after time t, the forward distance w_t(a) is set to L+1. Initially, P_1 contains null blocks whose forward distances are set to L+1. P is updated before each reference and, when a replacement is required, the victim is the block whose forward distance is the largest in P_t, at the bottom of P. In the case of a tie, any block at the bottom of P is chosen among the candidates. Many replacement sequences can yield the same optimal miss count, and OPT explores one of them.

(1) In [31], the forward distance w_t(a) to a block a at time t is defined as the number of distinct blocks in x_{t+1}, ..., x_{t'}, where x_{t'} is the first reference to block a after time t.

The intuition behind OPT is simple. When we decide not to replace a block in the cache at the time of a miss, we hope to be able to keep it in cache until its next access. (If we cannot keep it, then we should have replaced it, since it occupies cache space uselessly.) The chances of success are higher if the next reference to the block is closer in the trace.

Since each update of the priority list P requires the forward distances, a naive implementation of OPT would take a significant amount of forward scanning of the trace X. To speed up OPT, a single backward pass over the trace X, from x_L to x_1, first generates a distance trace D = d_1, d_2, ..., d_L, where d_t = t' if x_t = x_{t'} and x_{t'} is the first reference to x_t after time t. In the case where x_t is never referenced again after time t, d_t = L+1. After the generation of the distance trace D, a single forward pass over the trace X simulates OPT. A formal proof that OPT is an optimal replacement algorithm can be found in [31]. OPT takes O(L) time.
3.1.2 OPT in Multiprocessors

OPT was conceived in the context of uniprocessors, but it can be trivially extended to multiprocessors by including the effects of invalidations. A block invalidated before the next reference to it must become a prime candidate for replacement. First, we keep the trace of a single processor [10]. Second, we augment the trace with all the writes from remote processors. Third, we modify the distance trace and the updates of the priority list as follows. Consider a trace X = x_1, x_2, ..., x_L, and a block a in P_t at time t whose forward distance is t' in the distance trace D, where t' <= L and x_{t'} is a write by a remote processor. Then the forward distance w_t(a) at time t is set to L+1, and P_t is reordered according to the updated forward distances, so that block a is an immediate candidate for replacement at time t. With this simple modification, the evaluation procedure of OPT remains the same as for uniprocessors, requires a two-pass scan of the trace X, and also takes O(L) time.

3.2 Cost-sensitive OPT (CSOPT) with Multiple Miss Costs

For a given trace of memory references resulting from an execution, X = x_1, x_2, ..., x_L, let c(x_t) be the cost incurred by the memory reference with block address x_t at time t. Note that c(x_t) and c(x_{t'}) may be different even if x_t = x_{t'}, because memory management schemes, such as dynamic page migration, can dynamically alter the memory mapping of a block at different times, or because the cost of a static memory mapping may vary with time. With no loss of generality, if x_t hits in the cache, c(x_t) = 0. Otherwise, c(x_t) is the cost of the miss, which can be any non-negative integer. The problem is then to find a cache replacement algorithm such that the aggregate cost of the trace, $C(X) = \sum_{t=1}^{L} c(x_t)$, is minimized.

3.2.1 Basic Implementation of CSOPT

The basic implementation of CSOPT expands all possible replacement sequences in a search tree, as shown in Figure 3.1, and picks the sequence with the least cost at the end. The procedure does not require advance knowledge of future memory accesses, but it involves a huge search tree in which the nodes of the tree are cache states. Each state is assigned the cost of reaching that state from the start of the trace. We add one level of depth to the search tree on every reference in the trace and, for every cache state expanded at the current level of the tree that does not contain the current reference, there are s possible blocks to replace, where s is the cache size. Thus the branching factor in the search tree is the cache size, or the set size in blocks. As we expand the search tree by one level at every reference, we can prune nodes of the tree which have the same cache state; we simply keep the node with the lowest cost among all the nodes with the same cache state at the current level of the tree.

Needless to say, this procedure is extremely complex and unfeasible for any practical trace. Fortunately, there are many ways to drastically prune the huge search tree. In the following we do not provide a formal proof, but we explain intuitively how and why the complexity of the search can be drastically reduced.

[Figure 3.1. Illustration of the search tree.]

3.2.2 Exploiting OPT

The rationale behind OPT can be applied to cut down on the branching factor and frequency in the search tree. For example, if the future miss costs of all the blocks in the set are identical, then the victim can be selected solely by OPT.

For a given trace X = x_1, x_2, ..., x_L and a cache with associativity s, consider a priority list P_t = p_t(1), ..., p_t(s) ordered by forward distances right before block x_t is accessed at time t and misses. Also let f(i) be the miss cost to bring p_t(i) back into the cache at its very next reference, if p_t(i) is replaced at time t. In the case where w(p_t(i)) = L+1, f(i) = 0. Then we can safely invoke OPT at time t to select p_t(s) as the victim if f(s) <= f(i) for every i, where i < s. The reason for selecting p_t(s) as the victim under this condition is straightforward: no other replacement decision at time t can yield a lower miss count than OPT does, and OPT keeps the blocks whose miss costs are equal to or greater than f(s).

On the other hand, OPT cannot decide on a victim if f(i) < f(s) for any i, where i < s. In this situation, OPT would select p_t(s) as the victim. However, it may pay off to keep p_t(s) longer in the cache, until its next reference, by replacing one of the cheaper blocks. This is the only situation where the search must branch with two replacement options: one pursuing OPT and replacing p_t(s), and one reserving a cache block for p_t(s) until the next reference to it. This second option bets that it is worthwhile to save the high-cost miss on p_t(s) and is called a (block or blockframe) reservation. In this reservation option, the search may further branch with several replacement options by considering p_t(i) from i = s-1 down to 1, and replacing p_t(i) instead of p_t(s) if f(i) < f(k) for every k with i < k <= s, because it is unnecessary to replace p_t(i) if f(i) is equal to or greater than the cost of any other block with a larger forward distance, and because we cannot determine which replacement decision leads to the least cost at time t. While pursuing a reservation for a block, more reservations of high-cost blocks may become possible options as the search tree expands. We cannot prune these options until their relative cost can be established with certainty. In general, for a cache with associativity s and a trace with n different costs, the maximum branching factor including OPT in the search tree is n but cannot exceed s.

3.2.3 Illustration

To illustrate CSOPT in a simple case, consider a cache of size s = 3 and a trace X = x_1, x_2, ..., x_14, as shown in Figure 3.2. The block addresses are a, b, c, and d. For simplicity, we consider the case of two static miss costs, and assume that c(d) = r, with r > 1, while the miss cost of the other blocks is 1.

The first two rows of Figure 3.2 show the reference string from the trace. Then we show the cost of each reference for OPT and for the reservation option. The contents of the cache are ordered in the priority list P_t by the forward distances right before time t when x_t is performed. For example, P_4 = {a, b, d}, with w_4(a) = 5, w_4(b) = 6 and w_4(d) = 11. At t = 4, a block has to be replaced to bring block c into the cache, and we have the situation where the bottom of P_4 is a high-cost block and the others are low-cost blocks. OPT victimizes block d. However, CSOPT cannot determine which block to replace without projecting the cost of keeping d in the cache until the next reference to block d and comparing the cost of CSOPT to the cost of OPT.

Figure 3.2. Example of reservation with a single high-cost block:

  time t      1  2  3  4  5  6  7  8  9  10  11  12  13  14
  x_t         a  b  d  c  a  b  c  a  b  c   d   a   b   d
  OPT cost    1  1  r  1  0  0  0  0  0  0   r   0   0   0
  RV cost     1  1  r  1  0  1  0  1  0  1   0   0   1   0

Therefore, at t = 4, CSOPT starts a reservation (RV) for block d by replacing block b. In the reservation, block d is kept in cache until t = 11, where it is first accessed after t = 4. In effect, RV keeps block d at the bottom of the priority list and applies OPT from t = 4 to t = 11 on the remaining two blockframes, because the other blocks are all low-cost. RV releases the hold on block d at t = 11 and returns the management of the cache to pure OPT.

At t = 14 the two cache states are identical and thus we can compare the costs of OPT and of RV. At this point, C_OPT(x_5, ..., x_14) = r and C_RV(x_5, ..., x_14) = 4. If r > 4, RV yields the lower cost and the decision is made to replace block b at t = 4. If r = 4, both options lead to the optimal cost. Otherwise, OPT yields the lower cost and block d is replaced. An important observation is that RV has more misses after t = 11 than OPT, since the contents of the caches under the two options are different at t = 11. Thus, we cannot safely determine at t = 11 which replacement option started at t = 4 leads to a lower cost.

3.2.4 Further Pruning of the Search Tree

So far, we have discussed the simple situation with a single reservation. In practice, multiple reservations may be in progress at any one time. In the worst case, the termination decision for a reservation option must wait until the end of the trace, and the exploration of multiple concurrent reservations can cause an explosion of the search tree. One limit to this explosion is the cache set size, since there is only one possible victim if s-1 high-cost blocks are reserved in the cache, where s is the set size. Nevertheless, our experience indicates that the complexity of running CSOPT in its current form on real traces is still much too high to be feasible. Further pruning of the search tree is required: we need a way to prune hopeless nodes in the search tree before nodes reach the same cache state.

Among the active nodes of the search tree at time t, shown in Figure 3.3, consider any two active nodes k and m in which two different cache states have been reached through different sequences of replacement options. Let c_k be the cost to reach node k and c_m the cost to reach node m with the trace from x_1 to x_t. From time t the search generates two independent search sub-trees S_k and S_m, whose roots are k and m and whose initial costs are c_k and c_m respectively, as the rest of the trace from x_{t+1} to x_L is applied to both nodes.

[Figure 3.3. Pruning of the search tree.]
Let α_{k->m} be the set of all valid blocks in node m but not in node k. Then, d_{k->m} is the sum of the miss costs of the blocks in α_{k->m} for the cache of node k.

Theorem 1. Consider nodes k and m at the same time t in the search tree and assume that c_k + d_{k->m} <= c_m. Then node m can be pruned from the search tree at time t.

Proof. To prove this theorem, we start with the premise that the theorem is false and then reach a contradiction. Let us assume that we cannot prune node m at time t because a path P_m with optimum cost exists in S_m. If this is true, then we will show that a path P_k of lower or equal cost exists in S_k, which contradicts the premise. We now show how to systematically build P_k based on the knowledge of P_m. (Refer to Figure 3.3.)

Consider a node m' in P_m and a node k' in P_k at time t', where t <= t' < L. As we apply the next reference x_{t'+1} to both m' and k' at time t'+1, there are four possible cases.

Case (1): x_{t'+1} hits in m' but not in k'. This means that x_{t'+1} must be present in α_{k'->m'}. As a result, x_{t'+1} must be brought into the cache of k' and removed from α_{k'->m'}. One of the blocks in α_{m'->k'} must be victimized from k'; otherwise one of the blocks present in the caches of both m' and k' would be replaced from the cache of k', and this may incur more cost to P_k. The added cost for P_k is c(x_{t'+1}). There is no added cost for P_m.

Case (2): x_{t'+1} hits in k' but not in m'. x_{t'+1} must be brought into the cache of m'. If a block is selected for replacement among the blocks in α_{k'->m'} in m', then the victim block must be removed from α_{k'->m'}. If one of the blocks present in the caches of both m' and k' is replaced from the cache of m', then this may incur more cost to P_m. There is no added cost for P_k. The added cost for P_m is c(x_{t'+1}).

Case (3): x_{t'+1} misses in both m' and k'. x_{t'+1} must be brought into the caches of both m' and k'. If a block is selected for replacement among the blocks in α_{k'->m'} in m', then the victim block must be removed from α_{k'->m'}, and one of the blocks in α_{m'->k'} must be victimized in the cache of k'. Otherwise, the same block must be selected for replacement in the caches of both m' and k'. The added cost for both P_m and P_k is c(x_{t'+1}).

Case (4): x_{t'+1} hits in both m' and k'. Nothing need be done. No cost accrues.

Observe that, in all four cases, we never add any block to α_{k'->m'}. In fact we remove a block every time we encounter case (1) and sometimes in cases (2) and (3). Thus, after some references, either we reach the end of the trace or α_{k'->m'} becomes empty. If we reach the end of the trace, the cost differential between the last nodes of P_k and P_m cannot be more than d_{k->m}, since we add some cost to P_k but d_{k->m} is also reduced by the same cost in case (1). If α_{k'->m'} becomes empty, and thus d_{k'->m'} = 0, then the cache states in the nodes of P_m and P_k become identical. However, since c_k + d_{k->m} <= c_m, the total cost of the node of P_k cannot be more than the total cost of the node of P_m at the time when the two states become identical.

This demonstrates that there exists a path P_k with equal or lower cost than the optimum path P_m assumed in S_m. Thus either the premise of the proof is false, or it is true but another optimum path exists and we do not have to keep node m to find it. ∎
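To make the pruning rule concrete, the sketch below computes the distance d_{k->m} between two active nodes of the search tree for one cache set and applies the test of Theorem 1. It is only a minimal illustration under stated assumptions: the node layout, the MAX_ASSOC bound and the stand-in miss_cost function are inventions for this example, not CSOPT's actual data structures (the full algorithm is given in Appendix A).

#include <stdbool.h>

#define MAX_ASSOC  8     /* illustrative bound on the associativity s */
#define COST_RATIO 8.0   /* illustrative cost ratio r */

/* One active node of the search tree for a single cache set (assumed layout). */
struct node {
    unsigned long tag[MAX_ASSOC]; /* tags of the valid blocks currently cached */
    int nblocks;                  /* number of valid blocks (<= s) */
    double cost;                  /* accumulated miss cost on the path to this node */
};

/* Stand-in for the static two-cost model: the low bit of the tag plays the
 * role of the local/remote placement decision in this example. */
static double miss_cost(unsigned long tag)
{
    return (tag & 1UL) ? COST_RATIO : 1.0;
}

static bool cached(const struct node *n, unsigned long tag)
{
    for (int i = 0; i < n->nblocks; i++)
        if (n->tag[i] == tag)
            return true;
    return false;
}

/* d_{k->m}: total miss cost of the blocks valid in node m but absent from node k,
 * i.e. the minimum cost for k to reach the cache state of m. */
static double distance(const struct node *k, const struct node *m)
{
    double d = 0.0;
    for (int i = 0; i < m->nblocks; i++)
        if (!cached(k, m->tag[i]))
            d += miss_cost(m->tag[i]);
    return d;
}

/* Theorem 1: node m can be dropped from the search tree if c_k + d_{k->m} <= c_m. */
static bool can_prune_m(const struct node *k, const struct node *m)
{
    return k->cost + distance(k, m) <= m->cost;
}

In CSOPT, this test is attempted pair-wise over the active nodes whenever a node changes its cost or cache state, as discussed in Section 3.2.5.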
Looking back at the example in Figure 3.2, we can prune the path taken by OPT or by RV, depending on the value of r, by applying the result of Theorem 1. The path taken by RV is pruned at t = 8 if r < 2 or at t = 10 if r < 3, since C_OPT(x_5, ..., x_8) + c(d) < C_RV(x_5, ..., x_8) if r < 2, and C_OPT(x_5, ..., x_10) + c(d) < C_RV(x_5, ..., x_10) if r < 3. Also, the path taken by OPT is pruned at t = 11 if r > 4, since C_RV(x_5, ..., x_11) + c(b) < C_OPT(x_5, ..., x_11).

We believe that more pruning of the search tree is possible beyond what is allowed by Theorem 1. However, by applying Theorem 1, we have been able to derive all the results in this work in a reasonable amount of time, and thus we have not explored further pruning opportunities.

3.2.5 Implementation of CSOPT

A straightforward implementation of CSOPT is to expand the search tree and search for the path which yields the optimal cost. Instead of deciding on a correct path at each branch, CSOPT carries all the possible paths until they can be securely pruned or until the end of the trace is reached.

In practice, it is important to prune the search tree as early as possible to avoid its explosion. Thus the pruning is invoked whenever any node has changed its cost or cache state. If every node hits on a reference, no pruning is possible, even if the next reference is an invalidation and hence changes some cache states. Each node maintains its hit/miss status on each reference, and pair-wise pruning must be attempted between the set of nodes that hit and the set of nodes that miss, as well as among the nodes that miss. The details of the algorithm for CSOPT can be found in Appendix A.

3.2.6 Complexity Analysis

From the worst-case analysis, CSOPT takes polynomial time. For a cache with associativity s and a given trace X = x_1, x_2, ..., x_L with length |X| = L, there exist at most t different blocks available for reservations at time t, and there are s-1 blockframes for reservations of those t blocks. Suppose the pruning has been attempted for every pair of active nodes at every reference but could not prune any node until t-1; then the number of active nodes in a search tree at time t is the sum of C(t, i) for i = 1 to s-1. Since C(t, i) <= t^i and the associativity s is very small compared to the length of practical traces, the number of active nodes at time t is O(t^(s-1)). Then the pair-wise pruning over the active nodes takes O(t^(2s-2)) at time t and is performed at every reference, yielding O(t^(2s-1)) time from the beginning of the trace. Thus CSOPT takes O(L^(2s-1)) time.

Contrary to the above worst-case analysis, CSOPT takes much less time for realistic traces, since the frequency of reservations is bounded by a relatively small number of replacements and by exploiting OPT, and since the reservation options can be pruned even before the reserved blocks are accessed. In general, the execution time of CSOPT depends on many factors such as cache size, miss rate, associativity, the fraction of high-cost blocks with respect to low-cost blocks, and the access pattern of high-cost and low-cost blocks, which affects the duration and overlap of reservations.

Another way to estimate the complexity of CSOPT is to associate a random variable with the miss costs of blocks in a trace. For simplicity, we assume that blocks are statically assigned one of two miss costs. Let p be the probability that a block in the trace is mapped to a high cost.
We also assume that the probability that a blockframe is occupied by a high-cost block at the time of a replacement is p. Let λ_i be the probability of invoking a new reservation when i blockframes are already reserved. Then, by exploiting OPT, λ_i = p(1 - p^(s-i-1)) if i < s-1; otherwise, λ_i = 0. This is because the block to be reserved must have a high cost while at least one other unreserved block must not have a high cost.

Figure 3.4 shows the probability of a new reservation λ_i as a function of p as the number of reserved blockframes i varies from 0 to 6 with s = 8. λ_i is maximized when i = 0.

[Figure 3.4. Probability of a new reservation (s = 8)]

Suppose the average distance between two adjacent references to the same block in the trace is a constant a. Then the number of high-cost blocks at any time after a is a·p. As long as our pruning scheme prunes active nodes in a search tree at the same rate as the invocation of new reservations, the number of active nodes at any time is at most (1 + λ_0)·a·p, which can be considered a constant. Thus CSOPT takes O(L) time on average.

3.3 Evaluation Methodology

3.3.1 Trace Generation

Since the optimum replacement algorithms require knowledge of the entire execution of the program, their evaluation is based on trace-driven simulations. Traces are gathered from an execution-driven simulation of a multiprocessor system having an ideal memory system with no cache. We pick one trace among the traces of all processors [10]. The trace of one selected slave process is gathered in the parallel section of each benchmark. To correctly account for cache invalidations, all writes by all processors must be included in the trace. However, to reduce the size of the traces, we coalesce all the writes to the same block by other processors between two consecutive references by the selected processor. Our traces thus contain all the shared data accesses of one processor plus the shared data writes from other processors. The traces exclude private data and instruction accesses, because the private data accesses are either very few or concentrated on a few cache sets yielding an extremely low miss rate, and because our study focuses on data caches.

3.3.2 Baseline Architecture and Evaluation Approach

Our baseline architecture is a CC-NUMA multiprocessor system in which cache coherence is maintained by invalidations [13]. The memory hierarchy is made of one level of cache, to which we apply the optimum replacement algorithms, and a share of the distributed memory.

In general there are multiple costs associated with blocks, and the costs are dynamic and vary with time. This makes it very complex to analyze and visualize the effects of various parameters on the behavior of the optimum replacement algorithms. Thus our experiments are based on a system with two static costs. This is the simplest case possible, yet much can be learned from this simple but very important case. The cost of a block is determined purely by the physical address mapping in main memory, and this mapping does not change during the entire simulation. Based on the address of the block in memory, the cost model can be readily simplified such that local misses are assigned a cost of 1 and remote misses are assigned a cost r.
Then r, the cost ratio of accessing remote and local memory, is the only parameter related to miss costs. If x_t hits in the cache, c(x_t) = 0. Otherwise, c(x_t) = 1 if x_t is mapped to local memory, or c(x_t) = r if x_t is mapped to remote memory.

Data placement in main memory is very important to this study, since it determines the cost mapping of each reference and the fraction of high-cost blocks. In a first set of experiments we assign costs to blocks randomly, based on each block address. This approach encompasses our evaluation space and gives us maximum flexibility, since the fraction of high-cost blocks can be modified at will in practical traces. In practical situations, however, costs are not assigned randomly, and so costs are not distributed uniformly among sets and in time. To evaluate this effect we allocate blocks in memory according to the first-touch policy. Both approaches place data per block rather than per page, to maximize the flexibility of data placement, especially with the relatively small data set sizes in our experiments.

Benchmark   Problem size          Number of processors   Memory usage (MB)   Simulated references by sample processor
Barnes      64K particles         8                      11.3                34,197,827
LU          512 x 512, -b 64      8                      2.0                 12,619,669
Ocean       258 x 258, -r 4000    16                     15.0                15,618,531
Raytrace    car                   8                      32.0                14,036,215
Water-Sp    2744 molecules        8                      1.78                19,179,522

Table 3.1. The characteristics of the benchmarks

Five benchmarks from the SPLASH-2 suite [61] are used, and their main features are listed in Table 3.1. They were chosen for the variety of their behavior. They are compiled for a SPARC V7 architecture with optimization level O2. We have made a small modification in LU. In the original LU, the master processor initializes the matrix in the parallel section and the first-touch placement forces all data to map to the master processor. Thus we allocate data in LU at the beginning of the parallel section to spread the data around. In Barnes and Raytrace, memory accesses are data-dependent and irregular. Thus the remote access fraction varies among processors, and we picked the trace of a processor whose remote access fraction is most representative.

In our evaluations, the most important parameters are the cost ratio r, the cache associativity s and the cache size. We vary r from 1 to 32 to cover a wide range of cost ratios. We also consider an infinite cost ratio by setting the low cost to 0 and the high cost to 1. A practical example of an infinite cost ratio is bandwidth consumption in the interconnection network connecting the processors. The infinite cost ratio experiments also give us the maximum possible savings for all cost ratios above 32. We vary the cache associativities from 2 to 8, and 64-byte blocks are used throughout our evaluations.

To scale the cache size, we first looked at the miss rates for cache sizes from 2 Kbytes to 512 Kbytes. To avoid an unrealistic situation while at the same time having enough cache replacements, we first investigated cache sizes such that the primary working sets started to fit in the cache. Overall, this occurred when the cache was 8 Kbytes in Barnes, LU and Water-Sp. We also examined a cache size such that the secondary working sets fit in the cache. Overall, the knee is at 64 Kbytes.
At the end we selected a 4-way set-associative 16-Kbyte cache with 64-byte blocks. For Ocean and Raytrace, in which the miss rates are inversely proportional to the cache size, the same sizes were used. For the evaluation of CSOPT, OPT is used first to warm up the cache before statistics were collected, so that fair comparisons could be made.

3.4 Performance Evaluation

In this section we present the results of CSOPT in the case of two static costs with random cost mapping and first-touch cost mapping. We mainly compared the performance of CSOPT with that of the version of OPT accounting for invalidations in multiprocessors. Furthermore, we captured the characteristics of CSOPT with respect to the high-cost access fraction, the cost ratio and the cache parameters in the SPLASH-2 benchmarks.

3.4.1 Random Cost Mapping

The simulation results of CSOPT with random cost mapping based on the block address are presented in this section. This approach, although not truly realistic, allows a systematic analysis and characterization of CSOPT. With this approach, we can easily vary the high-cost access fraction (HAF) in a trace. We first assign a random value to each new block address as the trace is scanned from the beginning. During the simulations, all references mapped to a block whose assigned value is greater than a programmable threshold are assigned a high cost. Thus we can systematically vary the HAF with a random cost distribution by simply changing the threshold. We do not consider random cost mapping per reference, since the miss costs of the references to a same block are very static in real systems.

3.4.1.1 Relative Cost Savings

With the two static costs and the cost ratio r in shared-memory systems, the relative cost savings of CSOPT over OPT is calculated as

  [(M^loc_OPT - M^loc_CSOPT) + (M^rem_OPT - M^rem_CSOPT) · r] / (M^loc_OPT + M^rem_OPT · r),

where M^loc_R denotes the number of local misses and M^rem_R denotes the number of remote misses using a replacement algorithm R. In the case of the infinite cost ratio, the relative cost savings is (M^rem_OPT - M^rem_CSOPT) / M^rem_OPT, by setting the cost of local misses to 0.

Figure 3.5 shows the relative cost savings gained by CSOPT relative to OPT in a 16-Kbyte 4-way set-associative cache. We vary the cost ratio r from 2 to an infinite value and HAF from 0 to 1 with a step of 0.1. We add two more fractions at 0.01 and 0.05 to see the detailed behavior between HAF = 0 and HAF = 0.1.

[Figure 3.5. Relative cost savings with random cost mapping. One panel per benchmark (Barnes, LU, Ocean, Raytrace, Water-Sp), plotting the relative cost savings against the high-cost access fraction for each cost ratio.]

In all benchmarks, the relative cost savings increases with r, as expected. Interestingly, the relative cost savings does not increase proportionally to r as r increases from 2 to 32. The relative cost savings increases quickly with small r, but for larger r the savings tapers off. This is because the absolute amount of cost savings increases linearly with r, but the savings does not increase as fast in relative terms, as the aggregate cost of OPT also increases with r. With r infinite, the graphs show the upper bound of cost savings.
In this case, CSOPT systematically replaces low-cost blocks instead of high-cost blocks whenever low-cost blocks exist in the cache. With r = 32 and HAF > 0.5, the relative cost savings almost saturates and is very close to the upper bound of cost savings.

In all benchmarks, as HAF varies from 0 to 1, the relative cost savings quickly increases, consistently showing a peak between HAF = 0.1 and HAF = 0.5 depending on r, and then slowly decreases after the peak as HAF reaches 1. Clearly, it is easier to benefit from CSOPT when HAF < 0.5. The reasons for this behavior will be explained in the next section.

Overall, the results show that the relative cost savings by CSOPT over OPT is significant and its behavior is very consistent across all benchmarks. We observe a "sweet spot" with respect to HAF and the cost ratio. The results indicate that the opportunities for sizable cost savings coincide with the desirable range of HAF in practical systems.

3.4.1.2 Reservation Behavior

Figure 3.6 shows the total number of replacements, the RV (reservation) ratio, the RV success ratio and the average cost savings per RV success as a function of HAF and r in Barnes, LU and Raytrace. Ocean and Water-Sp behave similarly to Raytrace. The results are based on the same parameters as in the previous section.

The RV ratio is the fraction of replacements for which reservation options are considered in an optimal path in the search tree by CSOPT. The RV success ratio is the fraction of reservation options that succeed in keeping the high-cost block in cache. The average cost savings per RV success in each benchmark is calculated as

  [(M^loc_OPT - M^loc_CSOPT) + (M^rem_OPT - M^rem_CSOPT) · r] / RVS,

where M^loc_R denotes the number of local misses, M^rem_R denotes the number of remote misses using a replacement algorithm R, and RVS denotes the total number of RV successes. Thus, the product of the number of replacements, the RV ratio, the RV success ratio and the average cost savings per RV success corresponds to the total amount of cost savings by CSOPT over OPT.

In all benchmarks, the number of replacements increases as r increases from 2 to 32, because the reserved high-cost blocks occupy the cache longer and the effective cache size is reduced. In the case of an infinite cost ratio, the number of replacements increases significantly, since CSOPT blindly replaces low-cost blocks over high-cost blocks.
[Figure 3.6. Replacements, RV ratio, RV success ratio and average cost savings per RV success, each plotted against the high-cost access fraction for several cost ratios in Barnes, LU and Raytrace (16-Kbyte, 4-way cache).]

The miss rates by OPT are 11.84, 2.84, 5.59, 5.19, and 0.48% for Barnes, LU, Ocean, Raytrace, and Water-Sp, respectively.

The RV ratio curve forms a bow shape as a function of HAF, with a peak around HAF = 0.6 among benchmarks. The shape of the curves for the RV ratio in general matches the analytical model in Figure 3.4, except for the cases of an infinite cost ratio. In fact, the RV ratio in this experiment with s = 4 corresponds to a mix of the three bottom curves in Figure 3.4. Interestingly, the RV ratio is less affected by the cost ratio and is even slightly lower as r increases. This is mainly due to the increase of replacements with r, which offsets the RV ratio even though the absolute number of reservations increases with r. With r infinite, the RV ratio is almost 100% when 0 < HAF < 0.1, since low-cost blocks are mostly available for replacement in the cache to reserve high-cost blocks. Afterwards, it drops quickly.

The RV success ratio increases with r among benchmarks, as expected, but the sensitivity of the RV success ratio to the cost ratio varies widely among benchmarks. In Raytrace, whose memory access pattern is irregular and data-dependent, the RV success ratio consistently increases with r, but not proportionally to r. However, to yield the additional RV successes as r increases, CSOPT in Raytrace replaces more low-cost blocks, resulting in a larger increase of replacements (otherwise, those additional RV successes could have been obtained with lower cost ratios). In LU, whose memory access pattern is regular and whose high-cost accesses are either clustered or far apart, the RV success ratio is rather insensitive to the cost ratio. The clustered ones can easily be saved by CSOPT even with small r, whereas the ones that are far apart cannot be saved even with very large r. However, the RV success ratio in LU is higher than in the other benchmarks. The RV success ratio in Barnes demonstrates a mix of the behaviors of LU and Raytrace. In Barnes, the RV success ratio quickly saturates around r = 8. Afterwards, the additional RV successes with larger cost ratios become much more difficult to harvest, as the accesses to the corresponding blocks are farther apart. With r infinite, the RV success ratio is 100%, since the reserved blocks stay in the cache until the next references are made to them.

The RV success ratio decreases as HAF increases in all benchmarks. Since fewer and fewer low-cost blocks are available for replacement during the reservations of high-cost blocks, reservations are forced to terminate sooner as HAF increases.

In theory the upper bound of the average cost savings per RV success¹ is r-1, since a low-cost block replaced instead of a high-cost block will be accessed at least once before the access to the reserved high-cost block. Thus the average cost savings per RV success is relatively small with small r. The graphs show that the average cost savings per RV success almost reaches the upper bound as soon as reservations are active with a very small HAF. Then it decreases steadily as HAF increases.
This is because fewer and fewer low-cost blocks are available for replacement during the reservations of high-cost blocks as HAF increases, and because the low-cost blocks are usually replaced from the top of the priority list while several blockframes are reserved for high-cost blocks at the bottom of the priority list. Thus the replaced low-cost blocks usually exhibit higher access locality, and the average cost savings per RV success is reduced. The decrease is faster with large r, since the additional RV successes with larger cost ratios are obtained by replacing more low-cost blocks.

To visualize the average cost savings per RV success with r infinite, low-cost blocks are assigned a cost of zero and high-cost blocks are assigned a cost of 100. With r infinite, the average cost savings per RV success increases with HAF, shows a peak, and then quickly drops. However, we observe that the average cost savings per RV success, or the effectiveness of blind reservations, is very low compared to the cases of realistic cost ratios, even though high-cost blocks are assigned a cost of 100 in this experiment. By blindly replacing low-cost blocks whenever they exist in the cache during reservations, a high-cost block can occupy a blockframe for a long time, even until its last reference in a trace, unless all the possible blockframes are reserved with high-cost blocks. Thus, the average cost savings per RV success is very low with a very small HAF, and it increases as the low-cost blocks and the high-cost blocks become evenly mixed. Then it diminishes as fewer low-cost blocks are available for replacement.

1. The unit is a ratio to the cost of local misses. In absolute terms, the average cost savings per RV success corresponds to the product of the average cost savings per RV success and the actual cost (e.g., latency) of the low-cost miss.

Overall, the relative cost savings is affected mainly by the RV ratio and the RV success ratio, and decent cost savings is obtained where both the RV ratio and the RV success ratio are moderate.

3.4.2 First-Touch Cost Mapping

In the previous section, the cost of blocks was assigned randomly and we were able to extensively investigate the effect of the cost ratio and the high-cost access fraction on the total cost. With the random cost assumption, low-cost and high-cost blocks are homogeneously spread in time and across cache sets. However, in a realistic situation, the assignment of cost is not random and cost assignments may be highly correlated.

Due to the correlation, we expected that the cost savings would not be as impressive in actual situations. Thus we have explored a simple, practical situation to investigate the effect of cost correlations among blocks. In this section we modify the policy to assign costs to blocks as follows. We allocate blocks to memory according to the first-touch data placement policy. Remote blocks are assigned a high cost and local blocks are assigned a low cost.
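To illustrate how a first-touch cost mapping can be derived from a trace, the following sketch assigns each block a home node equal to the first processor that touches it, and then charges a miss a cost of 1 if the home is local and r otherwise. The trace format read from standard input, the MAX_BLOCKS bound and the helper names are assumptions made for this example only; they are not the simulator's actual interface.

#include <stdio.h>

#define MAX_BLOCKS (1 << 20)  /* illustrative bound on distinct block addresses */

static int home[MAX_BLOCKS];  /* home node of each block, -1 if not yet touched */

/* First-touch placement: the first processor to reference a block becomes its home. */
static int first_touch_home(unsigned long block, int proc)
{
    if (home[block] < 0)
        home[block] = proc;
    return home[block];
}

/* Two-static-cost model: a miss by 'proc' on 'block' costs 1 if the block is
 * local (home == proc) and r otherwise. */
static double miss_cost(unsigned long block, int proc, double r)
{
    return (first_touch_home(block, proc) == proc) ? 1.0 : r;
}

int main(void)
{
    double r = 8.0;   /* cost ratio under study */
    unsigned long block;
    int proc;

    for (unsigned long i = 0; i < MAX_BLOCKS; i++)
        home[i] = -1;

    /* Trace records of the form "<proc> <block>" on stdin (assumed format). */
    while (scanf("%d %lu", &proc, &block) == 2 && block < MAX_BLOCKS)
        printf("block %lu -> miss cost %.0f for processor %d\n",
               block, miss_cost(block, proc, r), proc);
    return 0;
}

Under such a mapping the high-cost access fraction is no longer a free parameter: it is fixed by the placement, which is why each benchmark has a single HAF value in Table 3.2 below.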
Benchmark   HAF     First-touch cost mapping                     Random cost mapping
                    r=2    r=4    r=8    r=16   r=32   r=inf     r=2    r=4    r=8    r=16   r=32   r=inf
Barnes      0.448   9.44   23.40  33.52  39.84  43.44  47.45     9.89   25.10  36.70  44.16  48.51  53.45
LU          0.191   1.36   3.78   6.90   10.12  13.21  22.24     10.40  27.59  45.26  59.77  69.85  85.30
Ocean       0.074   3.73   11.77  21.53  30.74  38.41  51.63     2.63   9.62   20.41  33.50  47.00  76.60
Raytrace    0.296   3.13   9.26   16.22  22.17  26.43  32.13     4.30   12.92  22.54  30.66  36.18  43.10
Water-Sp    0.219   2.72   8.22   13.89  18.19  21.05  25.93     3.71   11.68  21.05  29.26  35.09  45.86

Table 3.2. Relative cost savings with first-touch and random cost mapping (%)

Table 3.2 shows the relative cost savings by CSOPT over OPT in our basic cache with the first-touch policy as r varies from 2 to infinite. The HAF under the first-touch policy in each benchmark is shown in the HAF column of Table 3.2. To understand the performance of CSOPT under the first-touch policy, we compare its savings to the savings under random cost mapping at the same HAF (which also corresponds to the vertical lines in Figure 3.5), by setting the corresponding threshold as described in Section 3.4.1.

Overall, we observe that the differences between the cost savings achieved under the random cost mapping and under the first-touch cost mapping are moderate, except for LU. In LU, the cost savings under the first-touch policy is very poor even though HAF falls into the "sweet spot". In LU, high-cost and low-cost accesses are not spread enough in time with the first-touch policy, and thus the RV ratio and the RV success ratio are significantly reduced by the first-touch policy, as we will see in the next section.

3.4.3 Distribution Across Cache Sets

Figure 3.7 shows the distribution of relative cost savings, RV ratio and RV success ratio by CSOPT with the random cost mapping and the first-touch cost mapping across the 64 cache sets of our basic cache when r = 8.

Overall, we observe that CSOPT yields decent cost savings across cache sets with both cost-mapping schemes, and the relative cost savings does not vary much among cache sets, except for a few cache sets in some benchmarks. The RV ratio and the RV success ratio vary more widely across cache sets than the relative cost savings does. The RV success ratio is mostly proportional to the RV ratio, indicating that more reservations often lead to more reservation successes and cost savings. However, in some cases (e.g., the set of index 56 in Raytrace with the first-touch policy), the RV ratio and the RV success ratio counterbalance each other to yield cost savings close to the average. Barnes and Raytrace are less sensitive to the cost mapping and show similar behavior under both cost mappings. In LU, the RV ratio and the RV success ratio under the first-touch policy are significantly reduced and a similar pattern repeats itself in every 8 sets, whereas they are irregular across cache sets with the random cost mapping.
[Figure 3.7. Cache set distribution with random and first-touch cost mapping (r = 8). One panel per benchmark (Barnes, LU, Raytrace) and cost mapping, plotting the relative cost savings (%), RV ratio (%) and RV success ratio (%) against the cache set index.]

3.4.4 Effect of Cache Associativity

Figure 3.8 shows the effect of cache associativity under the first-touch policy. We vary the associativity from 2 to 8 in a 16-Kbyte cache and the cost ratio from 2 to infinite. Note that the trace applied to each cache set is different for different associativities, since the same cache size is used. Therefore, as the pattern of high-cost and low-cost accesses is altered among cache sets and the miss rates are lowered as s increases, it becomes less straightforward to directly compare the results for different associativities.

In general, as demonstrated by the analytical model in Section 3.2.6, there would be more opportunities for reservations as s increases. For applications whose data access behavior is random and wide, the longer trace per set under large s gives more opportunities for RV successes. By contrast, for applications whose access pattern is regular and narrow, the RV success ratio decreases with s, because the large associativity saves most of the high-cost misses that are clustered together without invoking reservations. In Barnes, the relative cost savings steadily increases with s. In the other benchmarks, larger associativity does not necessarily increase the relative cost savings. The savings with s = 8 is close to or lower than the savings with s = 4. In LU, it is even lower than the savings with s = 2. Overall, the results show that CSOPT yields decent cost savings for associativities from 2 to 8 across all benchmarks.

[Figure 3.8. Relative cost savings by various cache associativities. One panel per benchmark, plotting the relative cost savings against the cost ratio for the different associativities.]

3.5 Towards On-line Cost-sensitive Replacement Algorithms

In this section, we explore the design issues of on-line cost-sensitive replacement algorithms in practical systems, based on our observations with CSOPT.

The setup for the on-line algorithms in realistic systems differs from the setup for CSOPT in several important aspects. First, CSOPT has perfect knowledge of future memory references as well as of miss costs. On-line cost-sensitive algorithms, however, must rely on the future memory references predicted by existing algorithms and must predict future miss costs. Second, CSOPT cannot always determine the victim at the time of replacement, even with perfect knowledge of future memory references and miss costs.
Thus CSOPT branches out all the potential replacement options that may lead to the optimal miss cost at each time of replacement, whereas on-line cost-sensitive algorithms must decide on a victim at each time of replacement.

The concept of blockframe reservation can be applied to existing replacement algorithms to integrate locality and cost. However, with the setup for on-line algorithms, one fundamental design issue is to decide when to invoke and when to terminate the reservations, since it would be extremely difficult to predict the outcome of reservations at a given time. Needless to say, on-line replacement policies must invoke reservations carefully, as unfruitful reservations may increase the final cost, and must terminate the attempted reservations as early as possible if they are deemed not rewarding, especially when the reserved blocks will not be accessed for a long period of time. It is also desirable to control reservations dynamically rather than statically, since the reservation behavior varies with application and cache set.

From our observations with CSOPT, several heuristics and dynamic schemes can be considered and applied to the control of reservations and the prediction of miss costs. First, it is advantageous to dynamically monitor the HAF and invoke reservations accordingly, since large cost savings is possible when HAF is between 0.1 and 0.5, and since the RV success ratio decreases with HAF. Second, the behavior of the average cost savings per RV success can be applied to the cost depreciation of reserved blocks to terminate unfruitful reservations. Since the average cost savings per RV success is roughly proportional to the cost ratio, the cost of a reserved block can be linearly depreciated when the reserved block is continuously kept in the cache at the time of replacement. Last, the RV ratio and the RV success ratio are either insensitive to the cost ratio or quickly saturate when r = 8. The reservation behavior with cost ratios greater than 8 may not differ much from the reservation behavior with r = 8, although the relative cost savings increases with r. Thus the cost prediction does not necessarily have to be accurate, as long as the cost relationship is securely established.

The prediction of miss cost varies with the target cost function. For instance, since the penalties of memory accesses are very dynamic and affected by many factors, they are very difficult to predict with acceptable accuracy. The prediction of bandwidth consumption or latency is much easier in real systems.

3.6 Summary

In this chapter, we have presented an off-line cache replacement algorithm called CSOPT, which minimizes the aggregate miss cost rather than the aggregate miss count in the context of general multiple non-uniform miss costs.

We have found that victim selection leading to an optimal replacement schedule cannot be made at the time of replacement, even with full knowledge of future references. Thus all possible replacement sequences must be pursued until their cost relationship is determined with certainty, and this procedure involves a huge search tree. To make it feasible, we have introduced the concept of blockframe reservation, to trade high-cost misses for low-cost misses, and pruning schemes to simplify the search for an optimal replacement sequence. CSOPT takes polynomial time.
CSOPT has been extensively evaluated and characterized in a simple case of two static miss costs in the context of CC-NUMA multiprocessors, using traces of SPLASH-2 benchmarks. From the trace-driven simulations with the random cost mapping, we have observed that CSOPT boosts its performance over OPT when the fraction of high-cost accesses is between 0.1 and 0.5, and that CSOPT is more effective in reducing the total cost with cost ratios lower than 16. This property of CSOPT is very desirable, as modern multiprocessors are designed to maximize node locality and minimize remote latencies. From the experiments with the first-touch cost mapping, we have observed that the cost savings can be degraded in practical situations if low-cost and high-cost accesses are not evenly mixed in time. However, the results with both cost mapping schemes demonstrate that CSOPT has better performance than OPT and strongly advocate the use of cost-sensitive replacement algorithms in real systems.

Chapter 4
ON-LINE COST-SENSITIVE REPLACEMENT ALGORITHMS

4.1 Integrating Locality and Cost

There are many possible approaches to account for non-uniform miss costs in the management of caches. One simple approach is to reserve a part of the cache for high-cost blocks. This idea was explored in [53], where a critical cache is used as a victim cache for critical blocks, i.e., blocks whose access misses are expensive in an ILP processor. The problem with this approach is that the special cache becomes underutilized, since the fraction of high-cost accesses varies from program to program. In some cases, high-cost blocks may never be removed from the special cache if accesses to them are rare. The outcome is dismal if the locality of references to critical blocks is poor. Another approach is to have a unified cache but to refuse to replace a block unless all other blocks have equal or higher cost. This approach may also result in high-cost blocks staying in cache for inordinate amounts of time, possibly forever.

Clearly, cost-sensitive replacement algorithms must consider both the access locality and the miss cost of every cached block, and a mechanism to depreciate the cost of high-cost blocks is needed in order to integrate cost with locality.

In this chapter, we explore several on-line cost-sensitive replacement algorithms that integrate cost and locality. The first one is GreedyDual [65][9], which is cost-centric in the sense that the cost function dominates the replacement decision. Three new algorithms, which are locality-centric and extend the LRU algorithm with the blockframe reservations adopted in CSOPT and with cost depreciation mechanisms, are then introduced.

4.2 GreedyDual (GD)

GreedyDual was originally developed in the context of disk paging [65] and, later on, was adapted to web caching [9]. In the web caching version, its goal is to reduce the aggregate latency where the size and location of data can vary widely. GD can be easily adapted to processor caches, although it has not been promoted as such.

In GD, each block in cache is associated with its miss cost. GD replaces the block with the least cost, regardless of its locality.
However, when a block is victimized, the costs of all blocks remaining in the set are reduced by its cost. Whenever a block is accessed, its original cost is restored. Thus the only effect of locality on GD is that high-cost MRU (most recently used) blocks are protected from replacement and their lifetime in cache is extended. GD is a cost-centric algorithm and works well when cost differentials are wide. GD is theoretically optimum in the sense that its cost cannot be more than s times the optimum cost, where s is the cache set size [65]. Unfortunately, GD does not work well when the cost differentials between blocks are small, as is the case, for example, for memory latencies in modern multiprocessors.

Our goal is to explore replacement algorithms that are more locality-centric, that is, algorithms that give priority to locality over cost. Whereas we may lose some gains, we expect that locality-centric algorithms will exploit a wider range of cost differences, including small cost differences. Since LRU or an approximation of it is widely adopted in modern systems, our algorithms rely on the locality estimate of cached blocks predicted by LRU. In the following we first discuss the design issues of LRU-based cost-sensitive cache replacement algorithms, then we introduce our three LRU-based algorithms.

4.3 Design Issues in LRU-based Cost-sensitive Replacement Algorithms

To reduce the aggregate miss cost, cost-sensitive replacement algorithms must consider both the access locality and the miss cost of every cached block. The position of a block in the LRU stack [31] represents an estimate of its relative access locality among all blocks cached in the same set. Cost-sensitive replacement algorithms must therefore replace the LRU block if its next miss cost is no greater than the next miss cost of any other block in the same set. However, if the next miss cost of the LRU block is greater than the next miss cost of one of the non-LRU blocks in the same set, we may save some cost by keeping the LRU block until the next reference is made to it, while replacing non-LRU blocks with lower miss costs. While a high-cost block is kept in the LRU position, we say that the block or blockframe is reserved. In fact, the reservations are not limited to the LRU block. While the blockframe for the LRU block is reserved in the cache, more reservations for blocks in other locality positions are possible, except for the MRU block, which is not subject to reservation.

In theory, the cost savings gained by a reservation can be up to the miss cost of the block in the reserved blockframe. However, if the access locality estimate predicted by LRU is reliable, the overall cost savings could be less, or even negative, depending on the locality and cost relationship between the reserved blocks and the blocks that are victimized. The blocks that are replaced instead of a reserved block could be accessed and missed in the cache, possibly several times, before the reserved block is finally accessed and hit in the cache. If a reserved block is never accessed again nor invalidated after a long period of time, the blockframe reservation may become counterproductive, and the resulting cost savings may become negative.
The accurate prediction of the cost savings delivered by a block reservation, at the time when the reservation decision is made, requires tracking many subsequent accesses and replacement decisions, and then backtracking. This approach is feasible off-line, as was done in CSOPT, to obtain an optimal replacement schedule. To implement the same approach on-line would entail an extremely large overhead, if it were possible at all. One way to secure some cost savings in real time, without relying upon knowledge of the future, is to be very conservative in the decisions to start and pursue block reservations. For instance, we may decide to invoke reservations only when the block to reserve in cache has a cost much larger than the others, and/or when the low-cost block to evict from cache has much lower access locality. These conservative approaches yield very few reservations, and probably no reservation at all when the cost differences between blocks are small. Thus a compromise must be struck between pursuing as many reservations as possible and avoiding fruitless pursuits of misguided reservations.

To summarize, the goal is to devise general replacement algorithms that can reliably reduce the aggregate miss cost in caches with a wide range of miss costs and various configurations. In the design of LRU-based cost-sensitive algorithms, the following questions must be answered: (i) when to invoke a blockframe reservation, (ii) which low-cost block among multiple low-cost blocks to victimize in the presence of reserved blockframes for high-cost blocks, (iii) when and how to terminate fruitless blockframe reservations to avoid negative cost savings, and (iv) how to handle multiple blockframe reservations.

4.4 LRU-based Cost-sensitive Replacement Algorithms

The replacement algorithms we propose in this section are locality-centric. The locality property of a block, as predicted by LRU, plays a dominant role in the replacement decision. Let c(i) be the miss cost of the block that occupies the i-th position from the top of the LRU stack in a set of size s. Thus, c(1) is the miss cost of the MRU block, and c(s) is the miss cost of the LRU block in an s-way associative cache.

To select a victim block whenever a reservation is active, we consider two different schemes. The first scheme simply selects the block with the least miss cost, regardless of its LRU stack position. This is a cost-centric decision, which sacrifices all notion of locality. Our experience has shown that this scheme leads to poor performance by evicting blocks with high locality even if the cost difference among replacement candidates is marginal. The second scheme, which we have adopted, selects the first block, in LRU stack order starting from the LRU end, whose cost is lower than the cost of the LRU block. Thus lower-cost blocks with higher locality may remain unvictimized in the set. This scheme seeks more secure cost savings by replacing the block with the least access locality among the replacement candidates, but may squander opportunities for higher cost savings.

To avoid negative cost savings from fruitless block reservations, we terminate reservations by depreciating the miss costs of the reserved LRU blocks.
Ideally, we should depreciate the cost of the reserved LRU block whenever another block in the set is victimized in its place and is accessed again before the next access to the reserved LRU block. However, detecting this condition exactly is not easy in general. The BCL and DCL algorithms try to approximate this condition.

4.4.1 Basic Cost-sensitive LRU Algorithm (BCL)

In the basic cost-sensitive LRU algorithm (BCL), we decrease the cost of a reserved LRU block whenever a block is replaced in its place, regardless of whether the replaced block is referenced again.

BCL handles multiple reservations in a similar way. While a primary reservation for the LRU block is in progress, a secondary reservation can be invoked if c(s) < c(s-1) and there exists a block i < s-1 whose cost is lower than c(s-1). More reservations are possible at the following positions in the LRU stack while the primary and the secondary reservations are in progress. The maximum number of blocks that can be reserved is s-1. The MRU block can never satisfy the condition for reservation, as there is no block in the set with lower locality. When multiple reservations are active, BCL only depreciates the cost of the reserved LRU block.

find_victim()
    for (i = s-1 downto 1)            // from second-LRU toward MRU
        if (c[i] < Acost)
            Acost <- Acost - c[i]*2   // depreciate the reserved LRU block's cost
            return i                  // victimize the block in position i
    return LRU                        // otherwise replace the LRU block

upon_entering_LRU_position()
    Acost <- c[s]                     // assign the cost of the new LRU block

Figure 4.1. Algorithm for Basic Cost-sensitive LRU (BCL)

Figure 4.1 shows the BCL algorithm in an s-way set-associative cache. Each blockframe is associated with a miss cost c(i), which is loaded at the time of a miss. The block with i = 1 is the MRU block. As blocks change their position in the LRU stack, their associated miss costs follow them. The blockframe in the LRU position has one extra field, called Acost. Whenever a block takes the LRU position, Acost is loaded with c(s), which is the miss cost of the new LRU block. Later, Acost is depreciated upon reservations by the algorithm. To select a victim, BCL searches for the block position i in the LRU stack such that c(i) < Acost and i is closest to the LRU position. If BCL finds one, BCL reserves the LRU block in the cache by replacing the block in the i-th position. Otherwise, the LRU block is replaced. In order to depreciate the miss cost of the reserved LRU block, Acost is reduced by twice the miss cost of the block being replaced. Using twice the cost instead of once is safer, because it accelerates the depreciation of the high cost. It is a way to hedge against the bet that the high-cost block in the LRU position will be referenced again [23]. When Acost reaches zero, the reserved LRU block becomes the prime replacement candidate for the next replacement.

The algorithm in Figure 4.1 is extremely simple. Yet, in all its simplicity, it incorporates the cost depreciation of reserved blocks and the handling of one or multiple simultaneous reservations as dictated by BCL.

4.4.2 Dynamic Cost-sensitive LRU Algorithm (DCL)

The beauty of BCL is its simplicity. However, it has some drawbacks. In BCL, the cost of a reserved LRU block is depreciated whenever a valid non-LRU block is victimized in its place.
This approach is based on the pessimistic assumption that the replaced non-LRU blocks will always be accessed before the reserved LRU block, so that their misses will accrue to the total cost. However, this assumption can be quite wrong. If the victimized non-LRU block is not referenced again, then the cost of replacing it instead of the reserved block is zero. In this case the replacement algorithm made a correct choice, and thus it should not be handicapped by depreciating the cost of the reserved block. By depreciating its cost too rapidly, BCL will evict the reserved LRU block earlier than it should. Thus BCL may squander cost savings opportunities by being too conservative. This is especially the case when the cost differential between blocks is relatively small.

The dynamic cost-sensitive LRU algorithm (DCL) is slightly more complex, but it overcomes the shortcomings of BCL. In DCL, the cost of the reserved LRU block is depreciated only when the non-LRU blocks victimized in its place are actually accessed before the LRU block. To do this, DCL keeps a record for each replaced non-LRU block in a directory called the Extended Tag Directory (ETD), similar to the shadow directory proposed in [56]. On a hit in the ETD, the cost of the reserved LRU block is depreciated.

For an s-way associative cache, we only need to keep ETD records for the s-1 most recently replaced blocks in each set, because accesses to blocks that were replaced before these s-1 most recently replaced blocks would miss in the cache if the replacement was LRU. To show this, consider the s most recently replaced blocks since a reservation started and assume that they are not in cache. This means that they have not been referenced since they were replaced and that s misses have occurred since the oldest block was replaced. If the replacement was pure LRU just before the oldest block was replaced, then it would be impossible for the oldest block to still be in cache currently, since s misses for different blocks have occurred since then and there are only s blockframes per set.

Thus, s-1 ETD entries are attached to each set in an s-way set-associative cache. Each ETD entry consists of the tag of the block, its miss cost and a valid bit. Initially, all entries in the ETD are invalid. When a non-LRU block is replaced instead of the LRU block, an ETD entry is allocated, the tag and the miss cost of the replaced block are stored in the entry, and its valid bit is set. To allocate an entry in the ETD, the LRU replacement policy is used, but invalid entries are allocated first. The ETD is checked upon every cache access. The tags in the ETD and in the regular cache directory are mutually exclusive. If an access misses in the cache but hits in the ETD, then the cost of the reserved LRU block in the cache is reduced accordingly (as in BCL) and the matching ETD entry is invalidated. If an access hits on the LRU block in the cache, then all ETD entries are invalidated. Of course, when an invalidation is received for a block present in the ETD (as may happen in multiprocessors, for example), the ETD entry is invalidated.

Although the ETD is not excessively costly, there are several ways to further reduce its size. The miss cost field size can be equal to the logarithm base 2 of the number of different costs. Moreover, we do not need to store the entire tag, just a few bits of the tag.
Of course this creates aliasing, but the aliasing only affects performance, not correctness. We will explore the performance effects of tag aliasing in the ETD in Chapter 5 and will examine the hardware complexity of DCL more closely in Section 4.5.

4.4.3 Adaptive Cost-sensitive LRU Algorithm (ACL)

Both BCL and DCL pursue reservations of LRU blocks greedily, whenever a high-cost block is in a low-locality position. Although reservations in these algorithms are terminated rather quickly if they do not bear fruit, the wasted cost of these attempted reservations accrues to the final cost of the algorithm. If these aborted reservation trials can be filtered out by focusing on reservations with high chances of cost savings, the performance of the replacement algorithm can be further improved and made more reliable.

We have observed that, in some applications, the success of reservations varies greatly with time and also from set to set. Reservations yielding cost savings are often clustered in time, and reservations often go through long streaks of failure. We have also observed that this pattern of alternating streaks of success and failure varies significantly among cache sets.

[Figure 4.2. ACL automaton in each set]

The adaptive cost-sensitive LRU algorithm (ACL), derived from DCL, implements an adaptive reservation activation scheme exploiting the history of cost savings in each set. To take advantage of the clustering in time of reservation successes and failures, we associate a counter with each cache set to enable and disable reservations. Figure 4.2 shows the automaton implemented in each set using a two-bit counter. The counter increments or decrements when a reservation succeeds or fails, respectively. When the counter value is greater than zero, reservations are enabled. Initially the counter is set to zero, disabling all reservations.

To trigger reservations from the disabled state, we use a simple scheme that utilizes the ETD differently while reservations are disabled. When reservations are disabled, an LRU block enters the ETD upon replacement if another block in the set has a lower cost. The ETD is checked upon every cache access, as before. An access hit in the ETD strongly indicates that we might have saved some amount of cost if the block had been reserved in the cache. Thus, upon a hit in the ETD, all ETD entries are invalidated, and reservations are enabled by setting the counter value to 2, with the hope that a streak of reservation successes has just started.

4.5 Implementation Considerations

In this section we evaluate the hardware overhead required by the four cost-sensitive algorithms over LRU, in terms of hardware complexity and its effect on cycle time. Tag fields and cost fields are needed. There are two types of cost fields: fixed cost fields, which store the fixed cost of the next miss, and computed (depreciated) cost fields, which store the cost of a block while it is depreciated.

We first consider the hardware storage needed for each cache set. In an s-way associative cache, BCL requires s+1 cost fields (one fixed cost for each block in the set and one computed cost for Acost). GD requires 2s cost fields (one fixed cost and one computed cost for each block).
4.5 Implementation Considerations

In this section we evaluate the hardware overhead required by the four cost-sensitive algorithms over LRU, in terms of hardware complexity and effect on cycle time. Tag fields and cost fields are needed. There are two types of cost fields: fixed cost fields, which store the fixed cost of the next miss, and computed (depreciated) cost fields, which store the cost of a block while it is being depreciated.

We first consider the hardware storage needed for each cache set. In an s-way associative cache, BCL requires s+1 cost fields (one fixed cost for each block in the set and one computed cost, the depreciated Δcost). GD requires 2s cost fields (one fixed cost and one computed cost for each block). DCL requires 2s cost fields (s fixed costs and one computed cost in the cache, plus s-1 fixed costs in the ETD) and s-1 tag fields. ACL adds a two-bit counter plus a one-bit field to DCL; this bit is associated with the LRU blockframe and indicates whether the block is currently reserved, so that the counter of successful/failed reservations can be updated. All these additional fields can be part of the directory entries, which are fetched in the indexing phase of the cache access, with little access overhead.

In a four-way associative cache with 25-bit tags, 8-bit cost fields and 64-byte blocks, the added hardware costs over the LRU algorithm are around 1.9%, 2.7%, 6.6% and 6.7% for BCL, GD, DCL and ACL, respectively. If the target cost function is static and associated with the address, a simple table lookup can be used to find the miss cost. In this case, the algorithms do not require the fixed cost fields and the added costs are 0.4%, 1.5%, 4.0% and 4.1%, respectively. The hardware requirement of DCL and ACL can be further reduced if we allow aliasing for tags in the ETD, at the price of more aggressive cost depreciation due to false tag matches in the ETD.

The algorithms affect the cache cycle time minimally, if at all. The only operations on a hit are restoring the miss cost of the MRU block in GD, the setting of Δcost for the LRU block in BCL, DCL and ACL, and the lookup of the ETD in DCL and ACL. Given the number of bits involved, these are trivial operations. The major work is done at miss time, when blocks are victimized, and the amount of work is marginal compared to the complexity of operation of a lockup-free cache.

4.6 Summary

In this chapter, we have introduced three new on-line cost-sensitive cache replacement algorithms extended from LRU to reduce the aggregate miss cost rather than the aggregate miss count. The proposed algorithms are locality-centric and integrate locality and cost based on two key ideas: blockframe reservation and cost depreciation. These two ideas are implemented in the basic cost-sensitive algorithm, BCL. To enhance the performance and the reliability of BCL, we have implemented dynamic cost depreciation and adaptive reservation activation schemes, leading to DCL and ACL. The major strength of the algorithms comes from their simplicity and careful design. Their hardware cost is very marginal and their effect on cache access time is negligible.

Chapter 5

IMPROVING MEMORY PERFORMANCE OF MULTIPROCESSORS

Now that we have introduced the on-line cost-sensitive replacement algorithms, we evaluate their effectiveness in a simple model with two static costs, and in a complex situation where costs are multiple, variable, dynamic, and must be predicted.

5.1 Static Case with Two Costs

5.1.1 Evaluation Methodology

In general there are multiple costs associated with blocks, and the costs are dynamic and vary with time. This makes it very complex to analyze and visualize the effects of various parameters on the behavior of a particular algorithm. In this section we present the set of experiments that we have run to lead us to the replacement algorithms in Section 4.4. Much can be learned from these experiments, based on a system with two static costs, the simplest case possible.
The baseline architecture is a CC-NUMA multiprocessor system in which cache coherence is maintained by invalidations [13]. The memory hierarchy in each processor node is made of a direct-mapped L1 cache, an L2 cache to which we apply the cost-sensitive replacement algorithms, and a share of the distributed memory.

The cost of a block is determined purely by the physical address mapping in main memory, and this mapping does not change during the entire execution. Based on the address of the block in memory, misses to the block are assigned a low cost of 1 or a high cost of r. In a first set of experiments we assign costs to blocks randomly based on each block address. This approach gives us maximum flexibility, since the fraction of high-cost blocks can be modified at will. To evaluate our algorithms in more practical situations, we ran a set of experiments in which blocks are allocated in memory according to the first-touch policy, and we assigned a low cost of 1 to locally mapped blocks and a high cost of r to remotely mapped blocks.

The same traces used for the evaluation of CSOPT are also applied to the evaluation of the proposed on-line cost-sensitive algorithms. Their main features are listed in Table 3.1. We dropped Water-Sp since it gives no additional insight.

In our evaluations, the most important parameters are the cost ratio r, the cache associativity s and the cache size. We vary r from 1 to 32 to cover a wide range of cost ratios. We also consider an infinite cost ratio by setting the low cost to 0 and the high cost to 1. The infinite cost ratio experiments also give us the maximum possible savings for all cost ratios above 32. We vary the cache associativity from 2 to 8, and 64-byte blocks are used throughout our evaluations. The L1 cache is 4 Kbytes and direct-mapped, and the L2 cache is 16 Kbytes and 4-way set-associative, based on the cache scaling experiment discussed in Section 3.3.

5.1.2 Random Cost Mapping

In this section, we present the results with random cost mapping based on the block address. This approach allows a systematic performance analysis and characterization of on-line cost-sensitive replacement algorithms. With this approach, we can easily vary the high-cost access fraction (HAF) in a trace by assigning a random value to each new block address as the trace is scanned from the beginning. Then, during our simulations, all references mapped to a block whose assigned value is greater than a programmable threshold are assigned a high cost. We can thus systematically vary the HAF with a random cost distribution by simply changing the threshold.

5.1.2.1 Relative Cost Savings

The relative cost savings of a replacement algorithm is calculated as the ratio between the cost savings afforded by the replacement algorithm as compared to LRU and the aggregate cost of LRU. Figure 5.1 and Figure 5.2 show the relative cost savings gained by the four cost-sensitive algorithms over LRU in our basic cache. We vary the cost ratio r from 2 to an infinite value and HAF from 0 to 1 with a step of 0.1. We add two more fractions, at 0.01 and 0.05, to examine the detailed behavior between HAF = 0 and HAF = 0.1. Table 5.1 displays precise numbers for two high-cost access fractions, 0.2 and 0.6.
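As an illustration of the cost assignment and the savings metric just described, the following sketch shows one way the threshold scheme could be realized in a trace-driven simulator. The hash function, the threshold granularity and the names are assumptions for illustration; the achieved HAF is then measured over the trace.

    /* Sketch of random per-block cost assignment and the relative-savings metric. */
    #include <stdint.h>

    #define HIGH_COST     8u            /* example cost ratio r (low cost is 1)      */
    #define THRESHOLD_PCT 20u           /* programmable threshold on per-block value */

    /* Deterministic per-block pseudo-random value in [0, 99]. */
    static unsigned block_value(uint64_t block_addr)
    {
        uint64_t x = block_addr * 0x9E3779B97F4A7C15ull;    /* simple multiplicative hash */
        return (unsigned)(x >> 32) % 100u;
    }

    /* Every miss to this block is charged either 1 or r. */
    static unsigned miss_cost(uint64_t block_addr)
    {
        return block_value(block_addr) < THRESHOLD_PCT ? HIGH_COST : 1u;
    }

    /* Relative cost savings of an algorithm with respect to LRU. */
    static double relative_savings(double lru_cost, double alg_cost)
    {
        return (lru_cost - alg_cost) / lru_cost;            /* e.g. 0.25 means 25% */
    }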
In all algorithms and benchmarks, the relative cost savings increases with r, as expected. With r infinite, the graphs show the theoretical upper bound of cost savings. In this case, the cost-sensitive replacement algorithms systematically replace low-cost blocks instead of high-cost blocks whenever low-cost blocks exist in the cache, since the cost depreciation of reserved blocks has no effect. The relative cost savings does not increase proportionally to r as r increases from 2 to 32. The relative savings quickly increases for small r, but for larger r the relative savings tapers off. The cost savings increases linearly with r in absolute terms, but not in relative terms, as the aggregate cost of LRU also increases with r.

In BCL, however, we observe that the relative savings is somewhat disappointing, especially with small r, as compared to the relative savings of DCL.

Figure 5.1. Relative cost savings by GD, BCL and DCL with random cost mapping (%) (one panel per benchmark and algorithm, plotted against the high-cost access fraction for r = 2 to infinity)

For instance, with r = 2, the savings is almost negligible in all benchmarks except Barnes. This is caused by the imprecise cost depreciation for the reserved blocks in BCL, and this effect stands out when r is small. On the other hand, DCL, which utilizes the ETD to depreciate the cost of reserved blocks accurately, significantly outperforms BCL in all benchmarks, especially with small r.

Figure 5.2. Relative cost savings by DCL and ACL with random cost mapping (%) (one panel per benchmark and algorithm, plotted against the high-cost access fraction for r = 2 to infinity)
GD outperforms BCL, especially with small r, when HAF is lower than 0.5. When HAF is greater than 0.5, BCL outperforms GD across all cost ratios. However, compared to the large differences observed with DCL and ACL, the overall performance of GD is quite similar to that of BCL.

Table 5.1. Relative cost savings (%) with HAF = 0.2 and HAF = 0.6

                              HAF = 0.2                                      HAF = 0.6
                  r = 2                  r = inf                  r = 2                  r = inf
             GD   BCL   DCL   ACL    GD    BCL   DCL   ACL    GD    BCL   DCL   ACL    GD    BCL   DCL   ACL
  Barnes    8.82  4.69 24.64 23.75  71.10 72.63 74.18 71.43   6.36 10.09 18.16 16.36  19.94 26.65 26.70 23.30
  LU        1.17  0.88 24.02 22.36  64.00 66.37 66.07 62.02   0.72  1.20  9.78  9.09   9.24 14.48 14.49 12.38
  Ocean     1.84  0.77  7.32  5.39  30.32 31.00 32.24 23.06   0.70  0.87  0.96  1.55   4.55  6.06  6.11  4.18
  Raytrace  1.03  0.84  2.38  3.48  32.87 34.22 36.00 25.85   0.68  0.82  0.43  2.15   6.28  8.49  8.50  4.72

DCL reliably outperforms GD in every situation with our benchmarks. Although GD can be effective for web caching, where costs vary widely, our results indicate that it does not do as well in several situations, especially when r is small.

Interestingly, Figure 5.2 shows that the cost savings of ACL is slightly lower than that of DCL in all situations except Raytrace with r = 2, which shows very marginal improvements over DCL. This is mainly due to the random cost mapping in this experiment. Under the random cost mapping, the cost savings is less clustered in time than under more realistic cost mappings. Thus, ACL either enables the reservations most of the time or loses savings opportunities when the streaks of reservation success and failure are short-lived.

In all algorithms and benchmarks, as HAF varies from 0 to 1, the relative cost savings quickly increases, consistently showing a peak between HAF = 0.1 and HAF = 0.3; it then slowly decreases after the peak as HAF reaches 1. Clearly, it is easier to benefit from a cost-sensitive replacement algorithm when HAF < 0.5. The reason for this behavior is discussed in the next section.

Overall, the results show that the relative cost savings of DCL over LRU is significant and consistent across all benchmarks. We observe a "sweet spot" for the relative cost savings with respect to HAF and r.

5.1.2.2 Reservation Behavior in DCL

Figure 5.3 shows the RV (reservation) ratio, the RV success (RVS) ratio, and the average cost savings per RVS in DCL, as a function of HAF and r. The results are based on the same parameters as in the previous section.

The RV ratio is the fraction of replacements from which reservations are invoked. We count each new reservation in a multiple-reservation situation, but we exclude the reservations already in progress (we do not count a reservation twice). The RV success ratio is the fraction of reservations for which the reserved blocks are eventually accessed in the cache before they are replaced. The average cost savings per RV success is computed as the ratio of the total cost savings to the total number of RV successes, since

    average_cost_savings = RV_ratio x RVS_ratio x average_cost_savings_per_RVS.

The shape of the curves for the RV ratio in DCL can be easily explained using the simple analytical model described in Section 3.2.6.
Since DCL invokes a new reservation whenever the LRU block has a high cost and at least one other block in the set has a low cost, the probability of a new reservation in an s-way associative cache is p(1 - p^(s-1)), where p is the probability that a cache blockframe contains a high-cost block; p can be closely approximated by HAF. The RV ratio curve forms a bow shape as a function of HAF, with a peak of 49% around HAF = 0.5 among the benchmarks. The shape can also be explained by the fact that, at both ends of the curve, one type of block dominates overwhelmingly, and therefore opportunities for reservations are few in both cases. Opportunities are much better when blocks of both costs are about evenly mixed. The RV ratio is little affected by r.

The RV success ratio decreases as HAF increases in all benchmarks. Since fewer and fewer low-cost blocks are available for replacement during the reservation of high-cost blocks, reservations are forced to terminate sooner as HAF increases. They have to terminate whenever the set contains only high-cost blocks, even if they might have been successful otherwise. The RV success ratio slightly increases with r, as expected, but is not proportional to r. This is because the reserved blocks with high locality are saved mostly with small r; additional RV successes with larger cost ratios become more difficult to harvest, as the accesses to the corresponding blocks are farther apart.

Figure 5.3. Reservation behavior in DCL with random cost mapping (RV ratio, RV success ratio, and average cost savings per RV success versus the high-cost access fraction)

Figure 5.4. Relative miss rate increase in DCL with random cost mapping

The plots in Figure 5.3 show that the average cost savings per RV success is very close to the cost ratio, and even exceeds it at times. At first, we were quite surprised by this observation. We did not suspect that the miss rate could play a role, since it seemed obvious that tinkering with LRU could only increase the miss rate. However, as shown in Figure 5.4, the relative miss rate increase of DCL over LRU varies widely among benchmarks.
The relative miss rate increase is the ratio between the increase in L2 cache misses of DCL over LRU and the number of L2 cache misses in LRU. The L2 cache miss rates of LRU with respect to processor references are 21.3%, 4.5%, 7.2% and 6.8% for Barnes, LU, Ocean and Raytrace, respectively. Note that the miss rate of LRU does not vary with either r or HAF.

We expected that the number of misses in DCL would increase over LRU, as high-cost blocks are saved by replacing several low-cost blocks. In Raytrace, this is obviously the case. But, surprisingly, the number of misses in every situation in Barnes, and in some situations in LU and Ocean, is reduced in DCL over LRU. Obviously, LRU is not always the best replacement algorithm to minimize the miss rate.

There are situations where the non-LRU blocks replaced instead of a reserved LRU block have lower locality than the reserved block. For instance, consider the reference string ABCDABCD applied to a three-way associative cache, and assume that block C is the only low-cost block. LRU misses on every reference. DCL, however, will reserve block A by replacing block C to bring block D into the cache. Later, the access to block A hits, thanks to the reservation. The subsequent access to block B hits as well, although B itself was never reserved. This situation can occur quite often, as indicated by the large performance gap between LRU and OPT [31][60]. It is important to note that BCL, DCL and ACL actively pursue these opportunities on behalf of high-cost blocks.

When this situation occurs, i.e., when LRU goes wrong, DCL takes advantage of it by reducing the number of misses on high-cost blocks while replacing a low-cost block closest to the LRU position. Such cases occur more frequently in the benchmarks from Barnes to Raytrace in Figure 5.4. In Barnes, the miss rate of DCL is up to 17% lower than the miss rate of LRU, but in many cases the cost-function gains shown in Figure 5.1 are much higher than 17%. In Raytrace, the relative miss rate increase over LRU is very large, but DCL is still able to cut the aggregate cost effectively, as shown in Figure 5.1. We also see that, in some cases such as Barnes, the difference between the numbers of low-cost misses in DCL and in LRU steadily grows as r increases, but the reduction in the number of high-cost misses exceeds the increase in the number of low-cost misses. In these cases, the miss rate effect "piles up" on top of the cost improvements actively sought by DCL to yield the impressive results of Figure 5.1.

Figure 5.5. Relative cost savings in DCL with different cache associativities (Barnes and LU, 16-Kbyte cache, 2-, 4- and 8-way)
5.1.2.3 Effect of Cache Associativity

Figure 5.5 shows the relative cost savings of DCL in Barnes and LU as the cache associativity varies in a 16-Kbyte second-level cache. As the associativity varies from 2 to 8, the relative cost savings increases steadily, and the savings peaks occur at larger values of HAF because wider associativities yield more reservation opportunities. As the number of ways goes up, the peaks of the RV ratio as a function of HAF move toward the right side of the graphs, following the analytical model for the RV ratio described earlier, whereas the RV success ratio shows the same trend across all associativities. Overall, the simulations indicate that DCL can perform reliably across the associativities practical in modern systems. DCL yields sizable cost savings over LRU even in two-way associative caches. This is usually considered very difficult to achieve because of the high locality of MRU blocks.

5.1.3 First-Touch Cost Mapping

So far, the cost assignment to blocks has been done randomly. We were able to look extensively at the effect of the cost ratio and the high-cost access fraction on the total cost. With the random cost assumption, low-cost and high-cost blocks are homogeneously spread in time and across cache sets. However, in a realistic situation the assignment of costs is not random, and cost assignments may be highly correlated. For example, if an application has an HAF of 0.5, we would expect some improvement of the cost function in DCL according to the evaluations in the preceding section. However, an HAF of 0.5 over the entire execution could result from an HAF of 0 for half of the execution and of 1 for the other half. Or it could be that HAF is 0 for half of the sets and 1 for the other half. In both cases, the gains from DCL are expected to be dismal, if not negative. Because of this correlation, we always expected that the cost savings would not be as impressive in actual situations as Figure 5.1 would lead us to believe. Thus we have explored a simple, practical situation to investigate the effect of cost correlations among blocks.

In this section, we modify the policy used to assign costs to blocks as follows. We allocate blocks to memory according to the first-touch policy. Remote blocks are assigned a high cost and local blocks are assigned a low cost.

Table 5.2 shows the relative cost savings of the cost-sensitive algorithms over LRU in our basic cache as r varies from 2 to 32. The high-cost access fraction in each benchmark is shown in the first column of Table 3.2. To understand the performance of the replacement algorithms under the first-touch cost assignment, we compare their savings to the savings under the random cost mapping at the same HAF (corresponding to the vertical lines in Figure 5.1 and Figure 5.2).

Overall, we observe that the differences in cost savings achieved under the random cost mapping and the first-touch cost mapping are moderate, except for LU. In LU, the savings under the first-touch policy is very poor. It even becomes negative in BCL and DCL, although the high-cost fraction falls in the "sweet spot". Accesses in LU have high locality and their behavior varies significantly across cache sets. In some cache sets, no reservation succeeds.
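For concreteness, the first-touch cost assignment used here can be sketched as follows. The page granularity, table size and names are assumptions for illustration; an exact implementation would track every allocated block or page without aliasing.

    /* Sketch of first-touch data placement and the resulting miss cost. */
    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define MAX_PAGES  (1u << 20)
    #define HIGH_COST  8u               /* example cost ratio r; low cost is 1 */

    static int8_t home_node[MAX_PAGES]; /* -1 until the page is first touched */

    static void init_first_touch(void)
    {
        for (uint32_t p = 0; p < MAX_PAGES; p++)
            home_node[p] = -1;
    }

    /* Called on every reference; the first toucher becomes the home node. */
    static unsigned miss_cost_first_touch(uint64_t addr, int8_t node)
    {
        uint32_t page = (uint32_t)(addr >> PAGE_SHIFT) % MAX_PAGES;
        if (home_node[page] < 0)
            home_node[page] = node;     /* first touch: allocate locally   */
        return (home_node[page] == node) ? 1u : HIGH_COST;
    }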
Table 5.2. Relative cost savings with first-touch data placement (%)

                      GD                                    BCL
             r=2    r=4    r=8    r=16   r=32      r=2    r=4    r=8    r=16   r=32
  Barnes    7.99  20.62  29.14  32.31  33.94      9.98  24.61  36.40  41.11  43.17
  LU       -0.02   0.30   0.27   0.19   0.47      0.04  -0.03  -0.32  -0.65  -0.76
  Ocean    -1.51   2.86  14.99  26.08  35.13       --   -0.94   0.99  12.98  35.32
  Raytrace  0.57   3.83   8.91  13.82  17.25      0.16   2.78   7.86  14.59  20.00

                      DCL                                   ACL
             r=2    r=4    r=8    r=16   r=32      r=2    r=4    r=8    r=16   r=32
  Barnes   25.86  33.10  38.18  41.42  43.31     24.59  31.48  36.28  39.29  41.02
  LU       -0.37  -0.58  -0.87  -1.19  -1.24      0.24   0.42   0.67   0.97   1.48
  Ocean     6.24  12.40  20.88  29.23  36.03      6.21  12.43  20.65  28.49  34.81
  Raytrace  2.35   7.15  12.68  17.52  20.94      3.11   6.67  10.80  14.53  17.17

This observation prompted us to explore adaptive algorithms across sets and time, and to come up with the design of ACL. LU takes advantage of ACL and even shows small positive savings in ACL. The savings behavior of GD is the same as for the random cost mapping. In Barnes, where HAF is high, the cost savings of GD is much lower than that of BCL. In the other benchmarks, GD outperforms BCL with small r.

Table 5.3 shows the relative miss rate increase of each replacement algorithm over LRU in our basic cache. Overall, it follows the same pattern as under the random cost mapping, except for LU, whose reservation rate is lower with the first-touch data placement. We again observe that when the miss rate is improved over LRU, the cost-sensitive replacement algorithms take advantage of it, and when the miss ratio is worse, they still provide cost improvements over LRU.

Table 5.3. Relative miss rate increase over LRU with first-touch data placement (%)

                      GD                                    BCL
             r=2    r=4    r=8    r=16   r=32      r=2    r=4    r=8    r=16   r=32
  Barnes   -5.43 -10.67 -12.33 -11.52 -10.61     -6.59 -12.93 -16.23 -15.92 -14.66
  LU        0.07   0.17   0.85   2.30   4.28     -0.01   0.58   1.78   4.18   8.95
  Ocean     1.85   2.46   2.99   5.84  10.24      1.41   2.34   3.93   5.59   8.74
  Raytrace  0.77   1.89   3.74   6.36   9.32      1.38   2.70   5.18   9.05  13.63

                      DCL                                   ACL
             r=2    r=4    r=8    r=16   r=32      r=2    r=4    r=8    r=16   r=32
  Barnes  -17.82 -17.56 -16.76 -15.62 -14.28    -17.17 -17.11 -16.55 -15.71 -14.75
  LU        0.59   1.27   2.63   5.34  10.65     -0.09  -0.03   0.09   0.31   0.73
  Ocean    -1.78  -0.09   2.68   7.21  14.58     -2.37  -1.54   0.19   3.12   7.77
  Raytrace  2.64   5.67   9.71  14.28  18.68      0.02   1.60   3.98   6.84   9.59

Although ACL does not always reap the best cost savings, its cost savings is always very close to the best one among the four algorithms. Moreover, ACL is more reliable across the board, as its cost is never worse than LRU's.

5.2 Dynamic Case with Multiple Latencies

In this section, we apply our cost-sensitive replacement algorithms to improve the memory performance of multiprocessor systems with non-uniform memory access latencies. In this context the target cost should be the miss penalty, i.e., the impact of the miss on processor performance. However, the penalty cannot be measured easily [53], and tackling this problem is beyond the scope of this thesis. Thus, in this section, we have settled for the miss latency as a measure of miss cost. In a CC-NUMA multiprocessor, the miss latency is dynamic and depends on the type of access and the global state of the block at the time of access. When the replacement decision must be made, the latency of the future miss must be predicted.
5.2.1 Miss Cost Prediction in CC-NUMA Multiprocessors

To select a victim in a cost-sensitive replacement algorithm, the future miss costs of all cached blocks must be known in advance at the time of replacement. The future miss cost of a cached block refers to the miss cost of the next reference to it if the block is victimized. In general, the future miss cost is affected by the replacement decision. Such a prediction is difficult or even impossible to verify, in general, unless miss costs are fixed and static.

One way to approach the prediction of miss latencies is to look at the correlation between consecutive unloaded miss latencies to the same block in a normal execution with LRU replacement. Table 5.4 is a two-dimensional matrix indexed by the attributes of the last miss and of the current miss to the same block by the same processor. It gives the average absolute difference in latencies between the two misses across the four SPLASH-2 benchmarks of Section 5.1.1, for a MESI protocol without replacement hints [26]. The attributes are the request type (read or read-exclusive) and the memory block state (Uncached, Shared, or Exclusive). For instance, the table shows that read misses followed by another read miss to the same block by the same processor, while the block is in memory state Shared, make up about 54% of all misses. In such cases the unloaded miss latency does not change from one miss to the next. Overall, the table shows (in the shaded areas) that 93% of misses are such that their latencies are the same as those of the prior misses by the same processor to the same block. For the remaining 7%, the average difference between past and present latencies varies widely. However, the average latency difference is mostly small, and is much smaller than the local latency, which is 60 cycles in this case. Similar results are obtained with the protocol with replacement hints [29].

Table 5.4. Latency variation in the protocol without replacement hints (U = Uncached, S = Shared, E = Exclusive memory state of the current miss)

  Current miss = read
                       occurrence (%)        mismatch (%)        avg. lat. error
  last miss            U      S      E       U      S      E       U      S      E
  read       U        22.1    1.5    0.1     0      0     83      0.0    0.0   25.5
             S         0.2   53.8    0.1     0      0     83      0.0    0.0   17.8
             E         0.0    1.2    0.2    67    100     12     19.8   21.1   28.8
  rd-excl    U         4.6    0.1    0.1     0      0     67      0.0    0.0   38.1
             S         0.2    0.0    0.1    68     70     67     27.3   43.0   17.3
             E         1.9    0.0    0.0    75     67     21     57.4   38.0   33.0

  Current miss = read-exclusive
                       occurrence (%)        mismatch (%)        avg. lat. error
  last miss            U      S      E       U      S      E       U      S      E
  read       U         2.2    0.1    1.9     0     59     67      0.0   27.6   70.3
             S         0.0    0.3    0.0     0     68     58      0.0   31.2   26.3
             E         0.0    0.1    0.0    42     67     10     33.3   15.6   28.0
  rd-excl    U         8.9    0.0    0.0     0     57     58      0.0   33.8   40.5
             S         0.1    0.0    0.0    44     34     58     33.1   26.0   18.3
             E         0.3    0.0    0.0    50     57     14      5.2   27.4   30.8

In the following, we simply use the last measured miss latency to predict the future miss latency to the same block by the same processor. The latency is measured by augmenting each message with a time stamp. When the request is sent, the message is stamped with the current time. When the reply is returned to the requester, the latency is measured as the difference between the current time and the time stamp. In the case where a request receives multiple replies, the latency is measured at the time when the requested block becomes available. If a nacked request is reissued, the original time stamp is reused.
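The last-latency predictor described above can be captured in a few lines. The sketch below is illustrative only: the structure and function names are assumptions, and in a real controller the issue time would travel with the request message rather than sit in a table.

    /* Sketch of time-stamped latency measurement and last-latency prediction. */
    #include <stdint.h>

    typedef struct {
        uint64_t issue_time;            /* time stamp carried by the request     */
        uint32_t last_latency;          /* last measured miss latency (cycles)   */
    } block_cost_state;

    /* When the miss request leaves the cache controller.  Reissued (nacked)
     * requests keep the original stamp, so this is not called again for them. */
    static void on_request_sent(block_cost_state *b, uint64_t now)
    {
        b->issue_time = now;
    }

    /* When the requested block becomes available at the requester. */
    static void on_block_available(block_cost_state *b, uint64_t now)
    {
        b->last_latency = (uint32_t)(now - b->issue_time);
    }

    /* Predicted cost of the next miss to this block, consumed by DCL/ACL. */
    static uint32_t predicted_miss_cost(const block_cost_state *b)
    {
        return b->last_latency;
    }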
5.2.2 Evaluation Approach and Setup

To measure performance in multiprocessors with modern ILP processors, we use RSIM [41], which models the processor, memory system and interconnection network in detail. We have implemented the four on-line cost-sensitive replacement algorithms, as well as LRU, in the second-level caches. Table 5.5 summarizes our target system configuration, consisting of 16 processors. Data is placed in main memory using the first-touch policy for each individual memory block. The minimum unloaded remote-to-local latency ratio for clean copies is around 3. To reflect modern processors, we consider 500 MHz and 1 GHz clock speeds. In our evaluations, the sequential memory consistency model is assumed.

Table 5.5. Baseline system configuration

  Processor Architecture
    Clock:             500 MHz or 1 GHz
    Active List:       64 entries
    Functional Units:  2 integer units, 2 FP units, 2 address generation units, 32-entry address queue

  Memory Hierarchy and Interconnection Network
    L1 Cache:          4 Kbytes, 1-way, write-back, 2 ports, 8 MSHRs, 64-byte block, 1 clock access
    L2 Cache:          16 Kbytes, 4-way, write-back, 8 MSHRs, 64-byte block, 6 clocks access
    Main Memory:       4-way interleaved, 60 ns access time
    Unloaded Minimum Memory Access Latency:  local clean: 120 ns, remote clean: 380 ns, remote dirty: 480 ns
    Cache Coherence Protocol:  MESI protocol with replacement hints
    Interconnection Network:   4x4 mesh, 64-bit link, 6 ns flit delay, 64-flit switch buffer

The four benchmarks of Section 5.1.1 are used. However, due to the slow simulation speed of RSIM, the problem sizes are further reduced. We execute Barnes with 4 K particles, LU with a 256 x 256 matrix, Ocean with a 130 x 130 ocean, and Raytrace with the teapot scene. The benchmarks are compiled for Sparc V8 with the Sun C compiler at optimization level xO4.

5.2.3 Execution Times

Table 5.6 shows the reduction of execution time (relative to LRU) for each of the four cost-sensitive replacement algorithms, with processors clocked at 500 MHz and 1 GHz.

Table 5.6. Reduction of execution time by cost-sensitive algorithms over LRU (%)

                          500 MHz Processor                               1 GHz Processor
             GD    BCL    DCL    ACL   DCL-alias ACL-alias    GD    BCL    DCL    ACL   DCL-alias ACL-alias
  Barnes    4.94   7.36  16.92  16.15   15.90     15.14      6.88   8.51  18.12  17.37   18.41     17.20
  LU       -0.62  -0.40   3.50   3.93    4.46      5.07     -0.44  -0.29   3.59   4.20    4.75      5.38
  Ocean     6.28   5.99   8.29   7.35    7.65      6.84      6.45   6.18   8.46   7.94    8.00      7.12
  Raytrace  3.50   2.75   7.19  13.44    5.61     14.56      3.59   2.30   7.82   7.55    6.70      5.68

GD, as compared to BCL, shows mixed behavior. GD slightly outperforms BCL in Ocean and Raytrace, whereas BCL outperforms GD in Barnes and LU. The execution times of LU with GD and BCL are slightly increased. This behavior conforms to the results of the trace-driven simulations in Section 5.1.3. Overall, BCL yields more reliable improvements than GD with both processors. However, the difference between BCL and GD is quite small compared to their difference with DCL and ACL.

DCL yields reliable and significant improvements in execution time in every situation. The improvements of DCL over BCL are large in Barnes and Raytrace, whose data access patterns are rather irregular. Thus it is advantageous to utilize the ETD for accurate depreciation of miss costs in these benchmarks.

As compared to DCL, the execution times with ACL are slightly longer, except in a few cases. This indicates that ACL is rather slow in adapting to rapid changes in the savings pattern. Thus ACL filters out some chances of cost savings as well as unnecessary reservations. LU shows consistent but marginal improvements.
In LU, the streaks of reservation failures are extremely long in some cache sets, and ACL effectively filters these unnecessary reservations. In Raytrace with 500 MHz processors, the large improvement of ACL over DCL mainly comes from the reduction of synchronization overhead and load imbalance.

To reduce the size of the ETD, we have the option of storing a few bits of the tag instead of the whole tag, as explained in Section 4.5. The last two columns of each processor group in Table 5.6 show the results with tag aliasing in the ETD. We reduced the tag size to 4 bits. This tag aliasing saves practically 40% to 60% of the tag storage in the ETD, depending on the data address space of each benchmark. The ratios of false matches upon cache misses due to the aliasing are 45%, 43%, 30% and 27% for Barnes, LU, Ocean and Raytrace, respectively. The false matches result in a more aggressive depreciation of the cost of a reserved block, which seems to benefit LU. The results show that the effect of ETD tag aliasing on the execution time is very marginal.

Table 5.7 shows the relative miss rate increase of the cost-sensitive algorithms over LRU. Overall, the miss rates are reduced by the cost-sensitive algorithms over LRU, except for a few cases in LU. This indicates that the reduction of high-cost misses exceeds the increase of low-cost misses.

Table 5.7. Relative miss rate increase over LRU (%)

             LRU miss        500 MHz Processor                   1 GHz Processor
             rate (%)      GD     BCL     DCL     ACL         GD     BCL     DCL     ACL
  Barnes       7.69      -7.54   -9.10  -20.47  -19.91      -7.60  -10.68  -19.51  -19.73
  LU           1.07       1.07    0.91   -5.31   -5.80       1.04    0.90   -5.20   -5.77
  Ocean        2.78      -1.17   -2.14   -1.31   -1.69      -1.25   -2.42   -1.55   -2.07
  Raytrace     0.83      -2.48   -3.96   -7.57   -9.46      -2.10   -3.34   -7.62   -6.81

As explained in Section 5.1, BCL, DCL and ACL pursue errors made by LRU on behalf of high-cost blocks. Also, the improvements in execution time are very impressive compared to the improvements in miss rate. A miss rate improvement of x in a non-blocking L2 cache with an ILP processor does not, in general, directly translate into an execution time improvement of x, as the effects of a large number of misses are partially hidden by activity overlap in the processor. What we observe here, again, is that the miss rate improvement effect piles up on top of the cost reduction effect sought by the replacement algorithms to improve execution time further. Unfortunately, it is impossible to separate the two effects and to measure their relative importance on the execution time.

It is very difficult to correlate the results of the execution-driven simulations in this section with the trace-driven simulations of Section 5.1.3, although both sets of experiments point in the same direction. There are many differences between the experiments reported in this section and those of Section 5.1.3:

  - There are multiple costs. In fact, we measure accurately the latency of the last miss by time stamping (this includes the effect of conflicts).
  - The costs are dynamic, as the latency depends on the type of access and the global state of the block.
  - The miss latency must be predicted.
  - We measure execution time rather than cost savings. Many other factors besides L2 cache misses affect the execution time.
  - The data set sizes and the number of processors in the benchmarks are different.
Thus it is remarkable that the general trends are similar in both sets of experiments. The algorithms seem to be very resilient to variability in application, architecture, cost function, metric used, miss rates, quality of predictions, and hardware employed. This shows the robustness of the ideas behind LRU-based cost-sensitive replacement algorithms. This is particularly true for ACL, whose performance is never worse than LRU's with our benchmarks and experimental setup.

Overall, we believe that the improvements in execution time by DCL are significant. The performance of ACL is often slightly lower than DCL's, but ACL gives more reliable performance across various applications.

5.2.4 Implementation Considerations

Even if costs are dynamic, it is possible to reduce the number of bits required by the fixed cost fields. For example, instead of measuring accurate latencies as we have done in this section, we can use the approximate, unloaded latencies given in Table 5.5, which can be looked up in a table. In general, the number of bits required by the fixed cost fields is then equal to the logarithm base 2 of the number of costs. In the example shown in Table 5.5, we would need 2 bits for the fixed miss cost fields.

On the other hand, the computed cost fields must have enough bits to represent latencies after they have been depreciated. Let us assume that the greatest common divisor (GCD) of all possible miss costs is G. Then G can be the unit of cost. Let us also assume that the largest possible cost is K x G. Then we need log2(K) bits for the computed (depreciated) cost field. For example, from Table 5.5 we can use G = 60 ns and K = 8 (the only problem is the 380 ns latency, which would be encoded as 360 ns, a minor discrepancy). Thus 3 bits are sufficient for the computed cost fields. In this case, with 5 bits for the tag and the valid bit of each ETD entry, the hardware overhead per set over LRU is 11 bits in BCL, 20 bits in GD, 32 bits in DCL and 35 bits in ACL.
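These per-set counts can be checked with a quick tally under the stated assumptions (4-way set, 2-bit fixed cost fields, 3-bit computed cost fields, and the 5 ETD bits read as a 4-bit partial tag plus a valid bit alongside a 2-bit fixed cost per ETD entry); the breakdown of the 5 bits is our reading of the text.

    \begin{align*}
    \text{BCL:} \quad & 4 \times 2 \;(\text{fixed}) + 1 \times 3 \;(\text{computed } \Delta\text{cost}) = 11 \text{ bits}\\
    \text{GD:}  \quad & 4 \times 2 \;(\text{fixed}) + 4 \times 3 \;(\text{computed}) = 20 \text{ bits}\\
    \text{DCL:} \quad & 11 \;(\text{same in-cache fields as BCL}) + 3 \times (4 + 1 + 2)_{\text{ETD entry}} = 32 \text{ bits}\\
    \text{ACL:} \quad & 32 + 2\text{-bit counter} + 1 \text{ reserved bit} = 35 \text{ bits}
    \end{align*}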
5.3 Summary

In this chapter, we have thoroughly evaluated the on-line cost-sensitive replacement algorithms in the simple case of two static miss costs, using trace-driven simulations to understand their behavior. Overall, the behavior of the on-line cost-sensitive algorithms resembles the behavior of CSOPT with respect to the high-cost access fraction and the cost ratio. We have observed that DCL and ACL yield significant cost savings over LRU across various cost ranges and cache configurations. Interestingly, the miss rates of the proposed algorithms are sometimes lower than LRU's, because our on-line algorithms take advantage of poor locality prediction by LRU by actively seeking fruitful reservations, contrary to CSOPT, whose locality prediction is always accurate.

We have then applied the algorithms to the L2 caches of a multiprocessor with ILP processors, using the miss latency as a cost function. Using execution-driven simulations of CC-NUMA multiprocessors with a simple latency prediction scheme, we have shown that our cost-sensitive algorithms can significantly improve the execution time of parallel applications and that the required hardware is very cost-effective.

Chapter 6

IMPROVING MEMORY PERFORMANCE OF ILP PROCESSORS

6.1 Targeting Miss Penalty in ILP Processors

There are many possible applications for on-line cost-sensitive replacement algorithms. In this chapter, we investigate the case of ILP processors, improving their memory performance by taking advantage of the penalty difference between a store miss and a load miss. In a processor with a properly designed and enabled store buffer, the penalty of stores is mostly hidden, because a store retires as soon as it reaches the top of the active list, whereas, in the case of loads, the processor needs the value back before the load can retire. It is therefore advantageous to displace load misses with store misses by giving higher replacement priority to cache blocks that are accessed next by a store instruction. In the process, the total miss rate may increase, but we expect that the aggregate penalty of all misses will be lessened.

In this context, the architectural setup would be equivalent to assuming a cost model in which the penalty of stores is zero and the penalty of loads is one, and to minimizing the aggregate penalty of all misses, which in this case means minimizing the number of load misses. Thus, under this infinite cost ratio, a naive approach to minimizing the number of load misses is to blindly replace one of the blocks that will be accessed next by a store whenever blocks that will be accessed next by a load are present.

However, one problem with this simple replacement policy is that it can cause a dramatic increase in store misses, and most of them may be unnecessary for reducing load misses, because a block replaced due to its next store must be brought back into the cache by a store miss under a write-allocate policy. Thus, if the store miss to the replaced block occurs before the load to the reserved block, the policy simply increases the number of store misses without saving load misses. Moreover, it can even increase the number of load misses if a load to the replaced block follows shortly after the store miss, so that the block is unavailable for the load. Additionally, the increase in store misses can adversely affect memory performance by consuming more memory bandwidth and traffic in practical situations.

Thus, the goal of a replacement policy targeting the penalty difference between a store miss and a load miss is to reduce the number of load misses while avoiding undue increases in store misses. The goal could be achieved by assigning a realistic cost to each store miss. However, as mentioned earlier, the accurate measurement and prediction of memory access penalties are difficult problems. In our approach, we instead artificially assign a static, finite penalty ratio between a store miss and a load miss and apply our LRU-based cost-sensitive replacement algorithms, i.e., DCL and ACL, to reduce the aggregate miss penalty.

To reap the benefits of this idea, we need to effectively predict the next access type (load or store) for every block in a cache set at the time of replacement. This can be done statically, through profiling at compile time, or dynamically, by maintaining Access Type Predictors (ATPs). ATPs have other applications besides penalty-sensitive cache replacement, but in this thesis we concentrate on cache replacements.
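One concrete reading of this cost model is that the next-access-type prediction is folded into the per-block miss cost that DCL or ACL already consumes. The sketch below is illustrative only: the ratio R and the names are assumptions, and the predictor itself is developed in the rest of the chapter.

    /* Illustrative penalty model: a block whose next access is predicted to be
     * a load is R times more expensive to miss on than one whose next access
     * is predicted to be a store.  R = 1 degenerates to plain LRU behavior.  */
    #include <stdbool.h>
    #include <stdint.h>

    #define LOAD_STORE_RATIO 4u         /* assumed static penalty ratio R */

    static uint32_t penalty_cost(bool next_access_is_load)
    {
        return next_access_is_load ? LOAD_STORE_RATIO : 1u;
    }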
6.2 Baseline Architecture

Figure 6.1 shows the baseline architecture. It consists of a processor with on-chip split instruction/data caches and a store buffer in parallel with the data cache, and a main memory connected through a system bus. The instruction cache is infinite, and the data cache is non-blocking and write-back.

Figure 6.1. Block diagram of the baseline architecture (CPU with I-cache, D-cache and store buffer, connected to main memory through the system bus)

The role of the store buffer is to eliminate the penalty of stores [22][48]. Among the possible write policies [22], we use write-allocate and fetch-on-write. When a store misses in the data cache, the store immediately writes its data into the store buffer if an entry is available, and the store instruction can retire. With this policy, the only penalty incurred by stores occurs when the write buffer is full. However, this penalty can be minimized even with a small store buffer if a proper retirement policy [48] is used.

6.3 Evaluation Methodology

We use trace-driven simulations to efficiently evaluate the various replacement policies and approaches to access type prediction. The traces are generated from eight Spec95 benchmarks [54]. Table 6.1 lists their main characteristics. For the benchmarks marked with an asterisk, the inputs have been modified from the default inputs to yield reasonable trace sizes and simulation times.

The traces are generated with the SimpleScalar tool set [7]. This simulator models an ILP processor with an ISA similar to the MIPS ISA. Fortran programs are translated into C programs using the f2c translator from AT&T. The benchmarks are compiled with gcc 2.6.3 with the O3 and loop-unrolling optimization options. The traces are gathered for the entire execution to avoid the problem of selecting a representative execution window. They include all user data references but no instructions. They also include the PCs of all memory references, in order to simulate access type predictions. Table 6.1 also shows the fraction of loads and stores relative to the total number of instructions, as well as the high-cost access fraction (HAF) of each benchmark. In this case, the HAF refers to the fraction of loads among all memory accesses. Most importantly, we notice that the HAF across benchmarks is noticeably large, ranging from about 60% to 90%. Thus we expect that the potential cost savings from targeting the penalty difference will not be impressive.

Table 6.1. The characteristics of the benchmarks

        Benchmark   Input        Inst. count (10^6)  Loads (%)  Stores (%)  HAF (%)  Load PCs  Store PCs
  Int   compress    train             292.4            21.7       13.1       62.4       762       569
        gcc         protoize.i        501.9            26.8       14.7       64.6     32417     14392
        go          train             546.7            21.1        7.6       73.6     11352      5385
        ijpeg       test              537.6            20.0        8.7       69.6      3966      2783
        li          train             183.3            25.9       16.6       61.0      1443      1141
        vortex*     train             742.7            30.5       22.3       57.8     14719     13157
  FP    apsi        train             662.6            24.0       13.9       63.3      6257      4377
        mgrid*      train             585.9            33.3        3.0       91.9      1903      1297

The cache is write-back, and blocks are allocated in the cache upon misses. The cache block size is 32 bytes and the cache associativity varies from 2 to 8. Selecting a cache size to evaluate replacement policies in a fair way is a difficult problem. If the working set is too small or too large, the replacement policy does not affect cache performance much.
In practice the effectiveness o f improved replacement policies varies significantly with the working set size profile of each application. Thus, we first vary the cache size from 16 Kbytes to 1 Mbytes rather than simply arbitrarily picking a few cache sizes, and then focus on a few representative cache sizes. Our primary performance metric is the reduction of the number of replacement load misses relative to the basic replacement policy (LRU). We only count replacement misses since cold misses are not affected by the replacement policy. A miss is counted only if the missing block was previously replaced from the cache. By doing this, the cache is automatically and precisely warmed up without blindly skipping an arbitrary number of references, and the performance comparisons are fair and noise-free with respect to replacement algorithms. We also measure the memory traffic to estimate the impact of replacement policies on memory bandwidth. 6.4 Perfect Access Type Prediction One key implementation issue in this study is to predict whether the next access to each block in a set will be a load or a store. When this prediction is 100% accurate, we maximize the number of replacement store misses displacing load misses, and we elimi nate the increase in the number of replacement load misses due to bad predictions. Thus, under perfect access type prediction, we can expect a maximum reduction in the number of replacement load misses across all feasible prediction schemes. In this section, we investigate the upper-bound of load miss improvement by DCL in the case o f infinite cost ratio, as we vary the cache size from 16 Kbytes to 1 Mbytes and the associativity from 2 to 8. Figure 6.2 shows three graphs per benchmark. To obtain these numbers, we first scan the traces and mark each memory access to a block with the next access type to the same block; then we use these augmented traces to simulate the system with perfect access 79 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. type predictions. The first graph shows the load miss rate (i.e., the total number of replace ment load misses by LRU divided by the total number o f loads). The second graph shows the load miss savings (i.e., the total number of replacement load misses saved by DCL over LRU). The third graph shows the load miss improvement by DCL over LRU (this is the load miss savings divided by the number of replacement load misses). First o f all we observe that the load miss improvement can be huge in some cases. For example, compress and ijpeg reach more than 80% improvement, and gcc and apsi more than 50%. At first these improvements were surprising, because the fraction of store misses that can displace load misses is rather small. However, the improvements are related to working set behavior rather than relative number o f load and store misses. By correlating the load miss rate and the load miss savings, we observe that the benchmarks can be classified into two groups. In the first group which includes gcc, go, vortex and apsi, the savings decreases almost monotonically with the load miss rate. This means that the savings opportunities are uniform across all cache sizes. In the second group which includes compress, ijpeg, li and mgrid, the load miss savings curves have one or two peaks. The peaks occur for 16-Kbyte caches in ijpeg and for 64-Kbyte caches in compress and li. Mgrid shows two peaks, one for 32-Kbyte and one for 5 12-Kbyte caches. 
These savings peaks can be explained by the data working set behavior of these applica tions. At a working set size transition (i.e., when the cache becomes large enough to con tain the next working set), DCL effectively increases the cache size for loads by providing more cache space for data next accessed by loads. Even if this effective cache size increase is relatively small the rapid fall o f the miss rate at the working set size transition is directly translated into a rapid rise in the load miss savings. In a range of cache sizes with no working set size transition, the savings are rather moderate and rather insensitive 8 0 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. » 100 7 — 7 8 - w a y x - x 4 - w a y — i 7TT~-g-.1t £ / £ / / / £ £ £ £ £ / / £ £ £ / / / / £ £ £ £ / / / 100.0 1 0 S O . O 7 8 - w a y x — k 4 - w a y T — i l i — 1 W — 1 vortex 7 — 78-w ay x— k 4-way A— A2-way 0.0 . d 4 ^ j j r - a i r & F ^ & « F & $ £ £ ■ y * ^ ^ £ g £ £ £ £ £ £ £ 7 — 7 8 - w a y X " x 4 - w a y A — A 2 * w a y 4 .0 1 1 0 0 .0 1 7 — 7 8 - w a y x — * 4 - w a y ‘ A — A 2 - w a y *— 1 [ 0 .0 t £ £ £ £ £ £ £ £ £ £ £ / / £ £ £ £ £ / / ^ £ £ £ £ / / Figure 6.2. Load misses in DCL with perfect access type prediction (r - inf) to the cache configuration. For very large cache sizes, the data of the application fits in the cache and the load miss savings reaches zero as replacements become very scarce. 81 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. In most benchmarks, the load miss improvement slowly increases as the cache size increases and shows a peak before it quickly approaches zero. With smaller cache sizes, the load miss savings is high, but the load miss rate is also high, resulting in moderate load miss improvement. With bigger caches, such that the miss rate is lower, the improvement may show large peaks, especially near working set size transitions. When the cache is extremely large the improvement usually vanishes with the miss rate. Needless to say, in the case where the miss rate is extremely low, the large load miss improvement (e.g., gcc with 256-Kbyte cache) would have little impact on the memory performance. The savings and the improvement increase with cache associativity because the replacement policy has more effect on caches with larger associativity. Table 6.2 shows the weighted arithmetic average of the load miss improvement across all the benchmarks. This table shows that, on the average, the rate o f improvement increases with the cache size and associativity. Thus the cost effectiveness of the hardware needed to implement penalty-sensitive policies improves with the complexity of the cache. 16KB 32KB 64KB 128KB 256KB 512KB 1MB 2-way 3.36 3.91 4.40 5.76 5.46 17.16 10.27 4-way 5.33 9.49 8.21 10.04 9.20 20.30 45.05 8-way 6.58 10.12 10.39 13.37 12.25 25.63 51.77 Table 6.2. Average o f load miss improvement rate (%) by DCL 6.5 Instruction-based Access Type Prediction A 100% accurate prediction is impossible to achieve in practical systems. The next access type must be predicted either statically, dynamically or both (hybrid). In this sec tion, we attach next access type predictions to memory access instructions, as was done in other prediction schemes [11][14][24][25][60][64]. 8 2 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
Figure 6.3 shows the motivation for attaching the next access type predictions to load and store instructions in the code. In the simple access sequence in Figure 6.3[a], a load and a store access the same block alternatively. If this pattern is very regular and common to many cache blocks, we can safely predict the next access type solely based on the PC (program counter) of the memory access instruction. In this example, we just need to keep track of which type of memory access instruction followed the load or the store. If the PC of the current access to any block is 100, then the next access type to the same block is predicted as store, and if the current PC is 120, then the next access type is pre dicted as load. Note that the predictions for this sequence can be done easily at compile time, by profiling the execution for example. 120 store 100 load 100 load 120 store [a] [b] Figure 6.3. Examples o f access type sequence In the second sequence as shown in Figure 6.3[b], each instruction is repeated sev eral times before moving to the next instruction. This case is more complex to predict than the previous sequence and compiler-based prediction may fail here. (In general, sequences may be much more unpredictable; in some cases the next memory access to the same block may have many different PCs in a data dependent manner.) Even for a very regular access pattern such as shown in Figure 6.3[b], a dynamic prediction scheme is needed and access history must be kept. For instance, suppose each instruction of Figure 6.3[b] executes twice per each iteration of the outer loop. After sev eral iterations, the next access type history o f PC 100 will be “010101”, where ‘O’ indi- 83 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. cates a load and ‘1’ indicates a store. From the history pattern we can predict that the next access by PC 100 will be a load. Note that in practice predictions must be made on cache blocks instead of individual words. In the following we develop simple static and dynamic access prediction schemes. In the dynamic case we present various design alternatives with cost considerations. 6.6 Static Prediction We use a simple profiling scheme. We first run the program with a given data set and profile it. During the profiling stage, we simply count the number o f times that the next access to the same block following each memory access instruction is either a load or a store. The static prediction for each memory access instruction is derived from the fol lowing two values: / the fraction o f loads to the same block following the instruction, and T the threshold of acceptance. If / > T, the prediction is load and the PC for that prediction is deemed static. If 1-/ > T, the prediction is store and the PC for that prediction is deemed static. Otherwise, the instruction is not predictable and the PC is deemed dynamic. Table 6.3 shows the static prediction accuracy of this simple approach in the case of an 128-Kbyte 8-way cache. To analyze the performance of the static prediction we use the following four metrics: (i) the fraction of memory instructions which are deemed static (column “pc”), (ii) the fraction of data references accessed by static memory instructions (column “ld/st”), (iii) the fraction of stores covered by static prediction (column “st”), and (iv) the fraction of stores excluding MRU hits which are accurately predicted (column “st- n”). 
The rationale for these latter numbers is that predictions are never needed after MRU hits, and thus predictions following MRU hits are useless. To see the effect of the threshold T, we vary T from 0.999 to 0.95. We see that when T = 0.999, which allows less than 0.1% of errors, the weighted average across all the benchmarks shows that 81% of PCs are static and accurately cover 69% of memory references and 67% of stores. As we increase the error rate to around 5%, the static PCs cover up to 74% of memory references on average. Overall, the coverage of stores is lower than the coverage of loads and the gap increases as the error rate increases.

                  T = 0.999                    T = 0.99                     T = 0.95
program     pc    ld/st   st    st-n     pc    ld/st   st    st-n     pc    ld/st   st    st-n
compress  92.98  77.96  75.86   6.89   93.52  78.67  75.86   6.89   94.74  88.69  90.25  18.14
gcc       77.44  61.94  53.12  17.71   79.45  67.28  57.09  24.33   82.93  73.32  65.19  49.72
go        75.49  53.10  53.22  20.81   78.19  60.56  55.34  21.84   83.31  70.50  58.38  23.17
ijpeg     88.49  68.92  67.08  27.20   89.99  80.15  68.18  27.58   92.21  86.74  76.66  42.34
li        80.58  63.64  61.21  26.69   82.49  71.97  65.55  57.28   83.85  76.94  69.33  62.03
vortex    84.84  70.10  71.25  31.86   85.88  75.06  76.50  37.16   88.12  78.47  79.70  70.48
apsi      90.10  73.57  73.68  33.18   90.92  76.01  74.05  33.22   92.35  79.48  75.07  34.07
mgrid     88.91  76.03  69.90   7.72   91.23  87.67  69.94   7.73   93.56  94.09  70.81   7.76
Average   81.30  68.74  66.89  17.98   83.03  72.28  68.63  23.47   86.08  74.38  70.12  32.53

Table 6.3. Coverage by static prediction

A reverse interpretation of the prediction results in Table 6.3 is that 19% of PCs show dynamic behavior and that they cover 31% of memory references and 34% of stores when T = 0.999. Thus the number of dynamic PCs is far smaller than the number of static PCs, but each dynamic PC covers more references and even more stores. Furthermore, when considering the stores excluding MRU hits, the coverage by static prediction is extremely low, especially in compress and mgrid. On average it ranges from 18% to 33%. This means that the static behavior is concentrated on MRU blocks. Accesses to non-MRU blocks and accesses with wide inter-reference gaps are more dynamic. This strongly suggests that dynamic prediction must be used to improve the coverage, especially for stores excluding MRU hits, although static prediction can be quite accurate in general.

Figure 6.4. Load miss improvement with static prediction (T = 0.99); per-benchmark curves for 2-way, 4-way, and 8-way caches.

We now apply the static predictions with a 1% error rate to DCL. For dynamic PCs, the access type is predicted as load. This scheme results in near perfect predictions of the load access type and almost eliminates the cases in which loads are mispredicted as stores. Figure 6.4 shows the load miss improvement by DCL over LRU with this static prediction and with r infinite.
To judge the overall performance of the static prediction, we compare Figure 6.2 (perfect prediction) with Figure 6.4 (static prediction) in light of the coverage of stores that are not MRU hits shown in Table 6.3. Compress and mgrid show almost no improvement due to their low coverages. In general the curves for load miss improvement with static prediction have the same shape as the curves for perfect prediction, but the improvements are much lower. Notable exceptions are li and vortex, in which the static prediction reaches 50% of the gains obtained with perfect prediction. We also observe that the improvements with static prediction are relatively better with small caches in gcc, vortex and apsi. This is because more stores are covered, since smaller caches have fewer MRU blocks.

6.7 Dynamic Prediction

To predict the next access type dynamically we explore dynamic ATP (Access Type Predictor) structures based on up to two levels of access type history. Figure 6.5 shows the general structures of these dynamic ATPs. The hardware consists of three tables: a PC table (PCT), an access type history table (ATHT) and a pattern history table (PHT).

Figure 6.5. General structures of dynamic ATPs: [a] one-level; [b] two-level, global PHT; [c] two-level, per-PC PHT. In each case the PCT is indexed with the block address and supplies the PPC.

These structures are not very different from one-level or two-level branch predictors [64], except for the addition of the PCT at the front end. The PCT is indexed with the data block address and yields the PC used to last access the block, called the Previous PC or PPC. The remaining tables maintain up to two levels of access type history. The ATHT keeps track of the next access type history for each memory instruction and is indexed with the PPC, using a simple hashing function. Every time the next access is a load, a 0 is shifted into the ATHT entry. Every time the next access is a store, a 1 is shifted into the ATHT entry. Each entry in the PHT contains a saturating two-bit up-down counter, incremented every time the access is a store and decremented every time it is a load. The prediction outcome is store if the counter is greater than one.

Four schemes are evaluated. The first dynamic ATP (Figure 6.5[a]) uses one level of history with no ATHT. The second ATP (Figure 6.5[b]) maintains two levels of access type history, by analogy with the structure of a PAg branch predictor [64]. In this case, the size of the PHT depends on the size of each ATHT entry. Aliasing can occur in the PHT when the values of several entries in the ATHT are the same. To reduce this aliasing the PHT must be structured into multiple tables and these tables must be indexed with individual PCs or sets of PCs [64], leading to the third ATP (Figure 6.5[c]), which is identical to the second ATP except that different PHT tables are used for different groups of PCs. Thus the size of the PHT tables may be large. The last ATP ("hybrid") is a hybrid scheme in which static predictions with a 0.1% error rate override the predictions of the third ATP.
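As a rough software model of these structures, the C sketch below implements the lookup and training flow of the second ATP (two-level history with a global PHT). The table sizes, hash function and history length are arbitrary choices for the example; only the flow (block address to PCT, PPC to ATHT history, history to PHT counter) follows the description above, and none of the names come from the thesis.

    #include <stdint.h>
    #include <stdbool.h>

    #define PCT_ENTRIES  4096            /* indexed by block address (size illustrative) */
    #define ATHT_ENTRIES 1024            /* indexed by hashed PPC */
    #define HIST_BITS    6               /* access type history kept per ATHT entry */
    #define PHT_ENTRIES  (1 << HIST_BITS)

    static uint32_t pct[PCT_ENTRIES];    /* PPC: last PC to access each block */
    static uint8_t  atht[ATHT_ENTRIES];  /* history shift register: 0 = load, 1 = store */
    static uint8_t  pht[PHT_ENTRIES];    /* two-bit saturating up-down counters */

    unsigned pct_index(uint64_t blkaddr) { return (unsigned)(blkaddr % PCT_ENTRIES); }
    unsigned hash_ppc(uint32_t ppc)      { return (ppc >> 2) % ATHT_ENTRIES; }

    /* Predict the type of the next access to a block: true = store, false = load. */
    bool predict_next_is_store(uint64_t blkaddr)
    {
        uint32_t ppc  = pct[pct_index(blkaddr)];      /* previous PC for this block */
        uint8_t  hist = atht[hash_ppc(ppc)];          /* its access type history */
        return pht[hist] > 1;                         /* counter value 2 or 3 means store */
    }

    /* Train on an access of the given type to blkaddr by the instruction at pc. */
    void train(uint64_t blkaddr, uint32_t pc, bool is_store)
    {
        uint32_t ppc  = pct[pct_index(blkaddr)];
        unsigned ai   = hash_ppc(ppc);
        uint8_t  hist = atht[ai];

        if (is_store  && pht[hist] < 3) pht[hist]++;  /* update the counter selected */
        if (!is_store && pht[hist] > 0) pht[hist]--;  /* by the previous history     */

        /* Shift in the new outcome and remember the current PC as the new PPC. */
        atht[ai] = (uint8_t)(((hist << 1) | (is_store ? 1u : 0u)) & (PHT_ENTRIES - 1));
        pct[pct_index(blkaddr)] = pc;
    }

The one-level ATP of Figure 6.5[a] would drop the ATHT and index the PHT directly with the hashed PPC; the per-PC scheme of Figure 6.5[c] would select among several PHTs according to the PPC.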
There are several possible implementations for the PCT. It could be integrated into the data cache or it could be a separate table. In the first case, the PPC field is simply added to the tag of the blockframe.

6.7.1 Prediction Updates and History Updates

The prediction update is a simple table lookup in the PCT, the ATHT (except for the one-level scheme), and the PHT. The PHT entry is then used to predict the store subset in the cache. The store subset contains the blocks that will be accessed next by a store. The PCT is updated on every cache access by storing the current PC into the PCT at the location indexed by the block address. To update the history, the PCT is indexed by the current block address, the PPC from the PCT entry indexes the ATHT and/or the PHT, and their entries are then updated.

Since replacements are invoked only upon cache misses, the latest time at which the store subset must be identified is just after a miss occurs. The predictions made just after an MRU hit are useless, because no replacement takes place then. The MRU blocks are particularly critical because most cache hits are on the MRU blocks and these hits happen in streaks. So we only change the store subset on MRU changes.

MRU changes [51] occur when non-MRU blocks are hit or just after a cache miss. Figure 6.6 illustrates these situations. Upon a cache hit to a non-MRU block C, as shown in Figure 6.6[a], the MRU block A and the second MRU block B are moved down while block C takes the MRU position. Figure 6.6[b] shows the case where block E misses in the cache and takes the MRU position. In both cases, the next access type of block A is predicted, and the change in the store subset is limited to block A, which was previously the MRU block. In the case shown in Figure 6.6[b], the victim is then selected based on the new store subset.

Figure 6.6. Timing of access type transition: LRU stacks (MRU block A, B, C, LRU block D) for [a] a hit to non-MRU block C and [b] a miss on block E.

The second operation is to update the history in the ATHT and the PHT. Because predictions are made at the time of an MRU change, history modification on every cache access (including MRU hits) may introduce noise in the prediction. As we will see, it is better to update the history for the blocks that move into the MRU position at the time of an MRU change, such as blocks C and E in Figure 6.6.
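The timing rules above can be summarized in a short pseudo-C sketch of the per-access handler of a dynamic-MRU ATP. The helper functions are placeholders for the mechanisms described in this section (predict_next_is_store and train correspond to the table walks sketched earlier); this is our paraphrase of the text, not a committed hardware design.

    #include <stdint.h>
    #include <stdbool.h>

    /* Placeholder helpers standing in for mechanisms described in the text. */
    bool     is_mru_hit(uint64_t blkaddr);                        /* hit on the MRU block?        */
    uint64_t outgoing_mru_block(void);                            /* block losing the MRU slot    */
    bool     predict_next_is_store(uint64_t blkaddr);             /* ATP lookup                   */
    void     mark_store_subset(uint64_t blkaddr, bool is_store);  /* per-block store-subset bit   */
    void     update_pct(uint64_t blkaddr, uint32_t pc);           /* record current PC as PPC     */
    void     train(uint64_t blkaddr, uint32_t pc, bool is_store); /* ATP history update           */

    /* Called on every access to a cache set. */
    void on_cache_access(uint64_t blkaddr, uint32_t pc, bool is_store)
    {
        if (is_mru_hit(blkaddr)) {
            /* MRU hit: no replacement can follow, so neither a new prediction
             * nor a history update is made; only the PCT records the current PC. */
            update_pct(blkaddr, pc);
            return;
        }

        /* Non-MRU hit or miss: an MRU change is about to happen. */
        uint64_t old_mru = outgoing_mru_block();

        /* The next access type of the outgoing MRU block is predicted, and the
         * change in the store subset is limited to that block. */
        mark_store_subset(old_mru, predict_next_is_store(old_mru));

        /* The history is updated for the block moving INTO the MRU position
         * (blocks C and E in Figure 6.6); this also refreshes its PPC. */
        train(blkaddr, pc, is_store);

        /* On a miss, the victim is then selected using the updated store subset. */
    }

Note that the PCT update happens on every access, while predictions and history updates are confined to MRU changes, exactly as argued above.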
6.7.2 Prediction Accuracy (Infinite Hardware)

In this section we assume that an infinite amount of hardware is available for each possible ATP structure, so that no aliasing occurs. To select a particular prediction scheme, we measure the prediction accuracy for each access type, since each has a different impact. Although both predictions are closely related to each other, the accuracy of store predictions affects the number of load miss savings opportunities, whereas the accuracy of load predictions affects the number of accesses that are wrongly predicted as stores and hence increase the number of load misses.

6.7.2.1 Dynamic-ALL ATPs

We first consider ATP structures which update the history on every reference. We call these ATPs dynamic-ALL ATPs. Table 6.4 shows the misprediction rate per access type for every reference and for the references excluding MRU hits, in a 128-Kbyte 8-way cache and assuming infinite ATP hardware.

First we focus on the misprediction rate including MRU hits (left part of Table 6.4). Overall, the prediction accuracy is excellent for all prediction schemes. As expected, the ATPs using a two-level history outperform the ATP with a one-level history. From the comparison between the second and the third schemes, we observe that there is sizable room for improvement if the aliasing in the PHT is eliminated. The predictions of the hybrid scheme are slightly improved over the third scheme. Overall the misprediction rate on stores is higher than the rate on loads.

                 Dynamic-ALL with MRU hits                          Dynamic-ALL without MRU hits
           one-level   global PHT  per-PC PHT   hybrid       one-level    global PHT    per-PC PHT    hybrid
program    ld%   st%   ld%   st%   ld%   st%    ld%   st%    ld%    st%    ld%    st%    ld%    st%    ld%    st%
compress  1.60  4.46  1.55  3.73  1.52  3.42   1.52  3.41   11.08  51.26  13.34  53.52  13.44  49.82  13.44  49.81
gcc       3.60  7.35  3.36  6.67  2.35  6.53   2.35  6.40    4.07  27.46   4.35  27.57   3.40  30.20   3.40  30.00
go        4.09 15.36  4.20 16.41  3.45 14.01   3.45 13.93    5.58  41.79   5.76  45.40   5.45  42.85   5.45  42.78
ijpeg     1.55  3.20  1.32  3.17  1.05  2.97   1.05  2.91   12.60  41.40  12.82  41.21  12.66  40.84  12.66  40.77
li        4.73  6.69  4.19  6.77  2.99  4.85   2.99  4.83    7.65  28.83   6.25  28.09   5.37  28.11   5.37  28.06
vortex    3.37  6.46  1.79  3.24  0.93  2.12   0.93  2.07    1.05  16.39   1.02  15.55   0.84  15.92   0.84  15.40
apsi      5.24  9.62  4.17  7.96  3.16  6.18   3.16  6.15    7.90  40.46   6.68  37.24   6.05  35.95   6.05  35.92
mgrid     0.59  8.44  0.59  8.19  0.38  8.20   0.58  8.15   11.15  52.99  11.18  52.88  11.18  52.92  11.18  52.90
Average   3.01  7.53  2.43  6.09  1.80  5.03   1.80  4.97    6.45  40.46   6.52  40.03   6.22  39.31   6.22  39.22

Table 6.4. Misprediction rate by dynamic-ALL ATPs (infinite hardware)

If we focus now on the predictions for the references that are not MRU hits (right part of Table 6.4), the misprediction rate is much higher across all the benchmarks, especially for stores. This is because stores have a higher hit rate on MRU blocks than loads, especially when compilers with high optimization levels (such as the one we use) attempt to cluster stores. As a result, the more accurate predictions for MRU hits are removed. Unfortunately, the numbers without MRU hits in Table 6.4 are more indicative of the performance of the predictors for the problem addressed in this study.

6.7.3 Dynamic-MRU ATPs

In this section we focus on dynamic ATPs which update the history and predict access types only on MRU changes. We call these dynamic-MRU ATPs. Table 6.5 shows their misprediction rates in a 128-Kbyte 8-way cache (we have reproduced the right part of Table 6.4 for easy comparison). The data show that, on average, the predictions for non-MRU accesses are noticeably improved by dynamic-MRU ATPs. One-level history outperforms two-level history with a global PHT and is also almost as good as two-level history with per-PC PHTs. The predictions of the hybrid scheme are slightly improved over the third scheme, as we observed for dynamic-ALL ATPs.
                       Dynamic-MRU                                  Dynamic-ALL without MRU hits
           one-level   global PHT  per-PC PHT   hybrid       one-level    global PHT    per-PC PHT    hybrid
program    ld%   st%   ld%   st%   ld%   st%    ld%   st%    ld%    st%    ld%    st%    ld%    st%    ld%    st%
compress  8.12 48.60  8.50 55.32  8.50 53.19   8.48 53.47   11.08  51.26  13.34  53.52  13.44  49.82  13.44  49.81
gcc       3.61 19.62  4.62 20.96  2.20 30.14   2.20 29.11    4.07  27.46   4.35  27.57   3.40  30.20   3.40  30.00
go        3.13 33.29  3.07 37.31  2.19 41.24   2.19 40.96    5.58  41.79   5.76  45.40   5.45  42.85   5.45  42.78
ijpeg     7.40 25.08  9.05 25.04  6.67 25.05   6.67 24.70   12.60  41.40  12.82  41.21  12.66  40.84  12.66  40.77
li        5.62 22.37  4.74 23.84  2.00 24.45   1.99 24.11    7.65  28.83   6.25  28.09   5.37  28.11   5.37  28.06
vortex    0.71 11.09  0.68 11.57  0.55 15.78   0.55 14.58    1.05  16.39   1.02  15.55   0.84  15.92   0.84  15.40
apsi      3.14 14.48  3.52 13.40  2.19 11.02   2.19 10.96    7.90  40.46   6.68  37.24   6.05  35.95   6.05  35.92
mgrid    10.75 34.23 10.39 34.74  9.72 34.88   9.72 34.88   11.15  52.99  11.18  52.88  11.18  52.92  11.18  52.90
Average   4.64 25.67  4.78 26.92  3.92 27.83   3.92 27.56    6.45  40.46   6.52  40.03   6.22  39.31   6.22  39.22

Table 6.5. Misprediction rate by dynamic-MRU ATPs (infinite hardware)

The number of MRU blocks in the cache has an impact on history updates in the case of dynamic-MRU ATPs. Figure 6.7 shows the misprediction rate of two dynamic-MRU ATPs (one-level and two-level per-PC PHT) as the number of MRU blocks (or cache sets) varies from 64 to 16K. The misprediction rates mostly increase with the number of MRU blocks, since more references are filtered out by MRU hits, such that the remaining references are more sparse and difficult to predict. The results also indicate that the two-level scheme is not very effective as compared to the one-level scheme.

Figure 6.7. Misprediction rate by dynamic-MRU ATPs with different numbers of sets.

Figure 6.8 shows the load miss improvement by DCL with the two dynamic-MRU ATPs and with r infinite. Overall the improvement rates are very close, with a few exceptions. In apsi, the two-level ATP performs better due to better predictions for both loads and stores, as shown in Figure 6.7. The results show very low or even negative improvement at several design points, especially with eight-way caches. These results are very different from the results with perfect predictions, in which eight-way caches yield high improvements. The negative improvements in ijpeg and apsi are mostly due to cache conflicts between several blocks whose access types are wrongly and consistently predicted as store and which are accessed alternately many times. If this happens, caches with larger associativity suffer more from the conflicts by leaving other blockframes in a set underutilized. As the cache size increases these conflicts are reduced and DCL improves.

Figure 6.8. Load miss improvement with dynamic-MRU ATPs (infinite hardware): [a] one-level dynamic ATP; [b] two-level per-PC PHT dynamic ATP.
In summary, we observe that the one-level scheme does as well as the two-level per-PC scheme and outperforms the two-level global scheme. This advocates the use of the one-level scheme, since its implementation cost is significantly lower than the cost of the two-level schemes.

6.7.4 One-level Dynamic-MRU ATPs (Finite Hardware)

With finite hardware, integrating the PCT with the cache in the one-level scheme is simpler and ensures that we at least have the PPCs of all blocks present in the cache, which are the blocks most likely to be accessed next. When the PCT is separate, it may not contain the PPCs of all the blocks in the cache, unless its organization is such that it includes the cache. When the PCT is integrated into the cache, its number of entries is the number of cache blocks and the PPC is not available for blocks which are not currently in the cache. Thus we cannot update the history upon cache misses, we can only update it on non-MRU hits, and the prediction accuracy may be degraded. Moreover, in the case of an access to a block just replaced from a near-MRU position due to a store prediction, the history cannot be updated because we do not have the PPC, and the prediction remains unchanged. Of course, the smaller the cache, the worse these effects are.

To alleviate this problem, we need to allocate more memory for the PCT. Keeping with the idea that it is advantageous to match the content of the PCT with the cache content, we add PCT entries in each cache set for blocks that have just been replaced. We call this approach the extended PPC directory (EPD), which is similar to the shadow directory proposed in [12] and [56]. The purpose of the shadow directory is to help smart prefetching; the role of the EPD is to support penalty-sensitive replacement policies. The EPD is physically implemented as a part of the cache, but each entry consists of only a PPC field and the block address tag.

It is interesting to note that the PCT now serves a dual purpose, as it can also be used to prefetch. We have not explored the use of the PCT to prefetch selectively based on access type prediction. These open possibilities show the advantage of merging the PCT with the cache.

When a block is replaced, its PPC value is moved into an entry of the EPD. The EPD is maintained by LRU. If an entry for a block missing in the cache set is found in the EPD, the history is updated with the matching PPC. Otherwise the history update is simply skipped. In our evaluation, the number of entries in the EPD is identical to the number of blocks in a cache set. Thus the hardware overhead of the EPD goes down with the block size.
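Structurally, the integrated PCT plus EPD amounts to a small amount of extra per-set state. The C sketch below only illustrates that organization under the assumptions of this section (one EPD entry per blockframe in the set); the struct and function names are ours and the field widths are arbitrary.

    #include <stdint.h>
    #include <stdbool.h>

    #define ASSOC 8                       /* blockframes per set (illustrative) */

    /* Cache blockframe with the PPC field folded into the tag array. */
    struct blockframe {
        uint64_t tag;
        bool     valid;
        uint32_t ppc;                     /* PC of the last access to this block */
        /* ... replacement state, coherence state, data, ... */
    };

    /* EPD entry: only an address tag and the PPC of a recently replaced block. */
    struct epd_entry {
        uint64_t tag;
        bool     valid;
        uint32_t ppc;
    };

    /* One cache set: as many EPD entries as blockframes, maintained by LRU. */
    struct cache_set {
        struct blockframe blocks[ASSOC];
        struct epd_entry  epd[ASSOC];
    };

    /* On a replacement, the victim's PPC migrates into the EPD (LRU push). */
    void epd_insert(struct cache_set *set, const struct blockframe *victim)
    {
        for (int i = ASSOC - 1; i > 0; i--)      /* shift implements the LRU order */
            set->epd[i] = set->epd[i - 1];
        set->epd[0].tag   = victim->tag;
        set->epd[0].ppc   = victim->ppc;
        set->epd[0].valid = victim->valid;
    }

    /* On a miss, look up the missing block in the EPD so that its history can
     * still be updated; if no entry is found, the history update is skipped. */
    bool epd_lookup(const struct cache_set *set, uint64_t tag, uint32_t *ppc_out)
    {
        for (int i = 0; i < ASSOC; i++) {
            if (set->epd[i].valid && set->epd[i].tag == tag) {
                *ppc_out = set->epd[i].ppc;
                return true;
            }
        }
        return false;
    }

Because an EPD entry holds only a tag and a PPC, and the PPC needs only enough bits to index the PHT, the per-set overhead stays small; the next paragraphs quantify how many PPC bits are worthwhile.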
To determine how many bits should be in the PPC field, we evaluate the effect of the number of entries in the PHT. Figure 6.9 shows the weighted average of the misprediction rates across all the benchmarks for one-level dynamic-MRU ATPs with and without an EPD, for various numbers of PHT entries. As expected, the misprediction rates of both ATPs decrease as the number of bits in the PPC field increases. The prediction with the EPD is always better, for both load and store predictions.

Figure 6.9. Misprediction rate by one-level dynamic ATP with and without EPD (load and store misprediction rates versus the number of PHT entries).

Figure 6.10 shows the improvement of load miss rates in DCL with a PHT of 8K entries; the PPC is therefore 13 bits long. Overall the system with the EPD outperforms the system without the EPD, as expected. Also, the performance of dynamic prediction with the EPD is very close to the performance of the dynamic-MRU ATP with infinite hardware shown in Figure 6.8. Exceptions are gcc and go, in which the number of distinct PCs is significantly larger than in the other benchmarks. Li and vortex perform well in both cases.

Figure 6.10. Load miss improvement by one-level dynamic ATP with 8K-entry PHT: [a] without EPD; [b] with EPD.

6.8 Injecting Finite Cost Ratios

So far, we have designed various access type predictors and evaluated them in DCL with r infinite. From the results on prediction accuracy and load miss improvements, we conclude that the one-level dynamic-MRU ATP with an EPD is the most cost-effective scheme among the various ATP schemes we have considered. In this section, we vary the cost ratio r from 2 to infinite to understand how DCL and ACL suppress the increase of store misses while reducing the number of load misses.
Overall, the results indicate that DCL can effectively reduce the number of load misses with a minimal increase in store misses by injecting a small cost into load misses. The graphs also show that ACL is very effective in suppressing the increase in store misses while the load miss improvement is marginally reduced. In mgrid, the increase in store misses is extremely small as compared to the reduction of load misses. 97 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. ijpeg gcc compress 5? 6.0 52.0 28.0 32.0 39.0 21.0 24.0 ■§ 3.0 26.0 14.0 16.0 13.0 8.0 0 .0 1 0.01 0.0 cost ratio apsi cost ratio mgrid cost ratio cost ratio vortex 32.0 10.0 S’ 7.5 o - a s t m is s (DCL) »— « s t m is s (ACL) o - e ld m is s (D CL) < 1 — fid m is s (ACL) 2.1 24.0 7.5 16.0 5.0 5.0 0.7 0 .0 . O.Oi 0.0 0.0 cost ratio cost ratio cost ratio cost ratio [a] perfect prediction compress vortex a - o s t m iss (D CL) *—« s t m iss (A CL) o - e l d m iss (D CL) ■ t — fid m is s (A CL) [b] one-level Dynam ic-M RU with 8K-entry PHT Figure 6.11. Relative miss rate changes by DCL and ACL (16-Kbyte 4-way cache) In the case of the one-level ATP, the miss rate changes are scaled down but the shape of the graphs follows the shape o f the perfect prediction, except for ijpeg and mgrid. In ijpeg and mgrid, ACL has almost the same performance as LRU, while avoiding the negative improvements observed in DCL. 9 8 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 6.6 shows the RV rate and the RV success rate by DCL and ACL with r = 2. Overall the RV rate and the RV success rate are very low mainly due to a large fraction of loads in all applications. These low rates are directly translated into low load miss improvements. With the perfect ATP, ACL as compared to DCL effectively cuts unfruit ful reservations by yielding lower RV rates but higher RV success rates across all bench marks especially in mgrid. With the one-level ATP, both the RV rate and RV success rate are reduced due to the misprediction o f access type. In mgrid, the performance o f ACL is exceptional in suppressing unfruitful reservations. Overall, we observe that injecting small cost ratio instead of infinite cost ratio is very advantageous. ACL yields very reliable performance across all benchmarks by sup pressing unfruitful reservations even with large high-cost access fractions. comp gcc go ijpeg li vortex apsi mgrid RV rate DCL 15.3 22.3 20.5 20.1 13.4 10.3 23.8 13.0 perfect ACL 6.5 11.5 16.4 13.8 6.0 4.6 11.7 2.0 ATP RVS rate DCL 8.2 16.6 26.8 25.7 8.7 9.9 13.9 9.0 ACL 11.9 25.5 28.3 29.4 11.9 15.0 16.3 39.2 RV rate DCL 10.8 22.0 17.2 24.3 11.4 10.7 24.8 14.0 one-level ACL 5.1 11.1 13.8 13.3 5.3 4.8 11.0 0.4 ATP RVS rate DCL 6.8 15.4 24.7 15.1 8.2 10.3 9.7 0.3 ACL 8.6 23.3 25.9 20.6 11.1 16.4 12.9 6.9 Table 6.6. RV rate and RVS rate by DCL and ACL (r = 2) 6.9 Summary In this chapter, we have presented a practical application of the LRU-based cost- sensitive replacement algorithms in the case o f a uniprocessor system taking advantage of the difference in cost between loads and stores. In this case, the cost prediction corre sponds to the prediction of the next access type to each block. 9 9 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
Contrary to the case of the multiprocessors targeting non-uniform miss latencies, we achieved very marginal cost savings, mainly due to the large fraction of high-cost accesses in our applications and the low accuracy of the access type predictors. Moreover, the required hardware schemes are deemed less cost-effective. However, we found that ACL yields very reliable memory performance even when the cost savings opportunities are rare, by effectively suppressing unfruitful reservations and taking advantage of a small cost ratio.

Chapter 7
RELATED WORK

7.1 Targeting Miss Count

Replacement algorithms to minimize the miss count in finite-size storage systems have been extensively studied. A variety of replacement algorithms have been proposed, and a few of them, such as LRU or one of its approximations [49][51][56] with lower implementation overhead, are widely adopted in caches. Lately, several cache replacement algorithms to further reduce the miss count of LRU, as well as other approaches to managing caches, have been proposed. These proposals are motivated by the performance gap between LRU and OPT [31], and often require large amounts of hardware to keep track of a long access history.

O'Neil et al. [40] proposed the LRU-K algorithm for database disk buffering to combine the access recency and the access frequency of pages. LRU-K replaces the block whose access distance to the last K-th reference is the largest among the blocks in the buffer. LRU-1 is identical to the classic LRU algorithm. Since LRU-K must keep the history of blocks not in the buffer, LRU-K practically maintains the history for a certain extended time period. Simulation results using an OLTP database trace show that LRU-2 outperforms LRU-1.

Lee et al. [28] further investigated merging LRU and LFU, and proposed the LRFU algorithm. LRFU basically takes into account the contribution of every past reference, rather than only the last K-th reference, to better integrate access recency and frequency. However, LRU-K and LRFU are much less applicable to processor caches due to their implementation overhead, and they are better suited for software-controlled buffer management.

Phalke and Gopinath [44], motivated by the high compression ratios observed in address traces, proposed schemes to predict the inter-reference gap to the same cache block and to replace the block with the largest inter-reference gap. Since their schemes work with a lengthy history of access distances per block, they are less applicable to high-speed on-chip caches, albeit they can obtain large performance gains.

Wong and Baer [60] proposed instruction-based prediction schemes to predict the locality of each cache block. PCs are physically associated with cache blocks and used to index a locality history table. Blocks that do not show locality are considered first for replacement over others; however, MRU blocks are never replaced.

Tyson et al. [57] proposed cache bypassing. In their scheme, memory operations that generate many misses are first identified. Then the cache blocks accessed by those memory operations bypass the caches so that blocks with high locality stay in the cache longer.

Gonzalez et al. [15] proposed the use of independent caches per locality type to avoid the conflicts between blocks with different types of locality.
This approach introduces the problem of cache sizing and, more importantly, may lower the utilization of the caches, as the number of blocks with a specific locality type varies with the application.

Mounes-Toussi and Lilja [36] evaluated state-based cache replacement algorithms under the MESI protocol. They found that a static replacement priority based on cache coherence states, combined with the Random policy, shows a very marginal miss rate improvement over the Random policy. However, they did not address the cost associated with each cache state, and their evaluation was limited to the effect on the miss rate.

7.2 Targeting Miss Cost

Recently, the problem of replacement algorithms has been revisited in the context of emerging applications and systems with variable miss costs. Albers et al. [2] classified general caching problems into four models in which the size and/or the latency of data can vary. They proposed several approximate solutions to these problems.

In the context of disk paging, Young proposed GreedyDual [65]. GreedyDual was later refined by Cao and Irani [9] in the context of web caching to reduce the miss cost. They found that size considerations play a more important role than locality in reducing miss cost. However, when applied to processor caches with small, constant data transfer sizes, our results show that GreedyDual is far less cost-efficient than other, more locality-centric algorithms, especially when the cost ratio is small.

Srinivasan et al. [53] addressed the performance issues caused by critical loads in ILP processors. Critical loads are loads that have a large miss penalty in ILP processors. They proposed schemes to identify blocks accessed by critical loads. Once detected, such critical blocks are stored in a special cache upon cache replacement, or their stay in the cache is extended. They found that modifying the replacement policy to extend the lifetime in the cache of critical blocks does not help much, due to the large working set of critical blocks. Whether the algorithms proposed in this thesis would fare better in the context of that study is unclear.

In this thesis, our focus has been on multiprocessor systems. Moga and Dubois [34] evaluated the effectiveness of network caches in reducing remote memory stalls. They showed that the use of small but fast network caches or remote victim caches can yield improvements in remote memory stall comparable to the use of large DRAM network caches. This advocates that the proper use of the existing caches is of great importance and can reduce remote memory stalls in a very cost-effective manner.

Arkin and Silverberg [3] analyzed the complexity of a job-machine mapping problem with n jobs and k identical machines, where each job is associated with a value and fixed start and end times. They showed that the problem can be solved in O(n^(k+1)) time. CSOPT can be reformulated as this job-machine mapping problem and shows a similar worst-case time complexity.

7.3 Trace Sampling and Cache Evaluation Techniques

Mattson et al. [31] introduced stack algorithms based on the inclusion property. They showed that LRU, OPT and Random are stack algorithms and that a rapid evaluation of alternative caches is possible under certain restrictions. CSOPT is extended from their OPT and is partially based on their proof of OPT.
Hill and Smith [17] extended the stack algorithm to rapidly evaluate uniprocessor caches with different associativities. They presented the effect of varying cache associativity in detail based on trace-driven simulations, and they attempted to generalize miss rate behavior with respect to associativity.

Wu and Muntz [62] further extended the work by Hill and Smith [17] to rapidly evaluate LRU caches in multiprocessors, with an emphasis on invalidations. However, their method is not applicable to cost-sensitive algorithms when non-uniform miss costs are considered.

Puzak [45] proposed trace stripping and set sampling. Trace stripping is based on a filter cache: only the references that miss in the filter cache are collected, and the cache miss ratio can still be evaluated accurately. He also showed that trace sampling on one tenth of the cache sets leads to a reliable estimate of the miss rate.

Wang and Baer [58] proposed one-pass simulation techniques using trace reduction to efficiently simulate alternative write-back caches and to accurately measure write-back counts and data traffic. Their trace reduction techniques can also be applied to multiple-block-size traces and multiprocessor traces.

Chame and Dubois [10] proposed processor sampling techniques based on cache inclusion for large-scale multiprocessor systems. They examined whether cache inclusion can be maintained for different set mapping functions and showed that traces on a smaller number of sets can be expanded to a larger number of sets if the caches are evaluated using stack algorithms.

7.4 Prediction Schemes

Many prediction schemes utilize instructions to better capture program behavior. In hardware prefetching schemes [11][14], the prefetch stride is predicted per memory access instruction. Kaxiras and Goodman [24] investigated the idea of using PCs to predict various cache coherence activities in multiprocessors. Lai and Falsafi [25] proposed a last-touch predictor based on a path-based branch predictor [39] to trigger self-invalidations. Tyson et al. [57] proposed using PCs to identify the blocks that should bypass the caches. Wong and Baer [60] used PCs to predict memory access locality to improve the miss rate over the LRU algorithm.

The structure of the access type predictors in this thesis is similar to the structure of branch predictors. Yeh and Patt [64] introduced two-level adaptive branch predictors based on the correlation among branch instructions. Nair [39] proposed path-based branch predictors.

Later, many studies focused on improving two-level branch predictors by reducing interference and aliasing in the hardware history tables. McFarling [32] proposed the gshare predictor, in which the branch address and the branch history register are XORed to index the second-level history table. Sprangle et al. [52] proposed the agree predictor, in which the second-level table predicts whether the branch outcome agrees with the outcome of the first execution of the branch. Lee et al. [27] proposed the bi-mode predictor. It consists of several branch predictors, and each predictor is devoted to predicting a specific outcome. Michaud et al. [33] proposed the gskew predictor to reduce conflicts in the second-level table, based on the principle of skewed-associative caches. The above schemes could be applied to ATPs to improve the prediction accuracy.
Johnson and Hwu [20] proposed the use of microblocks, based on the addresses of memory references, to determine the fetch size. Mukherjee and Hill [38] applied two-level branch predictors to the prediction of the next coherence messages.

The idea behind the shadow directory [56][12] is to gather extended locality information for blocks already replaced from the cache. This information is then used for smart prefetching or replacement decisions. In our algorithms, the extended tag directory used to depreciate the cost of reserved blocks and to update the access type history of blocks replaced from the cache is similar to the shadow directory.

So and Rechtschaffen [51] looked at the effects on MRU blocks. They claim that the working set changes when the MRU block changes and that accesses to MRU blocks dominate over non-MRU blocks in various kinds of programs.

Karlin et al. [23] introduced competitive snooping algorithms to optimize the snooping overhead in multiprocessors. Cached blocks that incur snooping overhead due to accesses from remote processors are removed from the cache based on a dynamic cost adjustment similar to our cost depreciation scheme.

The prediction of the next access type has also been addressed in other papers, for different purposes. Mowry [37] proposed exclusive-mode prefetches in multiprocessors to save separate ownership requests if the prefetched blocks will be written next. These prefetches are determined statically through compiler analysis. The prediction of migratory sharing [24][55] is closely related to the prediction of the store access type.

Chapter 8
CONCLUSIONS

In this dissertation we have developed and evaluated new cache replacement algorithms that improve the aggregate miss cost rather than the aggregate miss count in the face of multiple, non-uniform miss costs.

In CSOPT, we introduced the concept of blockframe reservation to trade off high-cost and low-cost misses, and pruning schemes in the search of an optimal replacement sequence to make the algorithm feasible. CSOPT has been thoroughly evaluated, both through theoretical analysis and through trace-driven simulations, to characterize its behavior in the context of CC-NUMA multiprocessors. We have identified the ranges of the high-cost access fraction (HAF) and of the cost ratio in which CSOPT is effective. The simulation results for all selected SPLASH-2 benchmarks with random and first-touch cost assignments indicate that the room for improvement from taking miss costs into account is significant and is not limited to a certain cache configuration or to specific application programs. Our experiments strongly advocate that tuning the replacement algorithm is a very cost-effective performance enhancement as compared to other hardware schemes.

CSOPT is unrealizable in real systems. However, the design concepts in CSOPT to optimize the total miss cost and the evaluation of the benchmarks with CSOPT have given useful hints and guidelines for improving existing cache replacement algorithms.

Then we have introduced new on-line cost-sensitive cache replacement algorithms extended from LRU. The algorithms integrate locality and cost based on two key ideas: blockframe reservation and cost depreciation.
From trace-driven simulations with SPLASH-2 benchmarks, we observe that our on-line cost-sensitive algorithms yield large cost savings over LRU across various cost ranges and cache configurations. In the application to a multiprocessor with ILP processors, execution-driven simulations show significant reductions in execution time when cache replacements aim to minimize miss latency instead of miss count. We also observed that the miss rates of our cost-sensitive algorithms are sometimes lower than LRU's, because our algorithms take advantage of bad locality predictions by LRU by pursuing successful reservations aggressively.

Additionally, we have applied our algorithms to uniprocessor systems to reduce the aggregate miss penalty, motivated by the observation that the penalty of stores is mostly hidden in modern processors. This application relies on the prediction of the next access type to each block. We have explored various access type predictors based on static profiling, instruction-based dynamic schemes, and hybrid schemes. Unfortunately, most of the accurate predictions are made on MRU hits, and these predictions are useless for replacement. Nevertheless, we have explored three types of dynamic ATP, mostly inspired by branch predictors, and found the best configurations to reap the maximum average benefits across eight SPEC95 benchmarks. Using the best predictor we achieved very marginal cost savings, mainly due to the large fractions of high-cost accesses in our applications and the low access type prediction accuracy. However, we have found that ACL yields very reliable memory performance even when the cost savings opportunities are rare, by effectively suppressing unfruitful reservations and taking advantage of a small cost ratio.

The major strength of our algorithms comes from their simplicity and careful design. Although their hardware cost varies with the target cost function and the cost prediction scheme, the added hardware cost is generally very marginal and their effect on cache access time is negligible. The tight integration of the cache hierarchy inside modern processor chips further facilitates the implementation of our algorithms.

The application domain of our algorithms is very broad. They are readily applicable to the management of various kinds of storage where various kinds of non-uniform cost functions are involved. Moreover, in contrast to the approaches that divide caches into several regions or add special-purpose buffers to treat blocks in different ways [53][15], cost-sensitive replacement algorithms with properly defined cost functions can maximize cache utilization, which is extremely difficult to achieve in schemes relying on cache partitioning.

ATPs can be used for other purposes, especially in the context of multiprocessors. With knowledge of the next access type, prefetches could be issued to reduce the cost of ownership transfers. Similarly, if each memory request is tagged with its next access type, home nodes or remote nodes serving the request can optimize future actions in advance, not only by knowing the next access type following the external request but also by predicting their own next access type to the same block. In this way, migratory sharing [55] and invalidation actions [25] can be better optimized.

There are many open questions left to research.
In the arena of multiprocessor memory systems, we can imagine more dynamic situations than the ones evaluated here. First, the memory mapping of blocks may vary with time, adapting dynamically to the reference patterns of the processes in the application, as is the case with page migration and COMAs [13]. Second, we can imagine that node bottlenecks and hot spots in multiprocessors could be adaptively avoided by dynamically assigning very high costs to blocks accessible in congested nodes. Other areas of application are power optimization in embedded systems or bus bandwidth optimization in bus-based systems. The memory performance of CC-NUMA multiprocessors may be further enhanced if we can measure the memory access penalty instead of the latency and use the penalty as the target cost function. If we could predict the nature of the next access to a cached block, we could assign a high cost to critical load misses and a low cost to store misses and non-critical load misses, based on the measure of their penalty. Of course, the combination of ILP processors and a multiprocessor environment provides richer optimization opportunities for cost-sensitive replacements.

Finally, although our evaluations have focused on a specific level of cache, cost-sensitive replacement algorithms may be useful at every level of the memory hierarchy, including L1 caches with or without the inclusion property [5], in both multiprocessors and uniprocessors. The general approach of pursuing high-cost block reservation and depreciating cost to resolve locality effects could also be applied to other replacement algorithms besides LRU.

Bibliography

[1] S. Abraham, R. Sugumar, D. Windheiser, B. Rau and R. Gupta, "Predictability of Load/Store Instruction Latencies," In Proceedings of the 26th International Symposium on Microarchitecture, pp. 139-152, December 1993.

[2] S. Albers, S. Arora and S. Khanna, "Page Replacement for General Caching Problems," In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1999.

[3] E. M. Arkin and E. B. Silverberg, "Scheduling Jobs with Fixed Start and End Times," Discrete Applied Mathematics, vol. 18, pp. 1-8, 1987.

[4] P. Bannon, "Alpha EV7: A Scalable Single-Chip SMP," Presented at Microprocessor Forum, October 1998.

[5] L. Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

[6] L. Belady, "A Study of Replacement Algorithms for a Virtual-Storage Computer," IBM Systems Journal, vol. 5, no. 2, pp. 78-101, 1966.

[7] D. Burger and T. Austin, "The SimpleScalar Tool Set, Version 2.0," Computer Sciences Dept. Tech. Report #1342, Univ. of Wisconsin-Madison, June 1997.

[8] D. Burger, J. R. Goodman, and A. Kagi, "Limited Bandwidth to Affect Processor Design," IEEE Micro, pp. 55-62, November/December 1997.

[9] P. Cao and S. Irani, "Cost-Aware WWW Proxy Caching Algorithms," In Proceedings of the 1997 USENIX Symposium on Internet Technology and Systems, pp. 193-206, December 1997.

[10] J. Chame and M. Dubois, "Cache Inclusion and Processor Sampling in Multiprocessor Simulations," In Proceedings of ACM Sigmetrics, pp. 36-47, May 1993.

[11] T. Chen, "Data Prefetching for High-Performance Processors," Ph.D.
Dissertation, Tech. Report 93-07-01, Dept. of CSE, University of Washington, July 1993.

[12] J. Collins and D. Tullsen, "Hardware Identification of Cache Conflict Misses," In Proceedings of the 32nd International Symposium on Microarchitecture, November 1999.

[13] D. Culler, J. P. Singh and A. Gupta, "Parallel Computer Architecture," Morgan Kaufmann Publishers Inc., 1999.

[14] J. Fu, J. Patel and B. Janssens, "Stride Directed Prefetching in Scalar Processors," In Proceedings of the 25th International Symposium on Microarchitecture, pp. 102-110, December 1992.

[15] A. Gonzalez, C. Aliagas and M. Valero, "A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality," In Proceedings of the ACM International Conference on Supercomputing, July 1995.

[16] S. Herrod, "Using Complete Machine Simulation to Understand Computer System Behavior," Ph.D. Dissertation, Stanford University, February 1998.

[17] M. D. Hill and A. J. Smith, "Evaluating Associativity in CPU Caches," IEEE Transactions on Computers, vol. 38, no. 12, pp. 1612-1630, December 1989.

[18] T. Horel and G. Lauterbach, "UltraSPARC-III: Designing Third-Generation 64-Bit Performance," IEEE Micro, pp. 74-85, May-June 1999.

[19] D. Jiang and J. P. Singh, "A Methodology and an Evaluation of the SGI Origin 2000," ACM Sigmetrics Performance '98, Madison, Wisconsin, June 1998.

[20] T. Johnson, M. Merten and W. Hwu, "Run-time Spatial Locality Detection and Optimization," In Proceedings of the 30th International Symposium on Microarchitecture, December 1997.

[21] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," In Proceedings of the 17th Annual International Symposium on Computer Architecture, pp. 364-373, May 1990.

[22] N. Jouppi, "Cache Write Policies and Performance," In Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 191-201, May 1993.

[23] A. Karlin, M. Manasse, L. Rudolph and D. Sleator, "Competitive Snoopy Caching," In Proceedings of the 27th Annual IEEE Symposium on Foundations of Computer Science, 1986.

[24] S. Kaxiras and J. Goodman, "Improving CC-NUMA Performance Using Instruction-Based Prediction," In Proceedings of the 5th International Symposium on High-Performance Computer Architecture, pp. 161-170, January 1999.

[25] A. Lai and B. Falsafi, "Selective, Accurate, and Timely Self-Invalidation Using Last-Touch Prediction," In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

[26] J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," In Proceedings of the 24th International Symposium on Computer Architecture, pp. 241-251, June 1997.

[27] C. Lee, I. Chen and T. Mudge, "The Bi-Mode Branch Predictor," In Proceedings of the 30th International Symposium on Microarchitecture, pp. 4-13, December 1997.

[28] D. Lee, J. Choi, J. Kim, S. Noh, S. Min, Y. Cho and C. Kim, "On the Existence of a Spectrum of Policies that Subsumes the LRU and LFU Policies," In Proceedings of the 1999 ACM SIGMETRICS Conference, pp. 134-143, May 1999.

[29] D. Lenoski et al., "The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor," In Proceedings of the 17th International Symposium on Computer Architecture, pp. 148-159, May 1990.

[30] T.
Lovett and R. Clapp, "STiNG: A CC-NUMA Computer System for the Commercial Marketplace," In Proceedings of the 23rd International Symposium on Computer Architecture, pp. 308-317, May 1996.

[31] R. L. Mattson, J. Gecsei, D. R. Slutz and I. L. Traiger, "Evaluation Techniques for Storage Hierarchies," IBM Systems Journal, vol. 9, pp. 78-117, 1970.

[32] S. McFarling, "Combining Branch Predictors," Technical Report TN-36, Compaq Western Research Lab., June 1993.

[33] P. Michaud, A. Seznec and R. Uhlig, "Trading Conflict and Capacity Aliasing in Conditional Branch Predictors," In Proceedings of the 24th International Symposium on Computer Architecture, pp. 292-303, June 1997.

[34] A. Moga and M. Dubois, "The Effectiveness of SRAM Network Caches in Clustered DSMs," In Proceedings of the Fourth International Symposium on High-Performance Computer Architecture, pp. 103-112, February 1998.

[35] Motorola Inc., "MPC750 RISC Microprocessor User's Manual," Motorola Inc., August 1997.

[36] F. Mounes-Toussi and D. Lilja, "The Effect of Using State-Based Priority Information in a Shared-Memory Multiprocessor Cache Replacement Policy," In Proceedings of the International Conference on Parallel Processing, pp. 217-224, August 1998.

[37] T. Mowry, "Tolerating Latency Through Software-Controlled Data Prefetching," Ph.D. Dissertation, Stanford University, March 1994.

[38] S. Mukherjee and M. Hill, "Using Prediction to Accelerate Coherence Protocols," In Proceedings of the 25th Annual International Symposium on Computer Architecture, pp. 179-190, June 1998.

[39] R. Nair, "Dynamic Path-Based Branch Correlation," In Proceedings of the 28th International Symposium on Microarchitecture, pp. 15-23, December 1995.

[40] E. O'Neil, P. O'Neil, and G. Weikum, "The LRU-K Page Replacement Algorithm for Database Disk Buffering," In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 297-306, May 1993.

[41] V. Pai, P. Ranganathan and S. Adve, "RSIM Reference Manual," Technical Report 9705, Department of Electrical and Computer Engineering, Rice University, August 1997.

[42] V. Pai, P. Ranganathan, S. Adve and T. Harton, "An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors," In Proceedings of ASPLOS-VII, October 1996.

[43] D. Patterson and J. Hennessy, "Computer Architecture: A Quantitative Approach," Morgan Kaufmann Publishers Inc., 1995.

[44] V. Phalke and B. Gopinath, "Compression-Based Program Characterization for Improving Cache Memory Performance," IEEE Transactions on Computers, vol. 46, no. 11, pp. 1174-1186, November 1997.

[45] T. R. Puzak, "Analysis of Cache Replacement Algorithms," Ph.D. Dissertation, University of Massachusetts, Amherst, MA, February 1995.

[46] A. Seznec and F. Lloansi, "About Effective Cache Miss Penalty on Out-of-Order Superscalar Processors," IRISA Report #970, November 1995.

[47] H. Sharangpani, "Intel Itanium Processor Microarchitecture Overview," Presented at Microprocessor Forum, October 1999.

[48] K. Skadron and D. Clark, "Design Issues and Tradeoffs for Write Buffers," In Proceedings of the 3rd International Symposium on High-Performance Computer Architecture, pp. 144-155, February 1997.

[49] A. J. Smith, "Cache Memories," ACM Computing Surveys, vol. 14, no. 3, pp. 473-530, September 1982.

[50] J. Smith and G.
Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, vol. 83, no. 12, pp. 1609-1624, December 1995.

[51] K. So and R. Rechtschaffen, "Cache Operations by MRU Change," IEEE Transactions on Computers, vol. 37, no. 6, pp. 700-709, June 1988.

[52] E. Sprangle, R. Chappell, M. Alsup and Y. Patt, "The Agree Predictor: A Mechanism for Reducing Negative Branch History Interference," In Proceedings of the 24th Annual International Symposium on Computer Architecture, pp. 284-291, 1997.

[53] S. T. Srinivasan, R. D. Ju, A. R. Lebeck, and C. Wilkerson, "Locality vs. Criticality," In Proceedings of the 28th International Symposium on Computer Architecture, pp. 132-143, July 2001.

[54] Standard Performance Evaluation Corporation, http://www.specbench.org.

[55] P. Stenstrom, M. Brorsson and L. Sandberg, "An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing," In Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 109-118, May 1993.

[56] H. S. Stone, "High-Performance Computer Architecture," 2nd Edition, Addison-Wesley Publishing Company, November 1990.

[57] G. Tyson, M. Farrens, J. Matthews and A. Pleszkun, "A Modified Approach to Data Cache Management," In Proceedings of the 28th International Symposium on Microarchitecture, pp. 93-103, December 1995.

[58] W. Wang and J. L. Baer, "Efficient Trace-Driven Simulation Methods for Cache Performance Analysis," ACM Transactions on Computer Systems, vol. 9, no. 3, pp. 222-241, August 1991.

[59] W. Weber, S. Gold, P. Helland, T. Shimizu, T. Wicki and W. Wilcke, "The Mercury Interconnect Architecture: A Cost-effective Infrastructure for High-performance Servers," In Proceedings of the 24th International Symposium on Computer Architecture, pp. 98-107, June 1997.

[60] W. Wong and J. Baer, "Modified LRU Policies for Improving Second-Level Cache Behavior," In Proceedings of the 6th International Symposium on High-Performance Computer Architecture, pp. 49-60, January 2000.

[61] S. Woo et al., "The SPLASH-2 Programs: Characterization and Methodological Considerations," In Proceedings of the 22nd International Symposium on Computer Architecture, pp. 24-36, June 1995.
A linked list P (lines 4 to 12) implements a search tree. Initially, P contains a single active node holding null blocks whose forward distances are set to L+1 and whose cost is zero. CSOPT (lines 14 to 18) scans the trace X by calling Scan_tree and returns the least cost. Scan_tree (lines 20 to 47) visits every active node in P and updates its cost and state at each reference. When a node misses on X[t], the prime candidate for replacement is the block whose forward distance is the largest among the unreserved blocks (a simplified sketch of this default choice is given after the listing). When a block frame is available for a reservation and there exist blocks with a lower next miss cost and a smaller forward distance (lines 32 to 36), the search tree expands by adding new nodes (lines 37 to 42). Lastly, Prune_search_tree is invoked if any node has a miss on X[t]. Prune_search_tree (lines 49 to 55) inspects every pair of active nodes, excluding pairs in which both nodes hit on X[t]. If either condition (lines 52 and 54) is met, the corresponding node is removed from the search tree, based on the theorem in Section 3.2.4.

The algorithm of CSOPT with multiple miss costs:

1  // X: address trace, D: forward distance trace
2  // C: current miss cost trace, F: next miss cost trace
3  // P[]: nodes in search tree, s: associativity
4  struct {
5      int tag[s];          // block address
6      int distance[s];     // forward distance
7      int f[s];            // next miss cost
8      int r[s];            // reservation state
9      int cost;            // aggregate cost
10     bool miss_flag;      // current state of cache hit/miss
11     int next_node;       // pointer to next node
12 } P[];                   // linked list
13
14 CSOPT()
15     for t ← 1 to L
16         Scan_tree(t)
17     min_cost ← min(P[].cost)
18     return min_cost
19
20 Scan_tree(t)
21     for every node in P[]
22         case cache hit
23             P[node].r[hit_pos] ← RELEASE
24             P[node].distance[hit_pos] ← D[t]
25             P[node].f[hit_pos] ← F[t]
26             P[node].miss_flag ← HIT
27         case cache miss
28             P[node].cost ← P[node].cost + C[t]
29             P[node].miss_flag ← MISS
30             sort P[node] by distance
31             pos ← position of the first unreserved block from the bottom of P[node]
32             RV_count ← count(P[node].r[] == RESERVED)
33             if (P[node].distance[pos] < L+1 and RV_count < s-1)
34                 for i = pos-1 to 1
35                     for every k, where i < k ≤ pos
36                         if (P[node].f[i] < P[node].f[k])
37                             new ← create_node()
38                             bcopy(P[node], P[new])
39                             P[new].r[pos] ← RESERVE
40                             P[new].tag[i] ← X[t]
41                             P[new].distance[i] ← D[t]
42                             P[new].f[i] ← F[t]
43             P[node].tag[pos] ← X[t]
44             P[node].distance[pos] ← D[t]
45             P[node].f[pos] ← F[t]
46     if (any node has miss_flag == MISS and the number of nodes > 1)
47         Prune_search_tree()
48
49 Prune_search_tree()
50     for every pair of nodes (k, m) in P[]
51         if (P[k].miss_flag == MISS or P[m].miss_flag == MISS)
52             if (P[k].cost + d(k→m) < P[m].cost)
53                 delete_node(P[m])
54             else if (P[m].cost + d(m→k) < P[k].cost)
55                 delete_node(P[k])
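To illustrate the default replacement choice described above, here is a small C sketch, again an illustration rather than the dissertation's code, of selecting the prime candidate in one set: among the frames that are not reserved, the one with the largest forward distance is picked, so an invalid frame (forward distance L+1) is chosen first. The frame layout and the name prime_candidate are assumptions for this example.

/* Sketch: the default victim choice on a miss.  Among the frames of one
 * set that are not reserved, the frame with the largest forward distance
 * is the prime candidate; ties are broken by lowest index.              */
#include <stdio.h>

#define S 4    /* associativity (example value) */
#define L 100  /* trace length (example value)  */

typedef struct {
    int tag;       /* block address                    */
    int distance;  /* forward distance of the block    */
    int reserved;  /* nonzero if the frame is reserved */
} frame_t;

/* Returns the index of the prime replacement candidate, or -1 if every
 * frame in the set is reserved.                                         */
int prime_candidate(const frame_t set[S])
{
    int victim = -1;
    for (int i = 0; i < S; i++) {
        if (set[i].reserved)
            continue;
        if (victim < 0 || set[i].distance > set[victim].distance)
            victim = i;
    }
    return victim;
}

int main(void)
{
    frame_t set[S] = {
        { 10, 12,    0 },   /* forward distance 12                        */
        { 11, L + 1, 0 },   /* invalid block: distance L+1                */
        { 12, 3,     1 },   /* reserved frame is never considered         */
        { 13, 40,    0 },
    };
    printf("victim frame = %d\n", prime_candidate(set));  /* prints 1 */
    return 0;
}

In the full algorithm this default is only the starting point: when a frame can still be reserved, Scan_tree also forks new nodes that reserve the far-away block and instead install X[t] over a block with a smaller forward distance but a lower next miss cost (lines 32 to 42), and Prune_search_tree later removes nodes that, by the theorem in Section 3.2.4, cannot lead to the least cost.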