EFFICIENT TECHNIQUES FOR SHARING ON-CHIP RESOURCES IN CMPS

by Ruisheng Wang

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (COMPUTER ENGINEERING)

August 2017

Copyright 2017 Ruisheng Wang

Abstract

With the emergence of cloud computing and the trend towards rising core counts on a single chip, an increasing number of server workloads are being consolidated onto a single multicore chip to share various costly on-chip resources, such as last-level caches, off-chip memory bandwidth and on-chip networks. The purpose of sharing is to improve efficiency and reduce cost. However, uncontrolled sharing of those on-chip resources could impose a series of negative effects on the system, such as degraded cache associativity for each sharer, unpredictable cache performance, unfair memory bandwidth occupancy and network deadlock, which defeats the purpose of sharing.

This research aims to improve system performance by sharing various on-chip resources in an efficient and effective way. Four techniques are proposed to improve the efficiency of sharing off-chip memory bandwidth, last-level caches and the on-chip network. First, an analytical performance model for partitioning off-chip memory bandwidth is proposed, from which four optimal memory bandwidth allocation policies are derived to maximize four system-level performance objectives, i.e., sum of instructions per cycle (IPCs), weighted speedup, fairness, and harmonic weighted speedup, respectively. Second, a replacement-based cache partitioning enforcement scheme, called Futility Scaling, is proposed, which can precisely partition the cache while still maintaining high associativity even with a large number of partitions. Third, a performance model is built to reveal the “streak effect” on the performance of cache protection policies. Based on the model and runtime reuse streak information, a cache protection policy that provides predictable performance is designed. Last, a virtual cut-through (VCT) switched Bubble Coloring scheme is proposed, which avoids both routing- and protocol-induced deadlocks without the need for multiple virtual channels while still enabling fully adaptive routing on any topology.

Acknowledgments

First and foremost, I would like to sincerely thank my advisor, Professor Timothy Pinkston, for his mentorship and support during my graduate study. He has been a great mentor to me and I learned a lot from him, including time management, critical thinking, effective writing and presentation, and more. I took longer than average to complete my Ph.D. and experienced panic and despair more than normal. I am very grateful for his patience and encouragement that helped me survive the toughest time of my Ph.D. journey. It is my privilege and great fortune to have him as my advisor.

I would like to extend my gratitude to Professor Murali Annavaram for providing me the opportunities to participate in his research group meeting and for the advice on my own research work. The discussions in his group meeting have broadened my vision on various research topics in the area of computer architecture, and his insightful advice has helped me to sharpen my work. I also would like to thank the rest of my qualifying and defense committee members: Aiichiro Nakano, Sandeep Gupta, and Jeffrey Draper, for their valuable feedback and suggestions on my thesis. I am also grateful to my collaborators, Dr. Yuho Jin and Dr.
Lizhong Chen. Yuho, who was a Postdoc in the group and is now an Assistant Professor at New Mexico State University, trained me on setting up my first full-system simulator. Lizhong, who was a Ph.D. student in the group and is now an Assistant Professor at Oregon State University, had many in-depth discussions with me on the relevant topics that are invaluable to my work.

I have been fortunate to gain real-world industry experience via a summer internship at the Ericsson ASIC group. I want to thank my manager Anubrata Mitra for providing me this internship opportunity, and my mentor Arun Balakrishnan for guiding me through the architecture of Ericsson’s next-generation network processor. I also thank all my officemates for building a conducive atmosphere for research: Hyeran Jeon, Qiumin Xu, Zhifeng Lin, Yue Shi, Fenxiao Chen, Kiran Matam and Krishna Giri Narra. I would like to acknowledge the rest of my colleagues and fellow graduate students for their inspirational discussions: Daniel Wong, Abdulaziz Tabbakh, Gunjae Koo, and Mohammad Abdel-Majeed.

Last but not least, I would like to thank my parents, who nurtured my passion for knowledge and always supported me in pursuing my goals. This thesis would not have been possible without all of these.

Contents

Abstract i
Acknowledgments iii
1 Introduction 1
1.1 Motivation 1
1.2 Research Contribution 4
1.3 Thesis Organization 5
2 Background and Related Work 6
2.1 Memory Bandwidth Partitioning 6
2.2 Cache Capacity Partitioning 7
2.3 Cache Protection Policies 9
2.4 Deadlock Avoidance 9
3 Multi-Objective Memory Bandwidth Partitioning 11
3.1 Necessity for Better Memory Bandwidth Partitioning 11
3.2 A Proposed Analytical Performance Model 13
3.2.1 General Structure of the Model 14
3.2.2 Harmonic Weighted Speedup 15
3.2.3 Fairness 16
3.2.4 Weighted Speedup 17
3.2.5 Sum of IPCs 18
3.3 Evaluation 19
3.4 Summary 25
4 High-Associativity Cache Partitioning 26
4.1 Partitioning-induced Associativity Loss 27
4.2 Futility Scaling 32
4.3 Feedback-based Scaling Factor Adjustment 33
4.3.1 Putting It All Together 34
4.4 Evaluation 36
4.5 Summary 40
5 Predictable Cache Protection Policy 41
5.1 Background and Motivation 41
5.1.1 Predictability of Cache Replacement Policy 41
5.1.2 Cache Protection Policies 43
5.2 Model 45
5.2.1 Overview 45
5.2.2 Reuse Streak Concept 45
5.2.3 Performance Model 47
5.2.4 Knee point 51
5.2.5 Discussion 52
5.3 Implementation 53
5.3.1 Overview 53
5.3.2 Application Profiling 55
5.3.3 Putting It All Together 57
5.4 Evaluation 59
5.4.1 System Configuration 59
5.4.2 Results 60
5.5 Summary and Future work 61
5.5.1 Summary 61
5.5.2 Future work 61
6 Low-cost Deadlock Avoidance Scheme 63
6.1 Need for Reducing Virtual Channel Cost 63
6.2 Bubble Coloring 66
6.2.1 Avoiding Routing-induced Deadlocks 66
6.3 Avoiding Protocol-induced Deadlocks 71
6.4 Evaluation 75
6.4.1 Execution time 76
6.4.2 Energy 77
6.4.3 Area 78
6.5 Summary 79
7 Conclusions and Future Research 80
7.1 Future Research 81
Bibliography 84

Chapter 1

Introduction

1.1 Motivation

In the era of cloud computing, a large amount of computing workloads are offloaded to the central “cloud”, which requires the servers behind the cloud to provide an efficient way to run those aggregated workloads [1, 2]. As the number of cores on a chip increases, multicore-enabled servers are becoming ideal hardware platforms for cloud computing by providing the ability to execute multiple concurrent workloads cost-effectively on a single chip. In this computing paradigm, various shared on-chip resources, such as last-level caches, off-chip memory bandwidth and the on-chip network, are competed for among concurrently running threads.
While the purpose of sharing those resources is to improve utilization and reduce cost of the chip area and power, uncontrolled sharing could impose a series of negative impacts on the system, such as unpredictable performance, compromised quality-of-service (QoS), degraded cache associativity, unfairness, starvation or even deadlock, which defeats the purpose of sharing. Nowadays, due to the lack of efficient ways to manage shared on-chip resources in chip multiprocessors (CMPs), most cloud servers only operate at less than 30% of their maximum utilization level in order to avoid negative effects caused by aggressive sharing and to maintain an acceptable level of system QoS [3, 4]. In order to improve server utilization, it is of paramount importance to manage the sharing of on-chip resources, including last-level caches, off-chip memory bandwidth and the on-chip network, in an effective and efficient way.

A complete solution for managing resource sharing logically consists of two parts: an allocation policy that decides the amount of resources that each sharer should have in order to satisfy specified QoS objectives [5, 6], and an enforcement scheme that decides how to actually enforce each sharer to occupy the amount of resources assigned by an allocation policy [7, 8]. For an allocation policy, an optimal resource assignment can be obtained by formulating a constrained optimization problem. The constraints are the total amount of various available resources, and the objective function can be expressed by resource sensitivity functions and a system-level performance objective function [9]. (The resource sensitivity function and the system performance objective function are similar to the response-time function and the penalty function in [9], respectively.) Resource sensitivity describes how the performance of a single thread changes with different amounts of various resources. The system-level objectives balance the performance among multiple threads. The commonly used system-level performance objectives include throughput-oriented objectives, such as sum of instructions per cycle (IPCs) and weighted speedup, fairness (usually measured by the lowest speedup among co-scheduled threads), and harmonic weighted speedup [10], which balances both throughput and fairness. An efficient allocation policy should be precise enough to predict the sensitivity of different applications to various resources while versatile enough to optimize a broad range of performance objectives for different scenarios. For off-chip memory bandwidth management, most existing works are based on heuristics (e.g., equal partitioning [11, 12]) without a precise model to predict the sensitivities of co-scheduled applications to memory bandwidth resources and, thereby, are unable to achieve the optimal result [11, 12, 13, 14, 15, 16, 17, 18]. To achieve the optimal memory bandwidth partitioning for different system-level performance objectives, a universal analytical model is needed to understand and optimize how off-chip memory bandwidth partitioning affects various performance objectives.

A good enforcement scheme should be able to support precise fine-grain resource partitioning without compromising resource efficacy. For example, an ideal enforcement scheme for cache partitioning should be scalable enough to divide a cache into hundreds or even thousands of partitions at the granularity of a single cache line without degrading the associativity of each partition significantly.
While many recent cache partitioning enforcement schemes have been proposed to support fine-grain partitioning, they suffer from severe associativity degradation when the number of partitions is high [7, 19]. To enable efficient sharing of large last-level caches (LLCs) in future large-scale CMPs, it is imperative to design a scalable high-associativity cache partitioning scheme.

A good enforcement policy should also provide a predictable resource sensitivity function for each sharer. Cache resource allocations are often enforced via cache replacement. Modern high-performance cache replacement policies often use cache protection to avoid cache thrashing. However, existing high-performance cache replacement policies are designed empirically and lack predictability [20, 21, 22], which makes them less desirable for partitioning of a shared cache. To allow a cache allocation policy to infer how the miss rate of a thread changes with different cache sizes efficiently (without trial-and-error), a cache protection policy that can provide predictable performance is needed.

Deadlock is a critical problem that needs to be avoided by resource allocation. Deadlocks in interconnection networks include routing-induced deadlocks, which are caused by cyclic dependence in routing functions, and protocol-induced deadlocks, which are caused by dependencies among different types of messages (e.g., a reply message depends on a request message). To avoid these network abnormalities, virtual channels (VCs) [23, 24] have been used extensively in many deadlock-avoidance schemes. However, a high VC count will incur a large overhead in router area, power and frequency. Therefore, it is important to devise efficient deadlock-free schemes that minimize the VC requirement in avoiding routing- and protocol-induced deadlocks.

1.2 Research Contribution

The research addresses various inefficiencies in sharing various on-chip resources, including the last-level cache, off-chip memory bandwidth and on-chip network. The main contributions of this dissertation are the following:

• For off-chip memory bandwidth management, an analytical performance model is proposed to establish the relationship between different memory bandwidth partitionings and various system-level performance objectives. Based on the proposed model, four optimal partitioning schemes are derived to maximize four system-level performance objectives, including weighted speedup, sum of IPCs, harmonic weighted speedup and fairness, respectively.

• For last-level cache management, this research identifies the associativity loss for replacement-based partitioning in large-scale CMPs. A novel cache partitioning scheme, named Futility Scaling, is proposed to largely maintain the cache associativity even with a large number of partitions.

• For designing predictable cache protection policies, this research introduces the concept of reuse streaks and identifies the streak effect on the performance of cache protection policies. A performance model is built to predict the hit rate and required cache size for a cache protection policy with the input of an insertion ratio and a protecting distance. A low-cost profiler is proposed to track the average reuse streak length of a cache access stream at runtime. Based on the model and the runtime information about the average reuse streak length, a practical cache protection policy that provides predictable performance is proposed.
• For on-chip network resource allocation, a virtual cut-through switched scheme, called Bubble Coloring, is proposed to avoid both routing- and protocol-induced deadlocks without the need for multiple virtual channels while still enabling fully-adaptive routing on any topology.

1.3 Thesis Organization

The remainder of this dissertation is organized as follows. Chapter 2 summarizes prior and ongoing research in the area of sharing of various on-chip resources. Chapter 3 presents an analytical performance model for off-chip memory bandwidth partitioning. Chapter 4 presents a high-associativity cache partitioning scheme. Chapter 5 presents a predictable cache protection policy. Chapter 6 presents the Bubble Coloring scheme, a low-cost deadlock avoidance technique that only needs one virtual channel to eliminate both routing- and protocol-induced deadlocks. Chapter 7 concludes this dissertation with a discussion of directions for future research.

Chapter 2

Background and Related Work

There have been many schemes proposed to efficiently and effectively manage various shared on-chip resources in CMPs. The related work for managing the last-level cache, off-chip memory bandwidth and on-chip network is summarized below.

2.1 Memory Bandwidth Partitioning

Generally, research work related to scheduling policies for memory requests can be categorized into two groups: (1) policies to increase the bandwidth utilization by reordering various types of memory requests and (2) policies to balance the performance of co-scheduled applications in a shared CMP context by partitioning off-chip memory bandwidth.

Proposals from the first category (like FR-FCFS [25] and Virtual Write Queue [26]) focus on improving memory bandwidth utilization by considering the characteristics of modern DRAM systems. The system throughput will increase if off-chip memory bandwidth utilization is improved. While FR-FCFS [25] scheduling reduces row buffer miss delay, Virtual Write Queue [26] mitigates write-to-read turnover delay. Both works improve the bandwidth utilization by reducing average memory access delay. The Minimalist Open-page policy [27] can increase bandwidth utilization by balancing locality and parallelism. Previous work in this category focuses only on improving overall system throughput without considering the Quality of Service (QoS) of each individual application (e.g., fairness) in a shared CMP context.

With the emergence of multi-programmed workloads for CMPs, the QoS of each independent workload becomes increasingly important. Various off-chip memory bandwidth partitioning schemes have been proposed to improve fairness among co-scheduled applications [11, 13, 14, 28]. Nesbit et al. [11] propose to divide bandwidth equally among all the applications to avoid starving low memory-intensive workloads. Mutlu et al. [13] propose the Stall-time Fair Memory Scheduler (STFM) to equalize the memory slowdowns experienced by co-scheduled applications. Most recent works related to memory scheduling focus on improving both throughput and fairness [14, 15]. Parallelism-Aware Batch-Scheduling (PARBS) [14] tries to improve overall QoS objectives without adversely affecting individual workload efficiency. Thread Cluster Memory (TCM) [15] scheduling improves both system performance and fairness by clustering different types of threads together. Self-Optimizing Memory Controllers [29] and MORSE [17] use a machine learning approach to select the best scheduling sequence.
Although heuristic-based memory scheduling schemes gain system performance by distributing bandwidth among co-scheduled applications in a better way, they do not explicitly specify how much bandwidth should be allocated to each application. Therefore, it is still unclear how bandwidth partitioning affects system performance and what the best partitioning schemes are for different performance objectives. The work of this thesis aims to understand how best to partition off-chip memory bandwidth in terms of various system-level performance objectives and to devise different optimal memory bandwidth partitioning schemes for different performance objectives.

2.2 Cache Capacity Partitioning

Broadly, cache partitioning enforcement schemes can be categorized into two groups: placement-based partitioning and replacement-based partitioning. Placement-based partitioning schemes [30, 31, 32] partition a cache by placing the cache lines from different partitions into disjoint physical cache regions. For example, way-partitioning or column caching [30] statically assigns physical cache ways to each partition. Although way-partitioning is straightforward to implement, it cannot support fine-grained partitioning, and cache associativity reduces rapidly as the number of partitions increases. To address these problems, reconfigurable caches [33] and molecular caches [31] are proposed to partition caches by sets instead of ways. Page Coloring [32] is another placement-based partitioning scheme that maps the physical pages of different applications to different cache sets. However, all these placement-based partitioning schemes have a common problem: there is a large overhead to resize partitions, i.e., cache data has to be flushed or moved when a partition changes its size.

Replacement-based partitioning schemes [8, 7, 19] partition a cache by adjusting the line eviction rate of each partition at replacement. For instance, Cache Quota Violation Prohibition (CQVP [7]) sets a quota for each partition and always chooses the cache lines from the partition that exceeds its quota to evict. Similarly, Probabilistic Shared-cache Management (PriSM [19]) controls the partitioning by adjusting the eviction probability of each partition based on its insertion rate and size deviation from its target. Vantage [8] stabilizes the size of each partition via controlling its “aperture”, where a larger “aperture” incurs a higher eviction rate. In general, if a cache controller evicts lines from a partition at a rate that is higher (lower) than its insertion rate, the size of the partition will shrink (expand). Since partition sizes are changing in the process of replacement, they can be controlled at the granularity of a single line and there is little cost for resizing. However, the existing replacement-based partitioning schemes either suffer from associativity degradation as the number of partitions increases [7, 19] or cannot precisely partition the whole cache [8, 19]. In this thesis, a replacement-based partitioning scheme that achieves both high associativity and precise sizing is proposed.

2.3 Cache Protection Policies

Last-level shared caches are managed through the cache replacement procedure. In order to avoid cache thrashing, modern high-performance cache replacement policies often adopt cache protection techniques, i.e., they protect part of, rather than the whole, working set.
Existing cache protection policies are designed empirically and thus lack predictability, which will result in inefficiencies of cache allocation policies. Generally, there are two flavors of cache protection policies. (1) An insertion-based policy controls the insertion ratio (ρ), i.e., what fraction of incoming lines is protected. An example is the Bimodal Insertion Policy (BIP [21]), which inserts every 1/32 of incoming cache lines (ρ = 1/32) into the MRU position for protection while the rest of the incoming lines are inserted into the LRU position for quick eviction. (2) A protecting-distance-based policy controls the protecting distance (d_p), i.e., how long existing lines are protected. An example is the Protecting Distance based Policy (PDP [22]), in which an inserted/reused line is protected for d_p accesses until its next reuse/eviction, and an incoming line will bypass the cache if no unprotected candidates are available. In order to predict the performance of a cache protection policy, a parameterized model with both ρ and d_p is needed. In this thesis, a model with the input of an insertion ratio and a protecting distance is proposed to predict the hit rate and required cache size of a cache protection policy. The model also requires cache access reuse streak information. Based on the model, a predictable cache protection policy (PCPP) is proposed.

2.4 Deadlock Avoidance

Routing-induced deadlocks occur when there is a knotted cyclic dependency between resources created by the packets transported through various paths in the network. To break such cyclic dependence, the turn model and its extensions [34, 35, 36] have been proposed, which reduce the degree of routing freedom by disallowing certain paths between source and destination nodes. In order to maintain full routing freedom (i.e., to support fully adaptive routing), Duato’s Protocol [37] was proposed to avoid routing-induced deadlock by using an additional virtual channel.

Deadlock freedom provided by the schemes mentioned above is based on the consumption assumption [38], where the end node will consume all packets from the network once the destination is reached. However, when interactions and dependencies are created between packets of different message classes at network endpoints, that assumption may not be valid and protocol-induced deadlocks (also known as message-dependent deadlocks) may occur [38]. The conventional solution for protocol-induced deadlocks is to have separate virtual networks for each message class [39]. As the number of message classes increases, the virtual channel (VC) requirement rises proportionally. Thus, current approaches for designing routing- and protocol-induced deadlock-free schemes with complete routing freedom require a large number of VCs. However, implementing such a large number of VCs in routers has considerable negative impacts on router area, power and frequency. Hence, it is important to provide a scheme to avoid routing- and protocol-induced deadlocks with minimal requirements on the number of virtual channels.

Chapter 3

Multi-Objective Memory Bandwidth Partitioning

In this chapter, a unified analytical model is proposed to reveal the relationship between off-chip memory bandwidth partitioning and various system-level performance objectives.
Based on this model, different optimal memory bandwidth partitioning schemes are derived to optimize a broad range of system objectives, including throughput-oriented metrics (i.e., sum of IPCs and weighted speedup), fairness, and harmonic weighted speedup. Experiments are conducted to compare the proposed schemes against previous schemes in multicore systems in terms of different performance objectives.

3.1 Necessity for Better Memory Bandwidth Partitioning

In the era of chip multiprocessors, more and more applications are co-scheduled on a single chip to share the off-chip memory bandwidth. This exacerbates the contention problem for shared off-chip memory bandwidth. Thus, sharing memory bandwidth among co-scheduled applications becomes increasingly important to overall system performance. Traditional memory scheduling schemes (e.g., FR-FCFS [25]) improve memory bandwidth utilization through biased scheduling, which suffers from serious starvation problems [11, 12]. Instead of merely increasing bandwidth utilization, various memory bandwidth partitioning schemes have been proposed to improve system QoS performance [11, 12, 13, 15]. However, these works do not have enough generality to answer the question of what is the best way to partition memory bandwidth for any arbitrary performance objective.

Figure 3.1: Harmonic weighted speedup (Hsp), weighted speedup (Wsp), sum of IPCs and minimum fairness with five bandwidth partitioning schemes of Equal, Proportional, Square_root, Priority_API and Priority_APC (workload libquantum-milc-gromacs-gobmk; results normalized to No_partitioning)

Figure 3.1 shows the potential impact of different memory bandwidth partitioning schemes on different system performance objectives. Four SPEC2006 applications (i.e., libquantum, milc, gromacs and gobmk) with five partitioning schemes, including Equal, Proportional, Square_root, Priority_API and Priority_APC, run on a four-core CMP. Detailed descriptions of each partitioning scheme can be found in Section 3.3. Four different system performance metrics, including harmonic weighted speedup, minimum fairness [40], sum of IPCs and weighted speedup, are compared among all the partitioning schemes. The detailed system configuration is provided in Section 3.3. All performance results are normalized to No_partitioning.

As is shown in Figure 3.1, different partitioning schemes favor different system performance objectives. Square_root yields the highest harmonic weighted speedup. Proportional partitioning has the best minimum fairness. Priority_APC is best for weighted speedup while Priority_API achieves the highest sum of IPCs. Equal partitioning, which is most commonly used in previous work, can improve the system performance in most cases compared to No_partitioning. However, it is not optimal for any system objective that is evaluated. As can be seen from the figure, there is no single partitioning scheme optimized for all the system performance objectives. Different partitioning schemes are required to optimize different objectives. It is, thus, important to understand how off-chip memory bandwidth partitioning affects different system performance objectives.

3.2 A Proposed Analytical Performance Model

Table 3.1: Notations used in this chapter.
Notation | Meaning
IPC_alone,i | Instructions Per Cycle that application i can achieve when it runs alone with the dedicated off-chip memory bandwidth
IPC_shared,i | Instructions Per Cycle that application i can achieve when it shares the off-chip memory bandwidth with other applications in a CMP
APC_alone,i | Memory Accesses Per Cycle for application i when it runs alone (i.e., API_i × IPC_alone,i)
APC_shared,i | Memory Accesses Per Cycle for application i when it shares the off-chip memory bandwidth with other applications in a CMP (i.e., API_i × IPC_shared,i)
API_i | Memory Accesses Per Instruction for application i
β_i | Fraction of total bandwidth assigned to application i (note that ∑_{i=1}^{N} β_i = 1)
B | Total utilized off-chip memory bandwidth (i.e., total memory accesses served per cycle)

In this section, the general structure of an analytical performance model for partitioning off-chip memory bandwidth is first presented. Then a series of optimal partitioning schemes for a broad range of system performance objectives, which includes weighted speedup (W_sp), sum of IPCs (IPC_sum), fairness and harmonic weighted speedup (H_sp), are derived based on the model. The terminology used throughout this chapter is listed in Table 3.1.

3.2.1 General Structure of the Model

The off-chip memory bandwidth that an application occupies in the system can be measured in terms of memory Accesses Per Cycle (APC). Note that this unit can be easily converted to more commonly used units for off-chip memory bandwidth, such as Bytes Per Second (B/s), if the CPU frequency and last-level cache line size are given. For example, assume the cache line size is 64 bytes and the CPU clock frequency is 5 GHz; then 0.01 APC equals 3.2 GB/s (i.e., GB/s = APC × Cache_Line_Size × CPU_Frequency). The performance of an application can be measured in terms of Instructions Per Cycle (IPC). The memory Accesses Per Instruction (API) of one application depends on the program itself and the input data set. The API value is not affected by bandwidth partitioning. Hence, the impact of the memory bandwidth usage of an application on performance (i.e., IPC) can be expressed in a simple equation:

$IPC = \frac{APC}{API}$   (3.1)

The API of the application can be measured online, and APC is controlled by the memory bandwidth partitioning. Equation (3.1) shows the relationship between bandwidth usage and performance. This equation also shows the sensitivity of an application to its off-chip bandwidth occupancy. Generally, the performance of an application with higher API is less sensitive to the bandwidth resource. Moreover, the total memory accesses per cycle (∑_{i=1}^{N} APC_shared,i) equal the total off-chip bandwidth utilized (B), so we have

$\sum_{i=1}^{N} APC_{shared,i} = B$   (3.2)

In the model, the total utilized bandwidth B is assumed to be a constant value with different memory bandwidth partitioning schemes. By doing this, the performance changes caused by bandwidth utilization improvement are factored out, which helps understand the system performance changes caused by pure memory bandwidth partitioning. If a performance objective can be expressed in terms of IPCs (e.g., sum of IPCs), then it can be translated into APC equations based on Equation (3.1). The total bandwidth constraint (i.e., Equation (3.2)) and the APC-based performance objective function (e.g., sum of IPCs) can be formulated into a constrained optimization problem.
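To make the structure of the model concrete, the short Python sketch below encodes the relations in Equations (3.1) and (3.2) together with the APC-to-GB/s conversion described above. It is a minimal illustration only, not part of the proposed system; the function names, variable names and sample values are assumptions introduced for this sketch.

```python
# Minimal sketch of the bandwidth-partitioning model (Eqs. 3.1 and 3.2).
# All names and sample numbers below are illustrative assumptions.

def apc_to_gbps(apc, line_size_bytes=64, cpu_freq_hz=5e9):
    """Convert memory Accesses Per Cycle to GB/s (GB/s = APC x line size x frequency)."""
    return apc * line_size_bytes * cpu_freq_hz / 1e9

def shared_ipc(api, apc_shared):
    """Equation (3.1): IPC = APC / API, applied per application."""
    return [apc / a for a, apc in zip(api, apc_shared)]

def check_bandwidth_constraint(apc_shared, total_bandwidth_b, tol=1e-9):
    """Equation (3.2): the per-application APCs must sum to the utilized bandwidth B."""
    return abs(sum(apc_shared) - total_bandwidth_b) < tol

if __name__ == "__main__":
    api = [0.02, 0.005]            # accesses per instruction for two co-scheduled apps
    apc_shared = [0.006, 0.004]    # bandwidth share given to each app, in APC
    print(shared_ipc(api, apc_shared))            # -> [0.3, 0.8] instructions per cycle
    print(check_bandwidth_constraint(apc_shared, 0.01))
    print(apc_to_gbps(0.01))                      # -> 3.2 GB/s, matching the text's example
```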
By solving this optimization problem, the optimal bandwidth partitioning scheme for a particular performance objective can be obtained. In the following subsections, different optimal partitioning schemes for different system objectives are derived by formulating constrained optimization problems.

3.2.2 Harmonic Weighted Speedup

Harmonic Weighted Speedup (H_sp) [10] is a metric that strikes a balance between throughput and fairness. It is defined as the following:

$H_{sp} = \frac{N}{\sum_{i=1}^{N} \frac{IPC_{alone,i}}{IPC_{shared,i}}} = \frac{N}{\sum_{i=1}^{N} \frac{APC_{alone,i}}{APC_{shared,i}}}$   (3.3)

The maximum H_sp expressed in Equation (3.3) with the constraint expressed in Equation (3.2) can be found by using the Lagrange multipliers method:

$\max H_{sp} = \frac{N \times B}{\left(\sum_{i=1}^{N} \sqrt{APC_{alone,i}}\right)^{2}}$   (3.4)

when

$APC_{shared,i} = B \cdot \frac{\sqrt{APC_{alone,i}}}{\sum_{j=1}^{N} \sqrt{APC_{alone,j}}}$   (3.5)

The fraction of the bandwidth share of application i (β_i) is proportional to its memory access frequency in shared mode (i.e., APC_shared,i). Therefore, the optimal memory bandwidth partitioning (i.e., the ratio of bandwidth shared) is:

$\frac{\beta_i}{\beta_j} = \frac{APC_{shared,i}}{APC_{shared,j}} = \frac{\sqrt{APC_{alone,i}}}{\sqrt{APC_{alone,j}}}$   (3.6)

This equation shows that the optimal bandwidth share of application i is proportional to the square root of its inherent memory access frequency (i.e., APC_alone,i). Hence, this partitioning scheme is referred to as Square_root. As can be seen from this optimal partitioning scheme, the optimal bandwidth partitioning for harmonic weighted speedup tends to slightly (but not overly) constrain applications with high miss frequencies, preventing them from dominating the bandwidth usage and starving applications with lower miss frequencies.

The weighted speedup expression of the Square_root partitioning scheme can also be derived, which is

$W_{sp}^{sqrt} = \frac{B}{N} \times \left(\sum_{i=1}^{N} \frac{1}{\sqrt{APC_{alone,i}}}\right)^{2}$   (3.7)

3.2.3 Fairness

A CMP system is fair if the speedups of equal-priority applications running together on the CMP system are the same. So, ideal fairness is achieved when:

$\frac{IPC_{shared,i}}{IPC_{alone,i}} = \frac{IPC_{shared,j}}{IPC_{alone,j}} \;\Rightarrow\; \frac{APC_{shared,i}}{APC_{alone,i}} = \frac{APC_{shared,j}}{APC_{alone,j}}$   (3.8)

To achieve this ideal fairness, we can have the optimal off-chip bandwidth partitioning:

$\frac{\beta_i}{\beta_j} = \frac{APC_{shared,i}}{APC_{shared,j}} = \frac{APC_{alone,i}}{APC_{alone,j}}$   (3.9)

This shows that the optimal bandwidth share of application i is proportional to its inherent memory access frequency (APC_alone,i). Hence, this partitioning scheme is referred to as Proportional partitioning.

Although the Proportional partitioning scheme is ideal for fairness, it is suboptimal for harmonic weighted speedup and weighted speedup compared to the Square_root partitioning scheme. The harmonic weighted speedup and weighted speedup of Proportional partitioning are the same, which are

$H_{sp}^{prop} = W_{sp}^{prop} = \frac{B}{\sum_{i=1}^{N} APC_{alone,i}}$   (3.10)

As can be seen from Equation (3.10), the harmonic weighted speedup and weighted speedup of the Proportional scheme are worse than those of the Square_root partitioning scheme (compared to Equations (3.4) and (3.7), respectively) according to Cauchy’s inequality. Compared to the Square_root scheme, the Proportional scheme tends to allocate more bandwidth resources to bandwidth-insensitive applications (i.e., applications with high API), which degrades the overall throughput. This also reflects that different partitioning schemes favor different optimizing objectives.
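As a concrete illustration of the two closed-form allocations derived above (Equations (3.5)-(3.6) and (3.9)), the sketch below computes the Square_root and Proportional bandwidth shares from measured stand-alone access frequencies. It is a hedged, illustrative sketch; the function names and the sample APC_alone and B values are assumptions, not values taken from the thesis.

```python
import math

# Illustrative sketch of the Square_root and Proportional allocations.
# apc_alone holds APC_alone,i per application; total_b is the utilized bandwidth B.

def square_root_shares(apc_alone):
    """Eq. (3.6): beta_i proportional to sqrt(APC_alone,i); optimizes harmonic weighted speedup."""
    roots = [math.sqrt(a) for a in apc_alone]
    return [r / sum(roots) for r in roots]

def proportional_shares(apc_alone):
    """Eq. (3.9): beta_i proportional to APC_alone,i; gives ideal fairness."""
    return [a / sum(apc_alone) for a in apc_alone]

def allocate(shares, total_b):
    """Turn fractional shares beta_i into APC_shared,i (Eq. 3.5 for the square-root shares)."""
    return [s * total_b for s in shares]

if __name__ == "__main__":
    apc_alone = [0.016, 0.004, 0.001]   # hypothetical stand-alone access frequencies
    B = 0.012                           # assumed total utilized bandwidth
    print(allocate(square_root_shares(apc_alone), B))   # shares in ratio 4:2:1
    print(allocate(proportional_shares(apc_alone), B))  # shares in ratio 16:4:1
```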
3.2.4 Weighted Speedup

Weighted Speedup [41] aims to measure the overall reduction in execution time by normalizing each application’s performance to its inherent IPC value (i.e., IPC_alone). The optimal partitioning scheme can be found by maximizing W_sp, expressed in Equation (3.11), subject to the total bandwidth constraint (i.e., Equation (3.2)):

$W_{sp} = \frac{1}{N} \sum_{i=1}^{N} \frac{IPC_{shared,i}}{IPC_{alone,i}} = \frac{1}{N} \sum_{i=1}^{N} \frac{APC_{shared,i}}{APC_{alone,i}}$   (3.11)

This optimization problem can be formulated as a fractional knapsack problem [42]. Take APC_shared,i as the quantity of item i. Note that APC_shared,i can be fractional. The value of each item is 1/(N × APC_alone,i). The maximum quantity that we can carry in the bag is B (i.e., the total off-chip bandwidth). The goal here is to maximize the sum of the values of the items (i.e., maximize W_sp) such that the total quantity does not exceed the knapsack’s capacity. The fractional knapsack problem can be solved by a greedy algorithm. The best scheme is to get as many items with higher value (i.e., 1/(N × APC_alone,i)) as possible; in other words, always give the application with lower APC_alone higher priority. Note that the maximum bandwidth application i can occupy (i.e., APC_shared,i) is bounded by APC_alone,i. This priority-based memory scheduling can be considered as a special form of partitioning, which first allocates off-chip memory bandwidth to the highest-priority application up to its maximum occupancy capacity, then allocates the remaining bandwidth to the application with the secondary priority, and so on. Obviously, this scheme causes starvation for applications with higher APC_alone, degrading fairness significantly. This partitioning scheme is referred to as Priority_APC since it prioritizes applications based on their APC_alone.

3.2.5 Sum of IPCs

When latency is less critical, the Sum of IPCs can be used to measure the overall system throughput. To get optimal performance, we can maximize Equation (3.12), subject to the constraint of Equation (3.2):

$IPC_{sum} = \sum_{i=1}^{N} IPC_{shared,i} = \sum_{i=1}^{N} \frac{APC_{shared,i}}{API_{i}}$   (3.12)

Similar to weighted speedup, this problem can also be formulated as a fractional knapsack problem and easily solved by a greedy algorithm. To achieve maximum IPC_sum, applications with lower APIs should have higher priority. For this priority-based scheduling, some applications will gain more benefits than others, which implies it has poor fairness properties. This partitioning scheme is referred to as Priority_API since it prioritizes the applications based on their API.
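The greedy fractional-knapsack allocation behind Priority_APC and Priority_API can be sketched as follows. This is an illustrative sketch under the stated assumptions (each application's share is capped at its APC_alone, and the sort key is APC_alone for Priority_APC or API for Priority_API); the code, names and sample numbers are not from the thesis.

```python
# Hedged sketch of the priority-based (greedy fractional-knapsack) allocation.
# apps: list of (name, apc_alone, api); total_b: utilized off-chip bandwidth B.

def priority_allocation(apps, total_b, key="apc"):
    """Fill bandwidth greedily, lowest APC_alone first (Priority_APC) or lowest API
    first (Priority_API). Each application is capped at its stand-alone demand APC_alone."""
    order = sorted(apps, key=lambda a: a[1] if key == "apc" else a[2])
    remaining = total_b
    alloc = {}
    for name, apc_alone, _api in order:
        grant = min(apc_alone, remaining)   # cannot use more bandwidth than it would alone
        alloc[name] = grant
        remaining -= grant
        if remaining <= 0:
            break
    return alloc

if __name__ == "__main__":
    apps = [("libquantum", 0.020, 0.030), ("gobmk", 0.002, 0.004)]  # hypothetical values
    print(priority_allocation(apps, total_b=0.010, key="apc"))
    # gobmk (low APC_alone) is served fully first; libquantum gets only what is left,
    # which is why strict prioritization can starve high-APC_alone applications.
```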
3.3 Evaluation

To evaluate the efficacy and versatility of the proposed model, different memory bandwidth partitioning schemes are compared under four system-level performance targets, including (1) sum of IPCs, (2) weighted speedup, (3) fairness, and (4) harmonic weighted speedup, on a four-core CMP system. The expressions for harmonic weighted speedup, weighted speedup, and sum of IPCs can be found in Equations (3.3), (3.11) and (3.12), respectively. The fairness is measured by minFairness [40], which is defined as in Equation (3.13). Table 3.2 summarizes the baseline system configuration.

$MinF = N \times \min_{1 \le i \le N} \left\{ \frac{IPC_{shared,i}}{IPC_{alone,i}} \right\}$   (3.13)

Table 3.2: Baseline system configuration.
Core | 5 GHz out-of-order processor; decode/issue/execute/retire up to 8 instructions; 192-entry reorder buffer
Front End | 16-bit BTB tag, 4K-entry BTB; tournament branch predictor
Caches | L1 I-cache/D-cache: 32KB, 2-way, 1 ns, 64B line; private unified L2: 256KB, 8-way, 5 ns, 64B line
DRAM | 200 MHz bus cycle, 8 GB DDR2-PC3200; close page policy; 8B-wide data bus; latency: 12.5-12.5-12.5 ns (tRP-tRCD-CL); address mapping: channel/row/col/bank/rank; 32 DRAM banks

In my experiments, each workload is a mix of four SPEC CPU 2006 benchmarks, as is shown in Table 3.3. The mixed workloads are divided into two categories: heterogeneous and homogeneous. In a heterogeneous workload, applications have very different memory intensity, while in a homogeneous workload, the memory intensities of the applications are similar. The heterogeneity is defined as the Relative Standard Deviation (RSD) of the APC_alone values of co-scheduled applications. A workload is heterogeneous if its heterogeneity is greater than 30. Otherwise, it is homogeneous.

Table 3.3: Workload Construction
workload | benchmark | heterogeneity (RSD)
homo-1 | libquantum-milc-soplex-hmmer | 12.27
homo-2 | libquantum-milc-soplex-omnetpp | 13.02
homo-3 | hmmer-gromacs-sphinx3-leslie3d | 18.55
homo-4 | hmmer-gromacs-bzip2-leslie3d | 19.16
homo-5 | h264ref-zeusmp-bzip2-gromacs | 19.74
homo-6 | h264ref-zeusmp-gobmk-gromacs | 24.06
homo-7 | h264ref-zeusmp-gobmk-bzip2 | 29.71
hetero-1 | milc-soplex-zeusmp-bzip2 | 41.93
hetero-2 | soplex-hmmer-gromacs-gobmk | 45.10
hetero-3 | libquantum-soplex-zeusmp-h264ref | 47.92
hetero-4 | lbm-soplex-h264ref-bzip2 | 50.31
hetero-5 | libquantum-milc-gromacs-gobmk | 52.99
hetero-6 | lbm-libquantum-gromacs-zeusmp | 58.31
hetero-7 | lbm-milc-gobmk-zeusmp | 69.84

The seven partitioning schemes compared in our experiments are listed as follows:

1. No_partitioning: This scheme does not manage off-chip memory bandwidth resources with partitioning. The memory controller serves all the memory requests based on a First Come First Served (FCFS) policy.

2. Equal: This scheme assigns an equal fraction of off-chip memory bandwidth to each individual application. This scheme is proposed in [11]. The fraction of the total off-chip memory bandwidth that application i is assigned (β_i) is β_i = 1/N.

3. 2/3_Power: In this partitioning scheme, the fraction of off-chip memory bandwidth (β_i) that application i is assigned is proportional to the two-thirds power of its inherent memory access frequency (i.e., APC_alone,i), which is β_i = APC_alone,i^(2/3) / ∑_{j=1}^{N} APC_alone,j^(2/3). This partitioning scheme is proposed in [28] as the best partitioning scheme for weighted speedup based on their queuing model.

4. Proportional: In this partitioning scheme, the fraction of off-chip memory bandwidth (β_i) that application i is assigned is proportional to its inherent memory access frequency (APC_alone,i). This is the best partitioning scheme for fairness. The fraction of the total off-chip bandwidth that application i should be assigned (β_i) is β_i = APC_alone,i / ∑_{j=1}^{N} APC_alone,j.

5. Square_root: In this partitioning scheme, the fraction of off-chip memory bandwidth (β_i) that is assigned to each application is proportional to the square root of its inherent memory access frequency (i.e., APC_alone). This is the optimal partitioning scheme for harmonic weighted speedup. The fraction of total off-chip bandwidth that application i is assigned (β_i) is β_i = √(APC_alone,i) / ∑_{j=1}^{N} √(APC_alone,j).
6. Priority_API: This scheme prioritizes the memory requests from applications with lower API over ones with higher API. This scheme is best for sum of IPCs.

7. Priority_APC: This scheme prioritizes the memory requests from applications with lower APC_alone over ones with higher APC_alone. This scheme is best for weighted speedup.

Figure 3.2 shows the performance comparison of six off-chip memory bandwidth partitioning schemes (i.e., Equal, Proportional, Square_root, 2/3_power, Priority_APC and Priority_API) in terms of four system-level performance objectives (i.e., harmonic weighted speedup, fairness, weighted speedup and sum of IPCs). All the performance results are normalized to No_partitioning. As can be seen from Figure 3.2, different partitioning schemes favor different system objectives.

Figure 3.2: Normalized performance relative to No_partitioning of (a) harmonic weighted speedup, (b) minimum fairness, (c) weighted speedup and (d) sum of IPCs with six partitioning schemes of Equal, Proportional, Square_root, 2/3_power, Priority_APC and Priority_API.

Generally, for heterogeneous workloads, due to the large variety in applications’ sensitivities to the off-chip memory bandwidth resource, the performance differences among different partitioning schemes are large, e.g., Priority_API has 50% more performance than Proportional in terms of sum of IPCs. The performance of homogeneous workloads is less diverse in terms of all performance metrics with different partitioning schemes because all partitioning schemes are similar. For example, if two applications have exactly the same inherent memory access frequency (APC_alone), there will be no difference among the Equal, Proportional, and Square_root partitioning schemes. In the rest of this section, without explicit mention, all the data is from the measurement of heterogeneous workloads.

For No_partitioning, high-API applications tend to occupy more off-chip bandwidth resources and starve out low-API applications. Since high-API applications have low sensitivity to the bandwidth, No_partitioning has poor overall throughput.

Equal partitioning has moderate performance improvements in harmonic weighted speedup (17.7%), weighted speedup (23.4%) and sum of IPCs (32.4%) over No_partitioning. These performance improvements come about because it increases the overall throughput by allocating more bandwidth to low-API applications.
It has relatively poor fairness since the speedups of high-API applications are degraded, which causes unbalanced speedups between high-API and low-API applications. Note that Equal partitioning is not the optimal partitioning scheme for any of the objectives that are evaluated.

Square_root partitioning yields the best performance (20.3%) in terms of harmonic weighted speedup, as is expected. It also has moderate performance improvements in terms of both fairness (26.7%) and throughput, e.g., sum of IPCs (16.2%) and weighted speedup (16.2%), which implies harmonic weighted speedup itself is a metric that balances both fairness and throughput.

Proportional partitioning is best for the fairness metric, as expected. It has the worst performance in terms of throughput-oriented metrics (i.e., IPC_sum and W_sp) since it does not favor low-API applications to achieve high IPC. Note that Proportional partitioning is different from No_partitioning, which implies the bandwidth that an application occupies naturally is not exactly proportional to its inherent memory access frequency (APC_alone).

The 2/3_power partitioning scheme partitions bandwidth in between Square_root and Proportional, which implies its performance is also between those two. For example, in terms of fairness, it is better than Square_root and worse than Proportional. In terms of harmonic weighted speedup, it is higher than Proportional but lower than Square_root. Although the 2/3_power scheme is expected to produce the highest weighted speedup in [28] based on their model, our experimental results show that this is not the case. 2/3_power only achieves 84.4% of Priority_APC in terms of weighted speedup. The main reason comes from different assumptions. In [28], the memory access frequency (denoted as MA in [28]) of an application is assumed to stay unchanged no matter how fast or slow an application runs, which leads to limited performance gain (W_sp) by always favoring applications with low APC. However, in our model, the invariant is API. The memory access frequency (APC_shared) will change based on the application execution speed (IPC_shared). When an application runs faster, its memory access frequency will become higher correspondingly. This matches reality more closely. While [28] presumes the optimum bandwidth partitioning for weighted speedup should slightly but not overly constrain applications with high miss frequencies, our model indicates that strictly prioritizing low-APC applications over high-APC ones (i.e., overly constraining applications with high miss frequencies) is the best scheme for weighted speedup. Although weighted speedup is proposed to overcome the unfairness caused by sum of IPCs for multi-programmed workloads [41], our results show that W_sp is intrinsically a throughput-oriented metric and is still not good enough for multi-programmed environments. Systems that are optimized for weighted speedup can still suffer from starvation. Compared to the model in [28], our conclusion is valid due to the use of more realistic assumptions.

As is expected, for priority-based partitioning, Priority_API and Priority_APC achieve the highest performance for sum of IPCs and weighted speedup, respectively. However, they yield very poor performance for fairness and harmonic weighted speedup since starvation happens. Due to the strict priority policy, memory requests from applications with high APC_alone or API may not get served at all.
Priority_API and Priority_APC achieve the same result for heterogeneous workloads because applications with higher API are applications with higher APC_alone. However, for homogeneous workloads, high-API applications may not be high-APC_alone applications; for example, hmmer has a higher APC_alone but lower API than leslie3d.

In summary, the proposed optimal partitioning schemes achieve better performance for their corresponding performance objectives. Among the rest of the partitioning schemes, the closer they are to the optimal, the better the results that can be achieved.

3.4 Summary

The goal of this work is to understand and optimize how off-chip bandwidth partitioning affects different system performance objectives. An analytical model is proposed to derive optimal memory bandwidth partitioning schemes for various system QoS objectives. Four optimal off-chip memory bandwidth partitioning schemes, namely Square_root, Proportional, Priority_APC, and Priority_API, are derived for four different system-level performance objectives, which are harmonic weighted speedup, fairness, weighted speedup, and sum of IPCs, respectively. Experimental results show that, for heterogeneous workloads, the performance improvements over No_partitioning / Equal_partitioning in terms of harmonic weighted speedup, minimum fairness, weighted speedup and sum of IPCs are 20.3%/2.1%, 49.8%/38.7%, 32.8%/7.6% and 64.2%/24%, on average, with our corresponding optimal partitioning schemes (i.e., Square_root, Proportional, Priority_APC, and Priority_API), respectively.

Chapter 4

High-Associativity Cache Partitioning

An ideal cache partitioning enforcement scheme should support (1) smooth resizing: a partition can be expanded or shrunk smoothly without incurring large overhead (i.e., no data flushing or migrating); (2) precise sizing: each partition should occupy the exact (i.e., no less and no more) amount of cache space allocated to it; and (3) high associativity: the associativity of a partition should not be reduced as the number of partitions increases.

Generally, there are two approaches to enforce the partitioning of a cache: partitioning by constraining cache line placement and partitioning by controlling cache line replacement. Placement-based partitioning schemes (e.g., page coloring [32]) are unable to resize smoothly due to the inherent resizing penalty caused by line migration. Recently, replacement-based partitioning schemes have been gaining increasing attention because they incur little resizing penalty and can scale to CMPs with a large number of fine-grain partitions. However, the existing replacement-based partitioning schemes either suffer diminishing cache associativity as the number of partitions increases (e.g., CQVP [7] and PriSM [19]) or cannot precisely partition the whole cache (e.g., Vantage [8]).

This chapter presents Futility Scaling (FS), a proposed replacement-based partitioning scheme that can precisely partition the whole cache while still maintaining high associativity even with a large number of partitions. It starts with a description of the associativity degradation problem in a replacement-based partitioning scheme. A Partitioning-First (PF) scheme is used as an example to show how seriously associativity can be degraded in a replacement-based partitioning scheme as the number of partitions increases.
Next, the basic idea of the proposed FS scheme and its implementation are presented, followed by evaluation and simulation results.

4.1 Partitioning-induced Associativity Loss

Generally, a cache consists of the following three components [43]:

• Cache Array: This implements associative lookups and provides a list of replacement candidates on each eviction.

• Futility Ranking: This maintains a strict total order of the uselessness of cache lines within each partition.

• Replacement Policy: This identifies the victim from the list of replacement candidates based on their futility and partitioning requirements.

The cache array could be a common set-associative cache, a skew-associative cache [44] or a zcache [45]. It provides a list of replacement candidates on each eviction. The futility of a cache line is used to assess how useless keeping this line in the cache would be. For a partitioned cache, the uselessness of cache lines within each partition is strictly ordered by a specific futility ranking scheme. For example, in the LRU, LFU and OPT [46] futility ranking schemes, cache lines are ranked by the time of their last accesses, their access frequencies, and the time to their next references, respectively. A cache line with a higher rank is less useful. To make the rest of the analysis independent of cache size, the futility of a cache line is defined as its rank normalized to [0, 1], i.e., for a cache line ranked in the rth place in a partition with a size of M lines, its futility is f = r/M, f ∈ [0, 1]. Given that the total number of cache lines is very large, the futility of cache lines is assumed to be continuously and uniformly distributed over the range of [0, 1] in the following analysis.

In order to compare associativity across different partitioning schemes regardless of cache organizations, futility ranking schemes, and actual workloads, the universal associativity concept described in [45] is used in this work. The cache associativity is considered as the ability of a cache to evict a useless line on a replacement, i.e., the ability to retain useful lines in the cache. The less useful the evicted lines are, the higher the associativity. Note that this new associativity concept is consistent with the conventional associativity definition for a set-associative cache (i.e., the number of ways). A set-associative cache with a larger number of ways has a better ability to evict lines that are less useful.

In a shared cache with replacement-based partitioning, a replacement policy needs to perform two functions: improve the associativity within a partition and maintain the size ratios among partitions. The first function requires the replacement policy to evict the least useful candidate possible, which is the same requirement as for the replacement policy in a non-partitioned context. The second function requires the replacement policy to prioritize cache replacement candidates from oversized partitions over those from undersized ones in victim selection, which is not addressed in a non-partitioned cache. These two criteria often conflict with each other.

Figure 4.1: Associativity and sizing dilemma in a replacement-based partitioning scheme

For a list of replacement candidates provided on an eviction, the least useful candidate may not come from an oversized partition.
For example, as shown in Figure 4.1, assume there is a cache with ten cache lines in total. The cache needs to be partitioned equally, i.e., each partition should have five cache lines. However, the current sizes of the two partitions are four and six, respectively. Assume an incoming line that belongs to the second partition invokes the replacement process, and the cache array provides two replacement candidates: one is the least useful line in the first partition, and the other is the most useful line in the second partition. In this scenario, a dilemma arises. On the one hand, from the perspective of improving associativity, the first candidate should be evicted since it is less useful. However, this would lead to the sizes of both partitions straying even farther away from their target sizes. On the other hand, from the perspective of enforcing the desired size ratio, the second candidate should be evicted since it belongs to the oversized partition. However, evicting the most useful cache line would inevitably hurt the associativity of the second partition. Therefore, how to enforce the partitioning while still largely maintaining the associativity of each partition becomes a challenging problem.

Algorithm 4.1: Partitioning-First Scheme
  Data: The ith candidate is from partition p_i and has a futility of f_i,
        1 ≤ i ≤ R, 1 ≤ p_i ≤ N, 0 ≤ f_i ≤ 1
  /* Step 1: Partition Selection (PS) */
  1  max_over ← −∞
  2  chosen_partition ← none
  3  for i ← 1 to R do
  4      if max_over < N_A_{p_i} − N_T_{p_i} then
  5          max_over ← N_A_{p_i} − N_T_{p_i}
  6          chosen_partition ← p_i
  /* Step 2: Victim Identification (VI) */
  7  max_futility ← −∞
  8  chosen_victim ← none
  9  for i ← 1 to R do
 10      if p_i = chosen_partition then
 11          if max_futility < f_i then
 12              max_futility ← f_i
 13              chosen_victim ← i
 14  return chosen_victim

To further illustrate the previously described associativity degradation problem, a partitioning-first scheme is analyzed. Algorithm 4.1 represents a Partitioning-First (PF) scheme, which always evicts lines from the partition that exceeds its target size the most. Some of the common terms used throughout this chapter are listed in Table 4.1.

Table 4.1: Notations used in this chapter
  N        Total number of partitions
  S_i      Fraction of cache space for Partition i, with the S_i summing to 1
  I_i      Fraction of total insertions from Partition i, with the I_i summing to 1
  E_i      Fraction of total evictions from Partition i, with the E_i summing to 1
  N_A_i    Actual number of cache lines of Partition i
  N_T_i    Target number of cache lines of Partition i
  R        Number of replacement candidates on an eviction

[Figure 4.2: Comparisons of the PF scheme in a cache with different numbers of partitions (N = 1, 2, 4, 8, 16, 32): (a) associativity CDF of the first partition for mcf, with average eviction futility (AEF) of 0.95, 0.82, 0.74, 0.66, 0.60 and 0.56 for N = 1, 2, 4, 8, 16 and 32, respectively, and the worst-case line F_WC(x) = x; (b) number of misses of the first partition (normalized to N = 1) for eight benchmarks (mcf, omnetpp, gromacs, h264ref, astar, cactusADM, libquantum, lbm); (c) IPCs of the first partition (normalized to N = 1) for the same benchmarks.]

The algorithm has two steps: Partition Selection (PS) and Victim Identification (VI).
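A minimal software sketch of these two steps is shown below (the structures and field names are illustrative, not the hardware implementation described in this chapter).

#include <climits>
#include <cstddef>
#include <vector>

struct Candidate {
    int    partition;   // partition p_i the candidate line belongs to
    double futility;    // normalized futility f_i in [0, 1]
};

// Sketch of Algorithm 4.1: Partitioning-First victim selection.
// actual[i] and target[i] hold N_A_i and N_T_i for each partition.
std::size_t pfSelectVictim(const std::vector<Candidate>& cands,
                           const std::vector<long>& actual,
                           const std::vector<long>& target) {
    // Step 1: Partition Selection (PS) -- among the candidates' partitions,
    // pick the one that exceeds its target size the most.
    long maxOver = LONG_MIN;
    int  chosenPartition = -1;
    for (const Candidate& c : cands) {
        long over = actual[c.partition] - target[c.partition];
        if (over > maxOver) { maxOver = over; chosenPartition = c.partition; }
    }
    // Step 2: Victim Identification (VI) -- among candidates from the chosen
    // partition, evict the one with the largest futility.
    std::size_t victim = 0;
    double maxFutility = -1.0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (cands[i].partition == chosenPartition &&
            cands[i].futility > maxFutility) {
            maxFutility = cands[i].futility;
            victim = i;
        }
    }
    return victim;   // index of the victim in the candidate list
}

Note how the VI loop only considers candidates from the chosen partition; this shrinking of the usable candidate list is the source of the associativity loss analyzed next.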
In the PS step, the algorithm chooses, among all the candidates' partitions, the partition whose current actual size (N_A_i) most exceeds its target size (N_T_i). Then, in the VI step, it evicts the line with the largest futility among the candidates belonging to the partition chosen by the PS step. In sum, the PF scheme first makes its best effort to keep the size of a partition as close to its target as possible in the PS step, and then, in the VI step, it aims to achieve the best possible associativity by evicting the most useless cache line from a reduced list of candidates.

The quality of associativity is measured by the Associativity Distribution (defined in [45]), which is the probability distribution of evicted lines' futility. A fully-associative cache always chooses to evict the line with the largest futility, f_evict^FA = 1.0. In general, for a partitioning scheme, the more skewed the distribution of a partition is towards f_evict = 1.0, the higher the associativity. Figure 4.2a shows cumulative associativity distributions of the first partition for the workloads constructed from the mcf benchmark. As can be seen from the figure, when there is only one partition, its associativity is high, i.e., its average eviction futility (AEF) reaches 0.95. However, as the number of partitions increases, its associativity decreases (i.e., the associativity CDF curve is skewed farther away from f_evict = 1.0). This is because, in the PF scheme, as the number of partitions increases, the list of replacement candidates available to the VI step is shortened by the PS step. As the number of replacement candidates decreases, the probability that this reduced candidate list contains a high-futility line becomes lower, which leads to a lower average eviction futility and, thus, worse associativity. In the worst case (N ≫ R), there is always only one cache line in the candidate list that belongs to the chosen partition. The VI step has no option other than to evict this line regardless of its futility (this is very similar to a direct-mapped cache, which can only provide one replacement candidate on an eviction). In such a case, the futility of evicted lines becomes random, and thus the associativity CDF curve becomes a straight diagonal line F_WC(x) = x (AEF = 0.5). This makes the futility ranking irrelevant, as cache lines with different futility have the same probability of being evicted. As shown in Figure 4.2a, when the number of partitions (N) reaches 32, the associativity CDF (AEF = 0.56) of the first partition is very close to the worst case.

Figures 4.2b and 4.2c show the number of misses and the IPCs, respectively, of the first application in each workload under the PF scheme. All the results are normalized to the results of the workloads with N = 1. As shown in the figure, due to the degradation of associativity as the number of partitions increases, the number of misses of each application increases and its IPC decreases correspondingly. Different applications have different sensitivities to associativity; e.g., associativity degradation has a negligible effect on lbm's performance since it has a very low rate of cache reuse, but mcf suffers a more than 37% increase in cache misses and a corresponding 24% drop in IPC as the number of partitions goes from 1 to 32. Hence, the PF scheme is not scalable to large-scale CMPs due to its associativity degradation.
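The trend in Figure 4.2a can also be reproduced with a toy Monte-Carlo sketch (illustrative only, not the evaluation methodology of this chapter): if the futilities of R candidates are uniform in [0, 1] and the PS step is approximated by forcing the victim to come from the partition of one arbitrary candidate, the average eviction futility falls from about R/(R+1) toward 0.5 as N grows.

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(1);
    std::uniform_real_distribution<double> fut(0.0, 1.0);
    const int R = 16, trials = 200000;       // R candidates per eviction
    for (int N : {1, 2, 4, 8, 16, 32}) {
        std::uniform_int_distribution<int> part(0, N - 1);
        double sum = 0.0;
        for (int t = 0; t < trials; ++t) {
            std::vector<int> p(R);
            std::vector<double> f(R);
            for (int i = 0; i < R; ++i) { p[i] = part(rng); f[i] = fut(rng); }
            int chosen = p[0];               // crude stand-in for the PS step
            double best = 0.0;               // best futility within that partition
            for (int i = 0; i < R; ++i)
                if (p[i] == chosen) best = std::max(best, f[i]);
            sum += best;                     // futility of the evicted line
        }
        std::printf("N=%2d  AEF ~= %.2f\n", N, sum / trials);
    }
}

The simplification of the PS step (picking the first candidate's partition instead of the most oversized one) only affects which partition is constrained, not how few candidates remain for the VI step, which is what drives the degradation.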
4.2 Futility Scaling

The basic idea of Futility Scaling (FS) is to control the size of each partition by scaling the futility of its cache lines. FS works as follows. Each partition has a scaling factor α_i. On each eviction, the futility of a replacement candidate (f_cand) belonging to Partition i is scaled by α_i, i.e., f_cand^scaled = f_cand × α_i, and the candidate with the largest scaled futility is evicted. By scaling the futility of the lines in a partition up or down, cache lines belonging to this partition are evaluated as less or more useful in the view of the whole cache. Since FS always chooses the least useful replacement candidate (i.e., the candidate with the largest scaled futility) to evict, the more useless lines a partition has, the fewer lines belonging to this partition will be kept in the cache. Therefore, by increasing or decreasing the scaling factor of a partition, FS can shrink or expand its size correspondingly. Based on the analysis from an analytical framework, the FS scheme exhibits the properties of both precise partitioning and high associativity [43]. Compared to the PF scheme, FS trades some small temporal size deviations for preserving high associativity. When facing the associativity and sizing dilemma described in Figure 4.1, FS may evict the least useful line in the first partition (as long as its scaled futility is larger than the scaled futility of the most useful line in the second partition) and thereby preserve the associativity. Although the actual size of each partition may temporarily deviate further from its target, FS is still able to keep each partition's actual size statistically close to its target.

Conceptually, the FS scheme is independent of the futility ranking scheme. To demonstrate its feasibility, a practical feedback-based FS design built on a simple coarse-grain timestamp-based LRU ranking scheme is presented in the following section.

4.3 Feedback-based Scaling Factor Adjustment

A coarse-grain timestamp-based LRU (proposed in [45]) is used as the base futility ranking scheme for each partition. Each partition has an 8-bit counter for its current timestamp. An incoming cache line is tagged with the current timestamp of its partition, and a partition's current timestamp counter is incremented by one every K accesses (K = 1/16 of this partition's size). In a coarse-grain timestamp-based LRU, a cache line whose timestamp has a longer distance from the current timestamp of its partition is less recently used. The futility of a cache line is therefore estimated by the distance between its timestamp and the current timestamp of its partition. The timestamp-based futility (f_ts) of a cache line belonging to Partition i and tagged with timestamp x is calculated as f_ts = (CurrentTS_i + 256 − x) mod 256, which is just an unsigned 8-bit subtraction in hardware.

Algorithm 4.2: Feedback-based scaling factor adjustment
  Data: N_E_i: number of evictions. N_I_i: number of insertions.
        N_A_i: actual number of cache lines occupied by Partition i.
        N_T_i: target number of cache lines assigned to Partition i.
        l: interval length. Δα: changing ratio.
  // For each partition
  1  if N_E_i ≥ l or N_I_i ≥ l then
  2      if N_I_i ≥ N_E_i and N_A_i > N_T_i then
  3          α_i ← α_i × Δα
  4      else if N_I_i ≤ N_E_i and N_A_i < N_T_i then
  5          α_i ← α_i ÷ Δα
  6      N_I_i ← 0; N_E_i ← 0

A feedback-based approach is used to adjust the scaling factor of each partition dynamically.
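For reference, Algorithm 4.2 amounts to a few counters and comparisons per partition; a compact software sketch of it is given below (field names are illustrative, not the register names of the hardware design).

#include <cstdint>

// Per-partition state for Algorithm 4.2 (illustrative field names).
struct PartitionState {
    uint32_t insertions = 0;   // N_I_i in the current interval
    uint32_t evictions  = 0;   // N_E_i in the current interval
    uint32_t actualSize = 0;   // N_A_i
    uint32_t targetSize = 0;   // N_T_i
    double   alpha      = 1.0; // scaling factor alpha_i
};

// Called after each insertion or eviction of the partition; l is the
// interval length and dAlpha the changing ratio (e.g., l = 16, dAlpha = 2).
void adjustScalingFactor(PartitionState& p, uint32_t l, double dAlpha) {
    if (p.insertions < l && p.evictions < l)
        return;                                   // interval not reached yet
    if (p.insertions >= p.evictions && p.actualSize > p.targetSize)
        p.alpha *= dAlpha;                        // growing while oversized
    else if (p.insertions <= p.evictions && p.actualSize < p.targetSize)
        p.alpha /= dAlpha;                        // shrinking while undersized
    p.insertions = 0;                             // end of interval: reset
    p.evictions  = 0;
}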
In general, enlarging or reducing the scaling factor of a partition will increase or decrease its eviction rate and thereby shrink or expand its size correspondingly. Algorithm 4.2 describes how the scaling factor of each partition is adjusted.

The scaling factor of each partition is adjusted every l insertions or evictions (whichever occurs first) for that partition, in the following fashion. For each partition, every time the insertion or eviction counter reaches l (i.e., the interval length), if the partition is oversized (i.e., its current actual size N_A_i is greater than its target size N_T_i, N_A_i > N_T_i) and has a tendency to grow (i.e., the number of insertions N_I_i is greater than or equal to the number of evictions N_E_i in the last interval, N_I_i ≥ N_E_i), the scaling factor α_i of this partition is scaled up by a factor of Δα. Similarly, if N_A_i < N_T_i and N_I_i ≤ N_E_i, the scaling factor α_i of Partition i is scaled down by Δα. At the end of each interval, both N_E_i and N_I_i are reset to zero. The interval length (l) is determined by both the number of insertions N_I_i and the number of evictions N_E_i of a partition so that the scaling factor adjustment process can respond promptly to both increases (i.e., N_I_i reaches l first) and decreases (i.e., N_E_i reaches l first) of a partition's size. By checking the tendency of the size change in the last interval (i.e., N_I_i ≤ N_E_i or N_I_i ≥ N_E_i), the FS controller can avoid over-scaling the line futility of a partition during the transient period of resizing; e.g., if a partition has a tendency to shrink, the FS controller stops increasing the scaling factor of this partition even if its current actual size is still above its target size. For hardware implementation efficiency, Δα is set to 2 so that the scaling factor is always a power of two and the multiplication of a timestamp-based futility by a scaling factor can be done with a simple bit-shift operation in hardware. The interval length (l) is set to 16 empirically in the evaluations.

4.3.1 Putting It All Together

The feedback-based FS scheme uses a coarse-grain timestamp-based LRU as the underlying futility ranking scheme, which incurs around 1.5% storage overhead relative to the total cache (as described in [8]). Besides the cost of implementing the coarse-grain timestamp-based LRU, the FS cache controller needs only five additional registers per partition: 16-bit ActualSize and TargetSize, 4-bit InsertionCounter and EvictionCounter, and a 3-bit ScalingShiftWidth. The TargetSize registers are set by an external allocation policy. How the rest of the registers are updated is explained in the following.

On a hit, the timestamp of the accessed cache line is updated to the current timestamp of its partition, obtained from the coarse-grain timestamp-based LRU procedure [8]. On a miss, the controller calculates the scaled futility for all the candidates, chooses the candidate with the largest scaled futility to evict, and inserts the incoming line.
• The timestamp-based futility for a cache line belonging to Partition i and tagged with timestamp x is calculated with one 8-bit hardware operation as f_ts = (CurrentTS_i + 256 − x) mod 256. The scaled futility (f_ts^scaled) of a candidate from Partition i is obtained by left-shifting f_ts by ScalingShiftWidth_i bits, i.e., f_ts^scaled = f_ts << ScalingShiftWidth_i. Then the candidate with the largest scaled futility is evicted. The EvictionCounter_i and ActualSize_i of the evicted line's partition are incremented and decremented by one, respectively.
• The incoming line is inserted with the current timestamp of its partition, obtained from the coarse-grain timestamp-based LRU procedure [8]. Both InsertionCounter_i and ActualSize_i of the incoming line's partition are incremented by one.

Additionally, to implement the scaling factor adjustment scheme described in Algorithm 4.2, the scaling factor of Partition i is adjusted when InsertionCounter_i or EvictionCounter_i wraps around to 0. If ActualSize_i > TargetSize_i and InsertionCounter_i = 0 (this means InsertionCounter_i reached l = 16 first and thus there were more insertions than evictions in the last interval), ScalingShiftWidth_i is incremented by one. Similarly, if ActualSize_i < TargetSize_i and EvictionCounter_i = 0, ScalingShiftWidth_i is decremented by one. Otherwise, ScalingShiftWidth_i stays the same. A ScalingShiftWidth register is a 3-bit saturating counter (i.e., its value ranges from 0 to 7) so that the timestamp-based futility f_ts can be scaled up by a factor of at most 2^7 = 128. After each adjustment, both InsertionCounter_i and EvictionCounter_i are reset to zero.

The FS controller only needs a few narrow adders and comparators to implement the required counter updates and comparisons. Operations on hits are all coarse-grain timestamp-based LRU updates, which are fairly simple and do not lengthen the critical path (as described in [8]). On misses, for a cache with R replacement candidates, the controller needs to calculate the timestamp-based futility of each candidate (i.e., R subtraction operations), scale them (i.e., R shift operations) and then identify the candidate with the largest scaled futility (i.e., R − 1 comparisons). So, in total, there are 3R − 1 simple operations that can easily be performed in parallel and pipelined in hardware. These replacement operations can be spread over multiple cycles since they are off the critical path.
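Putting the replacement path together in software form, the sketch below computes the timestamp-based futility of each candidate, scales it by its partition's shift width, and picks the victim; the structures and field names are illustrative assumptions, not the exact hardware datapath.

#include <cstddef>
#include <cstdint>
#include <vector>

struct CandidateLine {
    int     partition;   // owning partition
    uint8_t timestamp;   // coarse-grain timestamp tagged on the line
};

struct PartitionRegs {
    uint8_t currentTS         = 0;   // current coarse-grain timestamp
    uint8_t scalingShiftWidth = 0;   // 3-bit saturating counter (0..7)
};

// On a miss: evict the candidate with the largest scaled futility.
std::size_t fsSelectVictim(const std::vector<CandidateLine>& cands,
                           const std::vector<PartitionRegs>& parts) {
    std::size_t victim = 0;
    unsigned best = 0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        const PartitionRegs& p = parts[cands[i].partition];
        // f_ts = (CurrentTS_i + 256 - x) mod 256: an unsigned 8-bit subtraction.
        unsigned fts = static_cast<uint8_t>(p.currentTS - cands[i].timestamp);
        // Scaled futility: left shift by ScalingShiftWidth_i (power-of-two alpha).
        unsigned scaled = fts << p.scalingShiftWidth;
        if (scaled >= best) { best = scaled; victim = i; }
    }
    return victim;
}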
4.4 Evaluation

Table 4.2: System Configuration
  Cores             2 GHz in-order, x86-64 ISA, 32 cores
  L1 Caches         split I/D, private, 32KB, 4-way set associative,
                    1-cycle latency, 64B line
  L2 Cache          16-way set associative, non-inclusive, unified, shared,
                    8-cycle access latency, 64B line,
                    8 MB NUCA, 4 banks, 4-cycle average L1-to-L2 latency
  Futility Ranking  coarse-grain timestamp-based LRU / OPT
  MCU               200 cycles zero-load latency, 32 GB/s peak memory BW

The following five cache partitioning schemes, each with both the coarse-grain timestamp-based LRU and the ideal OPT ranking schemes, are evaluated.
1. PF: This scheme first chooses the most oversized partition among all the candidates' partitions and then evicts the candidate with the largest futility from the chosen partition, as described in Algorithm 4.1.
2. PriSM [19]: This scheme first selects a partition in accordance with a pre-computed eviction probability distribution and then evicts the least useful replacement candidate belonging to the selected partition.
3. Vantage [8]: This scheme controls the size of each partition by adjusting its "aperture". In this experiment, Vantage is configured in the same way as in [8] on a 16-way set-associative cache, i.e., with an unmanaged region u = 10%, a maximum aperture A_max = 0.5 and slack = 0.1.
4. FullAssoc: This scheme is the PF scheme on a fully-associative cache. It always evicts the least useful cache line from the partition that exceeds its target size the most. The FullAssoc scheme is an ideal partitioning scheme that provides exact partitioning and full associativity for each partition.
5. FS: This is the proposed scheme, which controls the size of each partition by scaling its line futility.

[Figure 4.3: Comparisons among the five partitioning schemes (with LRU and OPT ranking) across 11 workload mixes that require different cache space guarantees: (a) average occupancy (number of lines) of subject threads, (b) average eviction futility of subject threads, and (c) average speedup of subject threads, each plotted against the number of subject threads N_subject.]

The associativity and sizing properties of these schemes are compared on a QoS-enabled CMP that provides cache space guarantees for 32 concurrently executing threads. The detailed system configuration is shown in Table 4.2. Each workload mix running on this system is constructed from two types of application threads: subject threads that require cache space guarantees and background threads that have no QoS requirement. In the experiment, each subject thread runs an associativity-sensitive benchmark, gromacs, while each background thread runs a memory-intensive benchmark, lbm. Note that lbm has a much higher miss rate than gromacs and would occupy the cache space much more aggressively if resource sharing were unregulated. The system allocation policy assigns 256KB of cache capacity (i.e., 4096 cache lines) to each subject thread and divides the rest of the cache capacity equally among the background threads. To evaluate the efficacy of this proposal, 11 workload mixes are generated by varying the number of subject threads (N_subject) in each mix (i.e., N_subject ranges from 1 to 31 in increments of 3); accordingly, the number of background threads in each mix is 32 − N_subject. Vantage is not evaluated under the workload with 31 subject threads (requiring 31/32 ≈ 97% of total cache space) as it can only manage 90% of the cache space.

Figure 4.3a compares the average occupancy of subject threads under the different partitioning schemes in each workload mix. As shown in the figure, FullAssoc, PF and FS can keep the cache occupancy of each subject thread very close to its target size. In Vantage, cache space is divided into a managed region and an unmanaged region. Vantage can provide strong isolation for partitions in the managed region if all the evictions come only from the unmanaged region. With strong isolation, Vantage can always over-provision cache space to a thread, i.e., guarantee a thread the exact amount of resources assigned by the allocation policy in the managed region while allowing the thread to borrow more space from the unmanaged region.
However, on a 16-way set-associative cache, a 10% unmanaged region (u = 0.1) is not large enough for Vantage to provide strong isolation, i.e., there is an 18.5% probability (P_ev = (1 − u)^16 ≈ 0.185 [8]) that a line in the managed region is forced to be evicted. Due to the relatively weak isolation, Vantage cannot enforce the sizes of the partitions strictly above their targets. As can be seen in the figure, although the average occupancy of partitions in Vantage is relatively close to their targets, it can be as much as 3% below the target. Note that Vantage could provide a higher degree of isolation on a cache that provides more replacement candidates (e.g., the Z4/52 zcache [8]). In PriSM, the replacement process has two steps: (1) Partition-Selection: choose a partition according to the pre-calculated eviction probability distribution, and then (2) Victim-Identification: choose the victim that belongs to the partition selected by the Partition-Selection step. However, there is a possibility of an "abnormality" in which no replacement candidate belongs to the partition identified by the Partition-Selection step. PriSM properly enforces the desired partition sizes only when this "abnormality" is rare. In this experiment, however, due to the large number of partitions (N = 32) and the small number of replacement candidates (R = 16), the probability of this "abnormality" becomes very high (more than 70%), and consequently PriSM loses the ability to properly enforce the desired partitioning. As shown in the figure, in PriSM the average occupancy of subject threads is, on average, 20.9% and 9.9% below the target with LRU and OPT, respectively, across all the workloads.

Figure 4.3b compares the average eviction futility (AEF) of subject threads under the different partitioning schemes in different workloads. As expected, FullAssoc always maintains full associativity (i.e., AEF = 1). PF suffers from severe associativity degradation, and its lowest AEF is less than 0.51 across all the workloads. FS provides consistently high associativity, i.e., its AEF is, on average, 0.86/0.84 with LRU/OPT across all the workloads. In Vantage, the lines in the managed region have smaller futility than those in the unmanaged region. Due to the high probability of forced evictions from the managed region (i.e., being forced to evict a line with a small futility), the associativity of the partitions in Vantage is slightly degraded, i.e., its AEF is, on average, 0.80/0.79 with LRU/OPT. PriSM partitions a cache in a similar way to PF, i.e., first choosing a partition (Partition-Selection) and then evicting from the selected partition (Victim-Identification). Therefore, PriSM would be expected to suffer from associativity degradation in the same way as PF when the list of replacement candidates is shortened in the Victim-Identification step. However, as shown in the figure, PriSM achieves better associativity than PF (i.e., its AEFs are above 0.73). This is due to the high occurrence of the "abnormality", in which case PriSM chooses a candidate with large futility regardless of partitioning requirements and thus improves associativity.

Figure 4.3c compares the average speedups (IPC_share / IPC_alone) of subject threads in each workload. IPC_alone refers to the IPC achieved when a subject thread runs on a 256KB private 16-way set-associative cache with LRU ranking.
FullAssoc always achieves the best performance across the different mixes, i.e., it achieves average speedups of 1.0 and 1.05 with LRU and OPT, respectively. Due to the severe associativity degradation, the average speedups with the PF scheme are only 0.90 and 0.91 with LRU and OPT, respectively. In Vantage, the average speedups of subject threads reach 0.95 and 1.0 with LRU and OPT, respectively. Because of the high occurrence of the "abnormality" when the number of partitions is large, PriSM generally has little control over cache resources, and thus its average speedups are only 0.90 and 0.98 with LRU and OPT, respectively. Owing to its properties of high associativity and precise sizing, FS has consistently high speedups of, on average, 0.995 and 1.036 with LRU and OPT, respectively, across all workload mixes.

4.5 Summary

In this chapter, the associativity degradation problem in replacement-based partitioning schemes for large-scale CMPs is identified. Then, a novel replacement-based partitioning scheme named Futility Scaling (FS) is presented. The proposed FS can precisely partition the whole cache while still maintaining high associativity even with a large number of partitions. Simulation results show that, due to its properties of both precise sizing and high associativity, FS provides significant performance improvement over prior art.

Chapter 5
Predictable Cache Protection Policy

Cache protection policies are often adopted in modern CMPs to achieve high performance. However, existing cache protection policies [21, 22] are designed empirically and thus lack predictability. In this chapter, a performance model for the cache protection technique is proposed. The model provides insight into the relationship between the performance of a cache protection policy and the reuse streak pattern of an application. In order to predict the performance of a cache protection policy at runtime, a low-cost profiler that estimates the average streak length is designed. With both the model and the information about the average streak length, a practical cache protection policy with predictable performance is proposed.

5.1 Background and Motivation

5.1.1 Predictability of Cache Replacement Policy

A cache replacement policy is predictable if the miss rate curve it yields for a single access stream can be estimated at reasonably low cost (e.g., <1% chip-area overhead) [47]. To increase the utilization of limited cache resources in modern CMPs, last-level caches are often shared among multiple concurrent threads. For efficient shared cache management, dynamic cache partitioning schemes are often used to properly divide the cache resource among individual threads in order to achieve a certain system-level objective (e.g., maximizing throughput, ensuring fairness or satisfying SLA/QoS requirements) [6, 43, 8].

[Figure 5.1: Predicted miss rate points/curves of cactusADM (from the SPEC CPU2006 benchmark suite) with three predictable cache replacement policies: the predicted LRU points and conservative LRU curve, the extreme points bounding the Talus+LRU convex curve, and the "knee" point with the Talus+LRU+"knee" curve.]
In this context, a cache replacement policy that provides a predictable miss rate is highly desirable, since miss rates allow a cache partitioning algorithm to infer the impact of cache resources on individual thread performance efficiently (i.e., without trial and error) and then make intelligent partitioning decisions.

Predicting a miss rate curve usually consists of two steps: (1) predicting a set of points (pairs of miss rate and cache size), and (2) forming a curve based on the predicted points. Figure 5.1 shows predicted miss rate curves of cactusADM with three predictable cache replacement policies. By using utility monitors (UMONs) [6], the miss rate of cactusADM at some cache sizes (i.e., a set of points) under LRU can be estimated in practice (the blue circles in Figure 5.1). There are two ways to form the curve based on those points. One is simply to assume the miss rate curve monotonically decreases as cache size increases, so that the LRU miss rate curve can be estimated conservatively as the green dashed line in Figure 5.1, i.e., the miss rate does not decrease until the next predicted point. The other is to adopt the Talus technique [47], which can yield a miss rate curve that traces out the convex hull of the given set of points. For example, given a set of predicted points provided by the UMONs (e.g., the blue circles in Figure 5.1), Talus can yield a convex miss rate curve (shown as the red dashed line) by properly distributing a single cache access stream into two pre-sized shadow partitions. Notice that the convex miss rate curve yielded by Talus is bounded by the predicted points provided by the underlying replacement policy, e.g., the red dashed line is bounded by the three blue extreme points in Figure 5.1.

Although it is predictable, LRU usually yields inferior single-thread performance for the LLC due to the lack of temporal locality [21, 22]. Modern cache protection policies (e.g., [21, 22]) often perform better than LRU but lack predictability. If we can predict their miss rates for certain cache sizes, i.e., the purple square in Figure 5.1, we can enforce a better predictable convex miss rate curve by applying Talus to it (e.g., the yellow dashed line). In this chapter, a performance model for a cache protection policy is built. Based on the model, some critical points ("knee points") on the miss rate curve of a cache protection policy can be predicted. With the Talus technique, a protection policy with predicted performance is proposed.

5.1.2 Cache Protection Policies

When the working set is too large to fit into the cache entirely, cache protection policies are often used to protect part of, rather than the whole, working set in order to avoid cache thrashing. Generally, there are two flavors of cache protection policies. (1) Insertion-based policies control the insertion ratio (ρ), i.e., what fraction of incoming lines is protected. An example is the Bimodal Insertion Policy (BIP [21]), which inserts 1/32 of incoming cache lines (ρ = 1/32) into the MRU position for protection while the rest of the incoming lines are inserted into the LRU position for quick eviction. (2) Protecting-distance-based policies control the protecting distance (d_p), i.e., how long existing lines are protected. An example is the Protecting Distance based Policy (PDP [22]), in which an inserted/reused line is protected for d_p accesses until its next reuse/eviction, and an incoming line bypasses the cache if no unprotected candidates are available. Notice that the insertion ratio ρ and the protecting distance d_p are closely related in the context of limited cache capacity.
Notice that insertion ratio ρ and protecting distance CHAPTER 5. PREDICTABLE CACHE PROTECTION POLICY 44 d p are closely related in the context of limited cache capacity. For example, in a cache with a fixed size, including more incoming lines for protection in BIP (i.e., larger ρ) will reduce the duration for each line to be protected in the cache (i.e., smallerd p ), and similarly, increasing protecting distance (larger d p ) in PDP will reduce the chances of an incoming line to be inserted into the cache (smallerρ). Therefore, in order to predict the performance of a cache protection policy, a parameterized model with both ρ and d p is needed. In this chapter, a model with the input of an insertion ratio and a protecting distance is proposed to predict the performance of cache protection policies. The model also requires cache access reuse streak information. Based on the model, we can predict the hit rate and required cache size for protected lines with a given insertion ratio and protecting distance. The performance of cache protection policies are determined by how efficiently it can re- tain useful lines and discard useless lines. Generally, cache protection policies protect reused lines instead of new incoming lines. The reason is based on reuse locality – a reused line is more likely to be reused again. The consequence of this behavior is that a line with a long series of consecutive reuses (a.k.a., reuse streak) is easier to occupy the cache than a line with fewer reuses (see next section for details). In other words, the performance of cache protection policies is significantly influenced by the reuse streak pattern of an application stream. Therefore, in order to predict the cache performance accurately, we need to charac- terize the reuse streak pattern behavior instead of just reuse distance pattern, and integrate it into the model. In this chapter, we introduce a more refined way (compared to reuse distance pattern) to characterize application behavior. Based on this information, a model to predict the miss rate of a cache policy with given ρ, d p and cache sizes is built. Profiling the full reuse streak pattern is very costly. To make it more practical, we show how to profile average streak length with different d p efficiently, at run time, and predict the “knee” point with the average reuse streak information. With “knee” points and the previously proposed Talus technique, a practical cache protection policy with predicted performance is proposed. CHAPTER 5. PREDICTABLE CACHE PROTECTION POLICY 45 5.2 Model 5.2.1 Overview A cache protection policy is modeled as follows: (1) A cache line will be retained in the cache at least d p accesses from its last reference. We say a line is “protected” if the number of accesses from its last reference is no greater than d p (i.e., the protecting distance). After that, it will become “unprotected”. (2) On a miss, an incoming line will be inserted into the cache at a probability ofρ (i.e., the insertion ratio) and correspondingly bypass the cache at the probability of 1−ρ. (3) We assume the unprotected lines in the cache will not be hit. This is generally valid since a real cache replacement policy will evict those unprotected lines very quickly. Given the protecting distance d p , insertion ratio (ρ) and reuse streak pattern (which will be described in the next subsection), the model will calculate the required cache size (s) for protected lines and the corresponding hit rate (h). 
The notations used in this section are summarized in Table 5.1.

Table 5.1: Notations used in this section.
  L(d_p)            Average reuse streak length when the protecting distance is d_p
  D(d_p)            Average reuse distance when the protecting distance is d_p
  H_max(d_p)        Maximum hit rate when the protecting distance is d_p
  ρ                 Insertion ratio
  d_p               Protecting distance
  N_streak(l, d_p)  Number of d_p-protected reuse streaks whose length is l

5.2.2 Reuse Streak Concept

To model a cache protection policy, we introduce the concept of protected reuses. We say a cache access is a d_p-protected reuse if its reuse distance is no greater than d_p. Correspondingly, an access is a d_p-protected non-reuse if its reuse distance is greater than d_p. A d_p-protected reuse results in a hit under a cache policy with protecting distance d_p if the cache line is already in the cache. A d_p-protected reuse streak is a longest possible sequence of consecutive d_p-protected reuses. Figure 5.2 shows an example of the reuse streak pattern of a cache access stream. Let A_i denote the access to address A at time i. From the figure, we can see that the reuse streak pattern can change with different protecting distances. When d_p = 1, there are only two 1-protected reuse streaks, A_2 and A_5, both of which have length 1. When d_p = 2, there is only one 2-protected reuse streak, A_2 A_4 A_5, of length 3. When d_p = 3, one more reuse streak is formed, i.e., B_6. Note that, as d_p increases, more accesses become reuses, and the number of reuse streaks may increase (e.g., new reuses may become new reuse streaks), decrease (e.g., a new reuse may concatenate two existing short reuse streaks into one long streak) or stay the same. Therefore, as d_p changes, the average reuse streak length (i.e., number of reuses / number of reuse streaks) can fluctuate.

[Figure 5.2: An example of reuse streak patterns for the access stream A A B A A B at times 1 through 6. The length of a line segment represents a reuse distance and the number of dots represents a reuse streak length. For d_p = 1, N_streak(1, 1) = 2 and the average streak length is 2/2 = 1; for d_p = 2, N_streak(3, 2) = 1 and the average is 3/1 = 3; for d_p = 3, N_streak(1, 3) = 1 and N_streak(3, 3) = 1, giving an average of 4/2 = 2.]

Figure 5.3 shows the average reuse streak length of six benchmarks (astar, cactusADM, gcc, hmmer, mcf, sphinx3) over different d_p. From the figure, we can see that different applications have different reuse streak patterns. For example, cactusADM has an average reuse streak length of over 60 around d_p = 2^17, of 1.6 around d_p = 2^16, and of 5.7 around d_p = 2^18. The average reuse streak length can be profiled at low cost (see Section 5.3).

[Figure 5.3: Average reuse streak length of six benchmarks with different protecting distances.]
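To make the streak definition concrete, the short routine below (an illustrative sketch, not part of the proposed hardware) recounts the d_p-protected reuse streaks of the toy trace in Figure 5.2 and reproduces the N_streak values listed there.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// An access is a d_p-protected reuse if its reuse distance is <= d_p;
// a streak is a maximal run of consecutive protected reuses to one address.
int main() {
    std::vector<std::string> trace = {"A", "A", "B", "A", "A", "B"};
    for (int dp : {1, 2, 3}) {
        std::map<std::string, int> lastTime;   // last access time per address
        std::map<std::string, int> runLen;     // current streak length per address
        std::map<int, int> nStreak;            // streak length -> count
        int reuses = 0;
        for (int t = 1; t <= (int)trace.size(); ++t) {
            const std::string& a = trace[t - 1];
            bool protectedReuse = lastTime.count(a) && (t - lastTime[a]) <= dp;
            if (protectedReuse) { ++runLen[a]; ++reuses; }
            else if (runLen[a] > 0) { ++nStreak[runLen[a]]; runLen[a] = 0; }
            lastTime[a] = t;
        }
        for (auto& kv : runLen)                 // close streaks still open
            if (kv.second > 0) ++nStreak[kv.second];
        std::printf("d_p=%d:", dp);
        for (auto& kv : nStreak)
            std::printf("  N_streak(%d,%d)=%d", kv.first, dp, kv.second);
        std::printf("  (total reuses=%d)\n", reuses);
    }
}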
5.2.3 Performance Model

For the rest of this section, we assume the protecting distance is always d_p, and abbreviate N_streak(l, d_p), H_max(d_p), L(d_p) and D(d_p) as N_streak(l), H_max, L and D. Assume the insertions of incoming lines are independent and that the probability that an incoming line is inserted into the cache is ρ. The number of failures N_failures before a cache line is successfully inserted into the cache follows a finite (truncated) geometric distribution, and its mean value is E(N_failures) = (1 − ρ)(1 − (1 − ρ)^l)/ρ. After a line is successfully inserted into the cache, all the remaining accesses in the reuse streak result in cache hits. Therefore, the average number of hits of a reuse streak of length l at an insertion ratio of ρ, h_streak(l, ρ), can be expressed as follows:

    h_streak(l, ρ) = l − E(N_failures) = l − (1 − ρ)(1 − (1 − ρ)^l)/ρ        (5.1)
                   ⪆ l + 1 − 1/ρ                                             (5.2)

Note that when l → ∞, the number of failures N_failures approximately follows an infinite geometric distribution and h_streak(l, ρ) can be expressed as Equation (5.2). We refer to Equation (5.1) and Equation (5.2) as the precise and the approximate model, respectively.

[Figure 5.4: Hit rate of a single streak with different lengths, h_streak(l)/l, at an insertion ratio (ρ) of 1/32, for the precise and the approximate model.]

Figure 5.4 shows the hit rate of a single streak with different lengths at an insertion ratio of 1/32. From the figure, we can see that, when l is large (e.g., > 80), the hit rate predicted by the approximate model is very close to the one predicted by the precise model. Also, as the streak length l increases, the hit rate increases. When l > 80, even with a small ρ of 1/32, more than 60% of the accesses result in cache hits, which suggests that a longer streak has a stronger capability to occupy the cache under a cache protection policy, whereas short streaks (or even non-reuses) are more easily "repelled" from the cache. We refer to this effect as the streak effect. In other words, the cache protection policy serves as a "filter" that allows long reuse streaks to occupy the cache while blocking short reuse streaks (and non-reuses) from occupying it. The insertion ratio ρ controls the strength of the "filtering", i.e., a larger ρ diminishes the effect, and when ρ = 1 there is no "filtering" at all.

Assume that the information about the reuse streak pattern of an application, i.e., the number of reuse streaks of each length, N_streak(l), is given. We can then calculate the hit rate of a cache access stream as follows:

    h(ρ) = total hits / total accesses
         = ( Σ_{l=1}^{∞} N_streak(l) × h_streak(l) ) / total accesses        (5.3)
         ⪆ H_max (1 + 1/L − 1/(ρL)) = H_max − (H_max/L) × (1 − ρ)/ρ          (5.4)
           (substituting with Eq. (5.2))

where H_max = total reuses / total accesses, total reuses = Σ_{l=1}^{∞} N_streak(l) × l, L = total reuses / total streaks, and total streaks = Σ_{l=1}^{∞} N_streak(l).

[Figure 5.5: Hit rate over insertion ratio for cactusADM under a static cache protection policy with different protecting distances: (a) d_p = 2^17, L(2^17) = 60.5; (b) d_p = 2^16, L(2^16) = 1.6; (c) d_p = 2^18, L(2^18) = 5.7. Each plot shows the precise model, the approximate model, simulation results and a linear reference line.]
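Equations (5.1) through (5.4) translate directly into a few lines of code; the sketch below (with illustrative inputs, not measured values) evaluates both the precise model, which needs the full streak histogram N_streak(l), and the approximate model, which needs only H_max and L.

#include <cmath>
#include <cstdio>
#include <map>

// Eq. (5.1): expected hits from one reuse streak of length l (precise).
double hStreakPrecise(int l, double rho) {
    return l - (1.0 - rho) * (1.0 - std::pow(1.0 - rho, l)) / rho;
}
// Eq. (5.2): the large-l approximation.
double hStreakApprox(int l, double rho) { return l + 1.0 - 1.0 / rho; }

// Eq. (5.3): precise hit rate from the full streak histogram N_streak[l].
double hitRatePrecise(const std::map<int, long>& nStreak,
                      long totalAccesses, double rho) {
    double hits = 0.0;
    for (const auto& kv : nStreak)
        hits += kv.second * hStreakPrecise(kv.first, rho);
    return hits / totalAccesses;
}

// Eq. (5.4): approximate hit rate from only H_max and the average length L.
double hitRateApprox(double hMax, double L, double rho) {
    return hMax - (hMax / L) * (1.0 - rho) / rho;
}

int main() {
    // Illustrative numbers only, not measured values from this chapter.
    double hMax = 0.5, L = 60.0, rho = 1.0 / 32.0;
    std::printf("approximate h(rho) = %.3f\n", hitRateApprox(hMax, L, rho));
}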
Equation (5.4) shows that when a cache access stream contains long reuse streaks (i.e., l is large), the hit rate can be estimated approximately from only the average reuse streak length (L) rather than the full reuse streak pattern. Figure 5.5 shows the hit rate of cactusADM over the insertion ratio for the precise model, the approximate model, and simulation results under a static cache protection policy with d_p = 2^16, 2^17, 2^18. From the figure, we can see that when L is large (e.g., > 60), the hit rate curves predicted by the approximate and precise models are close, and both match the simulation results well. When L is small, the curves skew towards the linear line on which the hit rate is proportional to the insertion ratio (this can be deduced from Eq. (5.1)), and the differences between the approximate and precise models become large.

In general, the average cache size can be estimated as the quotient of the lifetime of all lines and the total number of accesses [48]. For a cache protection policy, if a cache line is inserted into the cache, its lifetime includes zero or more hits and one eviction; the lifetime of a line that bypasses the cache is zero. The lifetime of all lines is the sum of the lifetime of all hits (total hits × D) and the lifetime of all evictions (total evictions × d_p), where the lifetime of a hit is its reuse distance and the lifetime of an eviction is d_p. Note that the total number of evictions is equal to the total number of insertions, which is ρ × total misses. Therefore,

    s(ρ) = lifetime of all lines / total accesses
         = (total hits × D + total evictions × d_p) / total accesses
         = (total hits / total accesses) × D + (total insertions / total accesses) × d_p
         = h(ρ) D + ρ (1 − h(ρ)) d_p                                         (5.5)

In summary, the combination of Equation (5.4) and Equation (5.5) is the full model of a cache protection policy. It takes as inputs the cache policy's parameters, namely the insertion ratio ρ and the protecting distance d_p, and the characteristics of the cache access stream, namely the average reuse streak length L, the average reuse distance D and the maximum hit rate H_max.

[Figure 5.6: Hit rate curves of cactusADM under a static cache protection policy with different protecting distances ((a) d_p = 2^17, (b) d_p = 2^16, (c) d_p = 2^18), comparing the precise model, the approximate model and simulation, with the max point marked.]

Figure 5.6 shows the modeled and simulated hit rate curves of cactusADM under a static cache protection policy with d_p = 2^16, 2^17, 2^18. From the figure, we can see that, when d_p = 2^17 and the average streak length is large (> 60), both the approximate (green line) and the precise (blue line) model give reasonably good predictions, i.e., the predicted hit rate curves are close to the simulated results.
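Pairing Eq. (5.5) with the approximate hit rate of Eq. (5.4) yields the (size, hit rate) pairs that trace out curves like those in Figure 5.6; a small sketch with illustrative (not measured) inputs is shown below.

#include <cstdio>

// Eq. (5.4): approximate hit rate from H_max and L.
double hitRateApprox(double hMax, double L, double rho) {
    return hMax - (hMax / L) * (1.0 - rho) / rho;
}
// Eq. (5.5): average number of resident lines, s = h*D + rho*(1-h)*d_p
// (D and d_p are measured in accesses).
double requiredSize(double h, double D, double rho, double dp) {
    return h * D + rho * (1.0 - h) * dp;
}

int main() {
    // Illustrative inputs only (not measured values from this chapter).
    double hMax = 0.5, L = 60.0, D = 40000.0, dp = 131072.0;   // d_p = 2^17
    for (double rho = 1.0 / 64; rho <= 1.0; rho *= 2) {
        double h = hitRateApprox(hMax, L, rho);
        if (h < 0.0) h = 0.0;       // the model is only meaningful for h >= 0
        std::printf("rho=%-7.4f  h=%.3f  s=%.0f lines\n",
                    rho, h, requiredSize(h, D, rho, dp));
    }
}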
[Figure 5.7: An illustration of the "knee point" for cactusADM with d_p = 2^17, showing the precise model, the approximate model, simulation results, the knee point, the linear reference line to the max point, the reference point (H_max × D, H_max) and the asymptotic line for L → ∞.]

5.2.4 Knee point

The "knee" point is the point on the approximate hit rate curve that has the maximum distance from the linear reference line, i.e., the line that connects (0, 0) and the "max point". The gradient of the approximate hit rate curve at the "knee point" is equal to the slope of the linear reference line. Therefore, we can derive the "knee point" from the following:

    ∂h/∂s = (∂h/∂ρ) / (∂s/∂ρ) = H_max / S_max = H_max / (H_max D + (1 − H_max) d_p)    (5.6)

By substituting Eq. (5.4) and Eq. (5.5) into Eq. (5.6), we can derive the following:

    ρ_knee = 1 / sqrt( L − H_max / (L (1 − H_max)) ) ≈ 1 / sqrt(L)                     (5.7)

Figure 5.7 shows the "knee" point of cactusADM under a static cache protection policy with d_p = 2^17. From the figure, we can see that the actual hit rate curve can be approximated by the curve constructed from the three points (0, 0), "knee" and "max" (the dotted blue line).

The efficiency of a cache protection policy is determined by two aspects: (1) how efficiently it can repel non-reused lines, and (2) how efficiently it can retain reused lines. Due to the streak effect, when the insertion ratio ρ is small, most non-reuses are excluded from the cache. If an application has many long reuse streaks (L is large), then the most-reused lines are retained in the cache and reuses become hits. In the extreme case, when ρ → 0 and ρL → ∞, all of the reuses become hits, i.e., the hit rate is H_max, and all of the non-reused lines bypass the cache and thus consume no cache resources, i.e., the required cache size is H_max × D. In this case, the predicted point is the green square shown in Figure 5.7; we refer to this point as the "limit point". Note that the hit rate curves predicted by both the precise and the approximate model are always below this point. When ρ = 1, i.e., with no bypass, all the reuses become hits (H_max) and the required cache size is S_max = H_max D + (1 − H_max) d_p; we refer to this point as the "max point".

[Figure 5.8: The construction of the final curve of cactusADM. The final miss rate curve is the convex hull of a set of points including (0, 0) and the knee points and max points at different protecting distances (d_p = 2^16, 2^17, 2^18).]

Figure 5.8 shows the construction of a predictable hit rate curve based on the predicted "knee" and "max" points of cactusADM at d_p = 2^16, 2^17, 2^18. Based on seven predicted points, namely (0, 0), three "knee points" (yellow) and three "max points" (blue), Talus can yield a predictable hit rate curve, shown as the solid blue line in Figure 5.8.

5.2.5 Discussion

Overall, the precise model with full information about the reuse streak pattern can predict the performance reasonably accurately. For the knee-based approximate model, there are three cases.
1. When L is large (e.g., cactusADM with d_p = 2^17), the knee-based curve is a reasonably good approximation of the actual hit rate curve because the "knee" is close to the "limit" point.
2. When L is small and the majority of reuse streaks are short (e.g., cactusADM with d_p = 2^16), the knee-based curve is also close to the actual hit rate curve because the curve is bounded by the "max" point, which is always accurate.
3. When L is small but most reuses belong to long reuse streaks (e.g., cactusADM with d_p = 2^18), the prediction is not accurate, which may result in inferior performance due to suboptimal d_p enforcement. This can be improved by using more refined profiling of the reuse streak pattern. We leave this as future work.
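Concretely, the knee and max points for one protecting distance follow directly from the profiled H_max, D and L; the sketch below (with illustrative inputs) computes them, and the point set handed to Talus also includes (0, 0).

#include <cmath>
#include <cstdio>

struct Point { double size; double hitRate; };   // (required size, hit rate)

// "Max point" (rho = 1): all reuses hit, size S_max = Hmax*D + (1-Hmax)*d_p.
Point maxPoint(double hMax, double D, double dp) {
    return { hMax * D + (1.0 - hMax) * dp, hMax };
}
// "Knee point": rho_knee from Eq. (5.7), then Eq. (5.4) and Eq. (5.5).
Point kneePoint(double hMax, double D, double L, double dp) {
    double rho = 1.0 / std::sqrt(L - hMax / (L * (1.0 - hMax)));
    double h   = hMax - (hMax / L) * (1.0 - rho) / rho;
    double s   = h * D + rho * (1.0 - h) * dp;
    return { s, h };
}

int main() {
    // Illustrative inputs only; H_max, D and L come from the profiler.
    double hMax = 0.5, D = 40000.0, L = 60.0, dp = 131072.0;
    Point k = kneePoint(hMax, D, L, dp), m = maxPoint(hMax, D, dp);
    std::printf("knee: (%.0f lines, %.3f)   max: (%.0f lines, %.3f)\n",
                k.size, k.hitRate, m.size, m.hitRate);
}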
5.3 Implementation

5.3.1 Overview

We use a coarse-grain timestamp mechanism to measure reuse distances and protecting distances. Each partition has an 8-bit counter for its current timestamp. An incoming cache line is tagged with the current timestamp of its partition, and a partition's current timestamp counter is incremented by one every 8,192 accesses. The age of a cache line is measured by the distance between the current timestamp and the timestamp at which the line was last used, i.e., (CurrentTS_i + 256 − LastUsedTS) mod 256.

For our cache protection policy, the cache is logically divided into two regions (as shown in Figure 5.9): (1) a protected region, which contains all cache lines that are under "protection", and (2) an unprotected region, which contains all cache lines that are unprotected. In this work, the unprotected cache region takes up 0.1 of the total cache size. The target size of the protected region of each partition is configured by the allocation policy, and the sum of all target sizes is equal to 0.9 of the total cache size.

[Figure 5.9: Predictive Cache Protection Policy (PCPP) enforcer. The cache is split into a protected region, holding each partition's protected lines, and an unprotected region; lines move between the regions via demotion and promotion, and enter or leave the cache via insertions and evictions.]

In general, a cache protection policy has four operations (as shown in Figure 5.9): (1) insertion, (2) eviction, (3) demotion, and (4) promotion. Promotions are expected to be rare. On a miss, all replacement candidates whose age is greater than the protecting distance d_p of their shadow partition are demoted into the unprotected region. If the number of protected lines is smaller than the target size (i.e., N_protected[i] < N_target[i]) and there are unprotected candidates, then the incoming line is inserted into the cache and an unprotected candidate is evicted from the cache. Otherwise, the incoming line bypasses the cache. On a hit, the age of the line is reset to zero. If the hit line is unprotected, it is promoted into the protected region. The reconfiguration interval in our policy is 10M instructions. The rest of this section describes how to decide the size and the protecting distance of each shadow partition. In order to derive the best protecting distance d_p for each partition from the model, we need to profile the cache access pattern, i.e., D(d_p), L(d_p) and H_max(d_p) of an application, at runtime.

[Figure 5.10: Changes of reuse streaks; dots represent reuses and crosses represent non-reuses. (a) Starting of reuse streaks (D_last > D_cur): for D_cur ≤ d_p < D_last, ΔS_start[D_cur] is incremented and ΔS_start[D_last] is decremented. (b) Ending of reuse streaks (D_last < D_cur): for D_last ≤ d_p < D_cur, ΔS_end[D_last] is incremented and ΔS_end[D_cur] is decremented.]

5.3.2 Application Profiling

To monitor the reuse streak pattern of an application, we use a 64 × 64 shadow tag array that samples 1/128 of the accesses of each thread through uniform hashing (as shown in Figure 5.11). Hence, the tag array behaves like a 32MB cache (i.e., 64 × 64 × 128 cache lines). Each entry in the array has three parts: a 16-bit partial hashed tag, an 8-bit last reuse distance (lastRD) and a 12-bit last-used timestamp (lastTS). We set the maximum reuse distance, D_max, to 128. Each shadow tag array is associated with a 12-bit current timestamp counter that increments every 64 accesses. The total cost of one shadow tag array is 18kB, and in an 8MB cache with four partitions, the total cost of four shadow tag arrays is less than 1% of the total cache size.
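A sketch of one shadow-tag entry and the 1/128 sampling filter is shown below; the field widths follow the description above, while the packing and the hash function are illustrative assumptions.

#include <cstdint>

// One entry of the 64 x 64 shadow tag array (a real design packs 36 bits
// per entry; the fields are stored loosely here for clarity).
struct ShadowTagEntry {
    uint16_t hashedTag;   // 16-bit partial hashed tag
    uint8_t  lastRD;      // 8-bit last reuse distance (saturates at D_max = 128)
    uint16_t lastTS;      // 12-bit last-used timestamp (held in 16 bits here)
};

// 1/128 uniform sampling of the access stream; the mixing hash is a
// placeholder, any reasonably uniform hash of the line address will do.
inline bool sampled(uint64_t lineAddr) {
    uint64_t h = lineAddr * 0x9E3779B97F4A7C15ULL;
    return (h >> 57) == 0;       // top 7 bits all zero: probability 1/128
}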
The average streak length can be calculated by dividing the total number of reuses by the number of reuse streaks, N_streak(d_p). In order to obtain the number of reuse streaks at runtime, we need to track the changes of reuse streaks, i.e., when a reuse streak starts or ends, for different protecting distances.

Detecting the start and end of a reuse streak

We can detect the start of a reuse streak when a d_p-protected reuse immediately follows a d_p-protected non-reuse. Figure 5.10 shows the dynamics (starting and ending) of d_p-protected reuse streaks. Figure 5.10a shows how the starting of reuse streaks changes with different protecting distances. There are three consecutive accesses to A. The reuse distance between the 1st and 2nd accesses is D_last, and the reuse distance between the 2nd and 3rd accesses is D_cur. There are three cases: (1) d_p < D_cur, in which all accesses are non-reuses and thus no new reuse streak starts; (2) D_cur ≤ d_p < D_last, in which the 2nd access to A is a non-reuse and the 3rd access to A is a reuse (this indicates a new reuse streak starts); and (3) d_p ≥ D_last, in which both the 2nd and 3rd accesses to A are reuses, so no new reuse streak starts. In summary, when D_cur ≤ d_p < D_last, a reuse streak starts, and therefore the number of reuse streak starts, S_start[d_p], increments by one. To record this efficiently, let ΔS_start[i] = S_start[i] − S_start[i − 1] with S_start[0] = 0, i.e., S_start[d_p] = Σ_{i=1}^{d_p} ΔS_start[i]. To record the increment of reuse streak starts over the range D_cur ≤ d_p < D_last, we can simply increment ΔS_start[D_cur] by one and decrement ΔS_start[D_last] by one.

Similarly, we can detect the end of a reuse streak when a d_p-protected non-reuse immediately follows a d_p-protected reuse. As shown in Figure 5.10b, when D_last ≤ d_p < D_cur, S_end[d_p] increments by one. To represent this, we increment ΔS_end[D_last] by one and decrement ΔS_end[D_cur] by one, where ΔS_end[i] = S_end[i] − S_end[i − 1] and S_end[0] = 0, i.e., S_end[d_p] = Σ_{i=1}^{d_p} ΔS_end[i].
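Because each streak start or end affects a contiguous range of protecting distances, this bookkeeping is just a difference array: one increment and one decrement per event, with a prefix sum at reconfiguration time. A minimal sketch under the D_max = 128 assumption (with D_cur and D_last already clamped to D_max + 1, as in Algorithm 5.1) is shown below.

#include <array>

constexpr int DMAX = 128;   // maximum tracked reuse/protecting distance

// Difference arrays: an event covering all d_p in [lo, hi) is recorded as
// +1 at lo and -1 at hi; the per-d_p count is recovered by a prefix sum.
std::array<long, DMAX + 2> dStart{};   // Delta-S_start
std::array<long, DMAX + 2> dEnd{};     // Delta-S_end

void recordAccess(int dCur, int dLast) {
    if (dCur < dLast) {            // a streak starts for D_cur <= d_p < D_last
        ++dStart[dCur];
        --dStart[dLast];
    } else if (dLast < dCur) {     // a streak ends for D_last <= d_p < D_cur
        ++dEnd[dLast];
        --dEnd[dCur];
    }                              // equal distances: no net change
}

// At reconfiguration: S_start[d_p] = sum of dStart[1..d_p] (same for S_end).
long sumUpTo(const std::array<long, DMAX + 2>& delta, int dp) {
    long s = 0;
    for (int i = 1; i <= dp; ++i) s += delta[i];
    return s;
}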
Profiling the H_max, L and D arrays

As shown in Algorithm 5.1, on an access, if it is a hit, D_cur is calculated as curTS − hitBlock.lastTS; if it is a miss, D_cur is set to D_max + 1. D_last is always the lastRD tagged on the hitBlock or victimBlock. Based on D_cur and D_last, ΔS_start and ΔS_end are updated according to the illustration in Figure 5.10. ΔH[x] is the total number of accesses whose reuse distance is x; on each access, ΔH[D_cur] is incremented by one.

Algorithm 5.1: Update counters on a cache access
  Data: Let curTS be the current timestamp
  Data: Each block has two fields: lastTS and lastRD
  // On a cache access
  1  if a cache block is hit then                                // hit
  2      D_cur ← min{curTS − hitBlock.lastTS, D_max + 1}
  3      D_last ← hitBlock.lastRD
         // update the line
  4      hitBlock.lastTS ← curTS
  5      hitBlock.lastRD ← D_cur
  6  else                                                        // miss
  7      D_cur ← D_max + 1
  8      D_last ← victimBlock.lastRD
         // insert a new line
  9      victimBlock.tag ← the incoming line's tag
 10      victimBlock.lastTS ← curTS
 11      victimBlock.lastRD ← D_cur
     // Update counters
 12  increment ΔH[D_cur]
 13  if D_cur < D_last then
 14      increment ΔS_start[D_cur]
 15      decrement ΔS_start[D_last]
 16  else
 17      increment ΔS_end[D_last]
 18      decrement ΔS_end[D_cur]

Algorithm 5.2 shows how to profile H_max, D and L at each reconfiguration. The total number of d_p-protected reuses, H_max[d_p], can be calculated as Σ_{i=1}^{d_p} ΔH[i]. The average reuse distance, D[d_p] (i.e., the average lifetime of each reuse), can be calculated by dividing totalReuseLifetime, which is the sum of the reuse distances, i.e., Σ_{i=1}^{d_p} ΔH[i] × i, by the total number of reuses H_max[d_p]. In order to calculate the average reuse streak length, we need to track the total number of active reuse streaks (numStreaks[d_p]). In each reconfiguration period, numStreaks[d_p] increases by S_start[d_p] and then decreases by S_end[d_p]. Note that the runtime of calculating H_max[d_p], D[d_p] and L[d_p] is Θ(D_max), since all the operations can be performed efficiently in one loop.

Algorithm 5.2: Update H_max[...], D[...] and L[...]
  1  for d_p ← 1 to D_max do
         // update H_max and D
  2      H_max[d_p] ← Σ_{i=1}^{d_p} ΔH[i]
  3      totalReuseLifetime[d_p] ← Σ_{i=1}^{d_p} ΔH[i] × i
  4      D[d_p] ← totalReuseLifetime[d_p] / H_max[d_p]
         // update L
  5      S_start[d_p] ← Σ_{i=1}^{d_p} ΔS_start[i]
  6      S_end[d_p] ← Σ_{i=1}^{d_p} ΔS_end[i]
  7      numStreaks[d_p] ← numStreaks[d_p] + S_start[d_p]
  8      totalLength[d_p] ← totalLength[d_p] + H_max[d_p]
  9      L[d_p] ← totalLength[d_p] / numStreaks[d_p]
 10      numStreaks[d_p] ← numStreaks[d_p] − S_end[d_p]
 11      totalLength[d_p] ← totalLength[d_p] − S_end[d_p] × L[d_p]
     // Reset counters
 12  for d_p ← 1 to D_max do
 13      ΔH[d_p] ← 0, ΔS_start[d_p] ← 0, ΔS_end[d_p] ← 0

5.3.3 Putting It All Together

[Figure 5.11: PCPP implementation. Sampled access addresses (1/128 sampling) update a 64 × 64 shadow tag array whose entries hold a hashed tag (16 bits), lastRD (8 bits) and lastTS (12 bits). A pre-processing step produces H_max[...], D[...] and L[...] and the predicted miss rate curves; the allocation algorithm produces target sizes; and a post-processing step produces the protecting distance and target size used by the PCPP enforcer in the last-level cache.]

Figure 5.11 shows the overall architecture of the proposed scheme. Each partition contains two shadow partitions managed by the Talus policy. On each reconfiguration, there are three steps.
1. In the pre-processing step, the scheme collects H_max, D and L according to Algorithm 5.2. Based on those values, the scheme calculates all the "max points" and "knee points". Then, based on those points, the scheme adopts the Talus technique to generate a convex miss rate curve for each thread, as described in the last section.
2. Based on the predicted miss rate curves of the partitions, the allocation algorithm then determines the best target size for each partition in order to maximize the system-level objective. In the single-core context, there is only one partition and its target size is always 0.9 of the total cache size. Note that our scheme works with any existing allocation algorithm.
3. In the post-processing step, based on the target size of each partition and the predicted miss rate curve from step (1), and again using the Talus technique, the scheme generates the target size of each shadow partition, the Talus sampling rate and the protecting distance d_p, which are needed to enforce the protecting-distance-based policy.

5.4 Evaluation

5.4.1 System Configuration

Table 5.2: System configuration
  Cores: 2 GHz, in-order
  L1 caches: split I/D, private, 32KB, 4-way set associative, 1-cycle latency, 64B line
  L2 caches: private, 8-way set associative, 256KB, inclusive, 6-cycle latency, 64B line
  L3 cache: shared, 32-way hashed set associative, 20-cycle latency
  Memory: 200 cycles zero-load latency, 32 GB/s peak memory BW

Our simulator models a shared last-level (L3) cache and an off-chip memory. The simulator is fed with L3 access traces collected from the Sniper simulator [49], which models an in-order core, on-chip L1 caches, L2 caches and a perfect L3 cache (i.e., no L3 misses). During the trace-driven simulation, the memory access latency is fed back into the trace timing and thus delays future L3 cache accesses accordingly. The detailed configuration of the system is shown in Table 5.2. The SPEC CPU 2006 benchmarks with reference inputs are used for the evaluation. For each benchmark, PinPoint [50] is used to select a representative region of one billion instructions. The simulation runs until 1B instructions have been executed for each thread, and the results are calculated using the statistics for the first one billion instructions of each application. The proposed scheme (predictable cache protection policy, PCPP) is compared with DRRIP [20] and PDP [22].
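To make the trace-driven methodology of Section 5.4.1 concrete, the following is a minimal C++ sketch of how miss latency can be fed back into trace timing. The trace format, the stand-in cache model and the fixed latencies are assumptions of this sketch, not details of the actual simulator.

#include <cstdint>
#include <vector>

// One L3 access from the front-end-generated trace (format assumed for this sketch).
struct TraceAccess {
  uint64_t cycle;   // issue cycle recorded by the front-end simulation
  uint64_t addr;    // byte address of the access
};

// Minimal stand-in for the real cache model: a direct-mapped tag array over 64B blocks.
constexpr size_t SETS = 1 << 14;
std::vector<uint64_t> tags(SETS, ~0ull);
bool l3Access(uint64_t addr) {
  uint64_t blk = addr >> 6, idx = blk % SETS;
  bool hit = (tags[idx] == blk);
  tags[idx] = blk;
  return hit;
}

constexpr uint64_t L3_LATENCY   = 20;    // from Table 5.2
constexpr uint64_t MISS_LATENCY = 200;   // zero-load memory latency from Table 5.2

// Replays the trace; every miss stalls the in-order core, so all later accesses
// of the thread slip by the miss latency (the "feedback into trace timing").
uint64_t replay(const std::vector<TraceAccess>& trace) {
  uint64_t delay = 0, finish = 0;
  for (const TraceAccess& a : trace) {
    uint64_t issue = a.cycle + delay;            // issue time shifted by earlier misses
    bool hit = l3Access(a.addr);
    finish = issue + (hit ? L3_LATENCY : L3_LATENCY + MISS_LATENCY);
    if (!hit) delay += MISS_LATENCY;
  }
  return finish;                                  // approximate cycles to drain the trace
}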
5.4.2 Results

Figure 5.12: Comparison of miss rate curves (LRU, DRRIP, PDP, PCPP, and the PCPP prediction) for six benchmarks: (a) cactusADM, (b) lbm, (c) mcf, (d) gcc, (e) sphinx3, (f) xalancbmk.

Figure 5.12 shows the miss rate curves of six benchmarks. From the figure, we can see that the performance of the proposed scheme (PCPP) matches or exceeds that of the previously proposed schemes for five of the six benchmarks. For mcf, the proposed scheme is inferior to DRRIP due to the inaccuracy of the approximate model. The figure also shows that our scheme provides reasonably good prediction, which can help a cache partitioning scheme efficiently allocate cache resources in the multi-core scenario.

5.5 Summary and Future Work

5.5.1 Summary

Cache protection techniques are widely adopted in modern high-performance cache replacement policies to avoid cache thrashing. However, due to the empirical design of existing cache protection policies, their performance, i.e., their miss rate curves, is difficult to predict, which makes it difficult to partition the cache efficiently in multi-core scenarios. In this chapter, a cache protection policy with predictable performance is proposed. The contributions are the following:

1. The concept of reuse streaks is introduced for the first time. The streak effect on cache protection policies is revealed to explain the relation between the reuse streak pattern and the performance of a cache protection policy.

2. With the reuse streak pattern information of a cache access stream, a precise and an approximate model of a cache protection policy are built to predict the hit rate and the corresponding cache size of a cache protection policy under different insertion ratios and protecting distances.

3. A profiler is designed to track the changes of the average reuse streak length at runtime. The cost of the profiler is less than 1% of the total cache size.

4. With the information about the average streak length, a predictable practical cache protection policy (PCPP) is proposed. Results show that the performance of PCPP matches the state-of-the-art cache replacement policies and also matches the online prediction.

5.5.2 Future Work

The performance of a cache protection policy is largely determined by the reuse streak pattern, i.e., how many streaks of length l exist under a protecting distance of d_p. The accuracy of the proposed model of cache protection policy performance is therefore largely determined by the precision of the information about the reuse streak pattern. More precise information could lead to more accurate prediction, and more accurate prediction can in turn help the proposed scheme choose better protecting distances and thus yield better performance. There are two potential ways to extract more accurate reuse streak information. One is to enhance the proposed runtime reuse streak profiler. For example, instead of merely counting the number of reuse streaks with length ≥ 1, the number of reuse streaks with length ≥ k (where k is a small integer) can be profiled at some additional cost (e.g., the profiler needs to record the last k reuse distances instead of just the last reuse distance, lastRD), which allows the scheme to infer a more precise reuse streak distribution. The other is to use compiler-assisted cache hints. Due to the streak effect, the performance of a cache protection policy largely depends on long reuse streaks (e.g., with length ≥ 200). Those streaks often appear in large loops of a program and could be identified by a compiler and provided to the hardware via cache hints [51]. By separating long reuse streaks from short ones, a more precise view of the reuse streak pattern can be inferred. We leave it as future work to gather more precise information about the reuse streak pattern at acceptably low cost.

The proposed cache protection policy has the capability of enforcing a cache size for each application, so it can be integrated with any high-level cache allocation policy. The predictability provided by the proposed scheme can help an allocation policy make efficient resource allocation decisions (instead of trial-and-error) to optimize various system-level objectives and thereby improve performance. We leave it as future work to design a cache allocation policy that can fully utilize the predictability provided by the underlying cache replacement policy.

Chapter 6
Low-cost Deadlock Avoidance Scheme

Handling routing- and protocol-induced deadlocks is a critical issue in sharing interconnect resources. Generally, avoiding these two types of deadlocks without losing routing freedom requires a large number of virtual channels (VCs), which imposes significant negative effects on router power, energy and frequency.
In this chapter, a topology-agnostic fully adaptive Bubble Ring (BR) scheme is first presented to avoid routing-induced deadlocks without the need for multiple virtual channels. Then, a Bubble Coloring (BC) scheme, an extension of the Bubble Ring scheme, is presented to avoid protocol-induced deadlocks without the need for multiple virtual channels. The proposed BC scheme is evaluated under real multithreaded workloads in terms of performance and of area and energy cost.

6.1 Need for Reducing Virtual Channel Cost

Deadlocks in interconnection networks include routing-induced deadlocks, caused by cyclic dependences in routing functions, and protocol-induced deadlocks, caused by dependencies among different types of messages (e.g., a reply message depends on a request message). To avoid these network abnormalities, virtual channels (VCs) [23, 24] have been used extensively in many deadlock-avoidance schemes. For example, messages in the MOESI directory cache protocol can be classified into three dependent classes. Within each message class, two VCs can be used to implement deadlock-free adaptive routing in a 2-D mesh, for instance by applying Duato's Protocol [37]: one VC employs deterministic routing (e.g., XY routing) to provide escape paths while the other VC acts as an adaptive resource to enable fully adaptive routing. Then, to avoid protocol-induced deadlocks, at least three independent virtual networks (VNs) are needed to separate the different dependent message classes.

Figure 6.1: Breakdown of (a) router area and (b) router power into buffer and non-buffer components for 6VC, 3VC and 1VC routers.

Figure 6.2: Router critical path length (in FO4) for 6VC, 3VC and 1VC routers.

The above typical way of avoiding deadlocks, however, comes at a large overhead of high VC count, manifested in router area, power and frequency. To illustrate the overhead, Figure 6.1 plots the breakdown of router area and power for different numbers of VCs at 45nm, with 3GHz frequency and 1.1V operating voltage (more details of the simulation infrastructure are described in Section 6.4). The first bars in Figure 6.1 (a) and (b) correspond to the above example using 6 VCs to avoid routing- and protocol-induced deadlocks. As can be seen, over 54% of the router's area and 35% of the router power are consumed by the virtual channel buffers, demonstrating a great need to minimize VC requirements.

The overhead of VCs can be reduced in both routing- and protocol-induced deadlock avoidance schemes. Consider first the impact of reducing the VC requirement for avoiding routing-induced deadlocks. The second bars (i.e., the 3VC bars) in Figure 6.1 (a) and (b) show that, if a deadlock-free scheme only needs one VC per VN to avoid deadlocks in the routing algorithm, the overall buffer area and power can be reduced by 50% and 43%, respectively (which corresponds to savings of 29% and 18% of total router area and power, respectively). However, to compensate for the performance degradation with fewer VCs, the design of such deadlock-free schemes becomes very challenging: it must provide a higher degree of routing freedom and, in particular, support fully adaptive routing with only one VC per VN.
Another effective but more challenging way to reduce the VC requirement is to devise schemes that handle protocol-induced deadlocks more efficiently. In the ideal form, such schemes should avoid all possible deadlocks with as few as one VC in total, regardless of the number of dependent message classes. The rightmost bars in Figure 6.1 (a) and (b) highlight the advantages. Compared to the scheme with 6 VCs, the buffer area and power are reduced by 83% and 74%, respectively (which corresponds to savings of 47% and 37% of the router area and power, respectively). However, to realize such schemes, significant modifications to previous approaches are needed.

In addition to saving resources, the number of VCs also has a considerable impact on the complexity of the router control logic, particularly the VC allocator (VA) and switch allocator (SA). The SA is affected because the input-arbitration step in the SA selects one VC among the multiple VCs within the same physical channel to participate in the output-arbitration step. The VA and SA stages typically lie on the critical path of the router pipeline and determine the router frequency [52]. Figure 6.2 compares the length of the router critical path, calculated from the delay model proposed by Peh and Dally [53]. As shown in the figure, the router critical path can be shortened by 19% and 48% in schemes with 3 VCs and 1 VC, respectively, indicating a considerable reduction in router latency and increase in throughput when configured at the maximum achievable frequency.

In summary, considering the direct impact of reducing the number of VCs on router area, power and frequency, as well as the indirect impact on network and overall system performance, it is imperative to devise efficient deadlock-free schemes that minimize the VC requirement for avoiding routing- and protocol-induced deadlocks.

6.2 Bubble Coloring

The Bubble Coloring scheme builds on the previously proposed Bubble Flow Control (BFC) [54] and the Critical Bubble Scheme (CBS) [55]. A bubble in BFC is an empty packet-sized buffer. BFC and CBS can avoid routing-induced deadlock in a ring network with a single virtual channel by always keeping at least one bubble in the ring after packet injection. The Bubble Coloring (BC) scheme extends the basic notion of bubble flow control and critical bubbles and avoids both routing- and protocol-induced deadlocks on any topology, without the need for multiple VCs. In the rest of this section, the Bubble Ring (BR) scheme is first presented to avoid routing-induced deadlock. Then, the Bubble Coloring scheme, an extension of the BR scheme, is presented to further avoid protocol-induced deadlock.

6.2.1 Avoiding Routing-induced Deadlocks

The Bubble Ring (BR) scheme is comprised of a bubble flow control mechanism and a routing algorithm. The basic idea of Bubble Ring flow control is first to construct a unidirectional virtual ring that connects all the nodes in the network. Figure 6.3 shows two examples of virtual rings (represented by dashed lines) for a 4×4 mesh and an irregular topology.¹

¹ This is one of many possible topologies which could be formed by construction or due to faults occurring in a base network such as a 2D mesh with express links.

Figure 6.3: Topologies with virtual rings: (a) a 4×4 mesh (with queues q1–q5 and packets P1–P4 used in the example below), (b) an irregular network.
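As an illustration of how such a virtual ring can be laid over a regular topology, the C++ sketch below builds one possible ring (a serpentine Hamiltonian cycle) over a k×k mesh with even k. This construction is only an assumption of the sketch; the BR scheme itself merely requires a closed walk over existing links that visits every node.

#include <cassert>
#include <cstdlib>
#include <vector>

// Returns one possible unidirectional virtual ring over a k x k mesh (even k)
// as a visiting order of node IDs, where id = y * k + x.
std::vector<int> buildVirtualRing(int k) {
  assert(k >= 2 && k % 2 == 0);
  std::vector<int> ring;
  auto id = [k](int x, int y) { return y * k + x; };

  for (int x = 0; x < k; ++x) ring.push_back(id(x, 0));           // along row 0
  for (int y = 1; y < k; ++y) {                                   // serpentine over rows 1..k-1
    if (y % 2 == 1)
      for (int x = k - 1; x >= 1; --x) ring.push_back(id(x, y));  // right to left
    else
      for (int x = 1; x <= k - 1; ++x) ring.push_back(id(x, y));  // left to right
  }
  for (int y = k - 1; y >= 1; --y) ring.push_back(id(0, y));      // back up column 0

  // Sanity check: consecutive ring nodes (and last -> first) are mesh neighbors.
  for (size_t i = 0; i < ring.size(); ++i) {
    int a = ring[i], b = ring[(i + 1) % ring.size()];
    int dx = std::abs(a % k - b % k), dy = std::abs(a / k - b / k);
    assert(dx + dy == 1);
  }
  return ring;   // each router maps ring[i] -> ring[i+1] as its escape output
}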
The ring is "virtual" because there are no additional links or queues for implementing it; only nominal control logic is needed for the routers to associate the input ports and output ports that are in this virtual ring. Note that this ring is not necessarily a Hamiltonian cycle, since a node can be visited more than once, as shown in Figure 6.3 (b). For performance reasons, the virtual ring should be as short as possible. Intuitively, such a virtual ring can always be constructed if the topology is strongly connected. With that, the critical bubble scheme [55] can be applied in such a way as to ensure there is always at least one free buffer in the virtual ring. In this way, packets in the ring can always move along the ring to reach any destination.

The routing algorithm can use the virtual ring as an escape path. A packet first tries to use its preferred output port based on a specified adaptive routing subfunction, which provides a subset of the total routing options supplied by the routing algorithm. If no preferred output port is available, the packet can be forwarded to the escape output port (i.e., the output port in the virtual ring) provided by the escape routing subfunction, also defined by the routing algorithm. For example, in Figure 6.3 (a), packets P1, P2, P3 and P4 are at the heads of queues q1, q2, q3 and q4, respectively. Each queue is in the associated router R8, R9, R12 and R13. Assume that all four queues are full. With the specified adaptive routing subfunction, a cyclic dependency on these queues may occur. For example, the adaptive routing subfunction may supply only q2, q3, q4 and q1 to packets P1, P2, P3 and P4, respectively. In this situation, our Bubble Ring routing subfunction can avoid deadlock by providing a way of escape from such a knotted cyclic dependency. In this example, since P1 cannot acquire the preferred output port supplied by the adaptive routing subfunction (i.e., q2), it can be forwarded to the output port in the virtual ring, i.e., q5. Because BR flow control guarantees that there is always at least one free buffer in the virtual ring, P1 can move to q5 eventually even if q5 is temporarily full. After P1 moves to q5, packets P2, P3 and P4 can continue to move to their preferred output ports, so this deadlock is avoided. Different from Duato's Protocol [37], where the escape path consists of the resources from an additional set of virtual channels, the escape path in our BR scheme is comprised of the same set of resources used for the adaptive paths. For example, q3, q4 and q1 are the queues for both the escape path and the preferred adaptive output queues for P2, P3 and P4.
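The output-port selection just described can be summarized in a few lines of code. The sketch below is illustrative only; the routing-subfunction and flow-control interfaces (adaptiveOutputs, escapeOutput, flowControlAllows) are assumed names standing in for the router logic, not the dissertation's implementation.

#include <optional>
#include <vector>

struct Packet { int dest; /* destination node; other fields omitted */ };

// Assumed external interfaces for this sketch.
std::vector<int> adaptiveOutputs(const Packet& p, int router);  // preferred ports (adaptive subfunction)
int escapeOutput(int router);                                   // the output port on the virtual ring
bool flowControlAllows(int inPort, int outPort);                // the bubble-ring check (F_BR, defined below)

// Try the adaptive routing subfunction first; fall back to the virtual-ring escape
// port. Returns no value if the packet must wait this cycle.
std::optional<int> selectOutput(const Packet& p, int router, int inPort) {
  for (int out : adaptiveOutputs(p, router))
    if (flowControlAllows(inPort, out))
      return out;                        // a preferred adaptive port can accept the packet
  int esc = escapeOutput(router);
  if (flowControlAllows(inPort, esc))
    return esc;                          // escape into the virtual ring
  return std::nullopt;                   // stall; the critical bubble guarantees the
                                         // escape port eventually frees up
}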
The formal description is as follows. To facilitate the discussion throughout this chapter, some basic definitions derived from [37, 56, 54] are first introduced.

Definition 1. An interconnection network, I, is a strongly connected directed graph I = G(N, Q). The vertices of the graph, N, represent the set of processing nodes. The arcs of the graph, Q, represent the set of queues associated with the communication links interconnecting the nodes. Each queue q_i ∈ Q has a capacity cap(q_i), measured in number of packets, and the number of packets currently stored in the queue is denoted size(q_i). The set of queues Q is divided into three subsets: injection queues Q_I, delivery queues Q_D, and network queues Q_N. Each node uses a queue from Q_I to send packets, which travel through the network employing queues from Q_N; when they reach their destination, they enter a queue from Q_D. Therefore, packets are routed from a queue in the set Q_IN = Q_N ∪ Q_I to a queue in the set Q_ND = Q_N ∪ Q_D. Let Q_VR be the subset of Q_N consisting of all the network queues along some minimal virtual ring that spans all nodes of I.

Definition 2. A routing function, R : Q_IN × N → ℘(Q_ND), provides a set of alternative queues to route a packet p located at the head of any queue q_i ∈ Q_IN to its destination node n_d ∈ N, where ℘(Q_ND) denotes the power set of Q_ND. A deterministic routing function provides only one alternative queue, R(q_i, n_d) = q_j, q_j ∈ Q_ND. A routing subfunction, R_s, for a given routing function R is a routing function defined on the same domain as R but whose range (i.e., set of alternative next queues) is restricted to a subset Q_NDs ⊆ Q_ND. Let R_VR be the routing subfunction that provides a next-hop queue in the virtual ring, i.e., R_VR(q_i, n_d) = q_j, q_j ∈ Q_VR.

Definition 3. A flow control function, F : Q_IN × Q_ND → {true, false}, determines the access permission for a packet p located at the head of queue q_i to enter queue q_j ∈ R(q_i, n_d). Thus, packet p is allowed to advance from q_i to q_j if F(q_i, q_j) = true.

Let cb(q_j) be the number of critical bubbles at q_j. In this work, only one critical bubble is assumed in the virtual ring, so ∑_{q_i ∈ Q_VR} cb(q_i) = 1. To keep one free packet-sized buffer in the virtual ring at all times, our flow control on the bubble ring, F_BR, is based on critical bubble flow control:

F_BR(q_i, q_j) = true if
    size(q_j) ≤ cap(q_j) − 1,                when q_i ∈ Q_VR or q_j ∉ Q_VR
    size(q_j) ≤ cap(q_j) − cb(q_j) − 1,      when q_i ∉ Q_VR and q_j ∈ Q_VR          (6.1)

When a packet outside the virtual ring wants to move into the ring (i.e., q_i ∉ Q_VR and q_j ∈ Q_VR), it cannot occupy the critical bubble. In other words, if the critical bubble is present at q_j (i.e., cb(q_j) = 1), the packet needs at least two free buffers to move into q_j (i.e., size(q_j) ≤ cap(q_j) − 2); otherwise, it needs only one free buffer.

For the remaining conditions, if the packet wants to (1) travel outside the ring (i.e., q_i ∉ Q_VR and q_j ∉ Q_VR), (2) go out of the ring (i.e., q_i ∈ Q_VR and q_j ∉ Q_VR), or (3) stay in the ring (i.e., q_i ∈ Q_VR and q_j ∈ Q_VR), it only needs at least one free buffer at q_j. Note that, based on the Critical Bubble Scheme [55], if a packet in the ring occupies the critical bubble, the newly freed buffer in the upstream router is marked as the new critical bubble, which guarantees that there is still at least one free buffer marked as a critical bubble in the ring.

This flow control mechanism guarantees that there is always a free bubble inside the virtual ring. Due to this free buffer, the packets in the ring cannot be blocked, so the routing algorithm can use the ring as its escape path to avoid routing-induced deadlocks.
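For illustration, the F_BR condition of Equation (6.1) can be checked with a few lines of code. The queue representation below (field names, the struct itself) is an assumption of this sketch, not the router's actual data structures.

struct Queue {
  int  size;            // packets currently stored, size(q)
  int  cap;             // capacity in packets, cap(q)
  int  criticalBubbles; // cb(q): 1 if this queue currently holds the critical bubble
  bool onVirtualRing;   // true if q is in Q_VR
};

// F_BR(qi, qj) from Equation (6.1)
bool fBR(const Queue& qi, const Queue& qj) {
  if (qi.onVirtualRing || !qj.onVirtualRing)
    return qj.size <= qj.cap - 1;                        // one free packet-sized buffer suffices
  // Injection into the ring: the critical bubble cannot be consumed.
  return qj.size <= qj.cap - qj.criticalBubbles - 1;
}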
The formal description of a routing function based on our BR scheme is the following:

R(q_i, n_d) = R_adaptive(q_i, n_d) ∪ R_VR(q_i, n_d)          (6.2)

The packet first tries to choose its preferred queues based on a given adaptive routing subfunction R_adaptive(q_i, n_d) (e.g., Regional Congestion Awareness routing [57]), and uses the queue in the virtual ring (i.e., q_j ∈ Q_VR) supplied by the escape routing subfunction R_VR as its alternative escape path.

Lemma 1. Packets in virtual ring queues provided by the routing subfunction R_VR can always make forward progress under F_BR flow control.

Proof. This can be proved by contradiction. Assume a deadlock occurs, such that no packet in the virtual ring can move to the next queue. F_BR flow control keeps the critical bubble inside the ring. Without loss of generality, assume the critical bubble is at router i (i.e., R_i). If there is a packet in the upstream router (i.e., R_{i−1}) along the virtual ring, the routing function supplies a path for the packet to move to router R_i when no other preferred output port is available for it. Due to the presence of the critical bubble (i.e., one free bubble) at router i, bubble ring flow control guarantees the packet can move to router R_i. This contradicts the deadlock assumption. If there is no packet in the upstream router R_{i−1}, we can continue to search backward along the ring until a packet is found, at some router R_{i−j}. Due to the free buffer available at that packet's downstream router, this packet can move.

Theorem 1. An interconnection network I with a routing function R and flow control F_BR is free from routing-induced deadlocks if there exists a virtual ring queue supplied by R_VR and controlled by F_BR such that, for each cycle, at least one packet occupying the head of a queue can be allocated to the virtual ring queue.

Proof. If there are packets in the virtual ring, then based on Lemma 1 a packet able to move along the ring can always be found. If the ring is empty, the packets outside the ring can make progress by injecting into the queues in the virtual ring as provided by the routing algorithm.

6.3 Avoiding Protocol-induced Deadlocks

Although our proposed BR scheme is able to avoid routing-induced deadlocks, it is still susceptible to protocol-induced deadlocks. For example, assume there are two message classes A and B, where class A depends on class B (e.g., request-reply classes). Also assume that all the queues in the network are full except for one critical bubble in the ring, and that all the packets in the virtual ring are from message class A. In this situation, packets from class A cannot be consumed because they are waiting for the completion of packets from class B. Packets from class B cannot move forward either, because the virtual ring is fully occupied by packets from class A and there is no room for them to inject. Hence, it is possible for protocol-induced deadlocks to occur under the BR scheme.

To avoid protocol-induced deadlocks, the Bubble Coloring scheme is proposed, which is a novel extension of the Bubble Ring scheme. The Bubble Coloring (BC) scheme guarantees that packets in one message class can always reach their destinations even though packets of other message classes are blocked from being consumed at end nodes. The basic idea is to
reserve one additional bubble (i.e., one free packet-sized buffer) in the virtual ring for each message class, besides the original critical bubble used for avoiding routing-induced deadlocks. These bubbles are represented by different colors corresponding to distinct message classes. The colored bubble for a given message class serves as a normal free buffer for packets from its own message class (i.e., it can be used for injection), but it serves as a critical bubble for all other message classes (i.e., it cannot be used for injection).

Figure 6.4: Bubble Coloring flow control: routers R0–R3 with VC0–VC3 forming the virtual ring, packets P1–P3, and black, red and blue bubbles among the empty and occupied buffers.

Figure 6.4 provides an example of the BC scheme. As shown, VC0, VC1, VC2 and VC3 form the virtual ring. The red packet P1, which is outside the ring, can move into the ring (i.e., into VC1) because it can occupy the bubble of the same color (i.e., red). However, packet P2, also red, cannot move into the ring (i.e., into VC3) because it cannot occupy a bubble of a different color (i.e., blue). Packet P3, already inside the virtual ring, can move to VC3 and pull the blue bubble to VC2 at the same time even though it has a different color from the bubble; this is because it is already in the virtual ring, as opposed to trying to inject into the ring. By reserving such a colored bubble in the virtual ring for every message class, packets from each message class always have a chance to move into the ring, i.e., they cannot be blocked by packets from other message classes. Since the packets in the ring cannot be blocked, due to the existence of the original critical bubble, no protocol-induced deadlocks can happen.

To make sure colored bubbles always stay in the virtual ring, the BC scheme has two additional rules beyond the BR scheme: (1) Once a colored bubble is occupied by a packet of the same color upon injection into the virtual ring, the packet is marked as having consumed the colored bubble and carries the color mark for that message class forward as it travels along the ring. (2) When a packet marked as having consumed a colored bubble moves out of the ring, it leaves the color mark on the newly freed buffer so that the colored bubble reappears in the ring. These two rules keep a colored bubble in the virtual ring for reuse by other packets of the same color upon injection into the ring. A distinct color, black, is assigned to the original critical bubble, such that there is no black message class; therefore, no packet can use the black bubble for injection. In this sense, our Bubble Ring scheme can be considered a special case of our Bubble Coloring scheme.

The formal description of the proposed BC scheme is as follows. Let cb_k(q_j) be the number of bubbles with color k at queue q_j, and color(q_i) be the color of the packet at the head of q_i. Assume there are M+1 colors for M different message classes, where one color (i.e., black) is designated for the original critical bubble. In this work, only one bubble is assumed for each color, so for each color k, k = 1, 2, ..., M+1, ∑_{q_j ∈ Q_VR} cb_k(q_j) = 1.
With this, BC flow control is given by the following:

F_BC(q_i, q_j) = true if
    size(q_j) ≤ cap(q_j) − 1,                                              when q_i ∈ Q_VR or q_j ∉ Q_VR
    size(q_j) ≤ cap(q_j) − ∑_{k=1, k≠color(q_i)}^{M+1} cb_k(q_j) − 1,      when q_i ∉ Q_VR and q_j ∈ Q_VR          (6.3)

If a packet at the head of queue q_i with color(q_i) wants to move into the virtual ring (i.e., q_i ∉ Q_VR and q_j ∈ Q_VR), it needs one additional free buffer besides all the bubbles at q_j that it cannot occupy (i.e., all the bubbles with a different color, ∑_{k≠color(q_i)} cb_k(q_j)). If a packet occupies a colored bubble, it takes the color mark of this bubble with it when it travels inside the ring and leaves the color mark on the newly freed buffer when it moves out of the ring. In this way, all the colored bubbles are kept in the ring.

A routing function based on our proposed Bubble Coloring scheme is the same as that based on our BR scheme in Equation (6.2). That is, packets are supplied with queues in the virtual ring if their preferred output ports are not available.

Theorem 2. An interconnection network I with a routing function R and flow control F_BC is free from protocol-induced deadlocks if there exists a virtual ring queue supplied by R_VR and controlled by F_BC such that packets from any message class cannot persistently be blocked by packets from other message classes.

Proof. The BC scheme reserves one bubble (i.e., free buffer) in the ring for each message class, so packets from different message classes each have a chance to move into the ring at some time. According to Lemma 1, due to the existence of the original critical bubble (i.e., the black bubble), packets in the ring can eventually make forward progress, which means packets from any message class can never be persistently blocked by packets from other message classes; thus no protocol-induced deadlocks can form.
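For illustration, the F_BR sketch given earlier extends naturally to colored bubbles. The encoding below (one flag per color per queue, the class/color numbering) is an assumption of this sketch, not the actual router implementation.

#include <array>

constexpr int NUM_CLASSES = 3;                 // e.g., three dependent message classes
constexpr int NUM_COLORS  = NUM_CLASSES + 1;   // plus the black (original critical) bubble
constexpr int BLACK       = NUM_CLASSES;       // no packet ever carries the black color

struct RingQueue {
  int  size = 0;
  int  cap  = 2;                                // capacity in packets
  bool onVirtualRing = false;
  std::array<bool, NUM_COLORS> bubble{};        // bubble[k]: colored bubble k currently resides here
};

// F_BC from Equation (6.3): a packet of class `color` at the head of qi asks to enter qj.
bool fBC(const RingQueue& qi, const RingQueue& qj, int color) {
  if (qi.onVirtualRing || !qj.onVirtualRing)
    return qj.size <= qj.cap - 1;
  int reserved = 0;                             // bubbles of *other* colors, including black
  for (int k = 0; k < NUM_COLORS; ++k)
    if (k != color && qj.bubble[k]) ++reserved;
  return qj.size <= qj.cap - reserved - 1;
}

A packet that injects by consuming its own colored bubble would, per the two rules above, carry that color mark with it and deposit it on the buffer it frees when leaving the ring, so the colored bubble reappears for later packets of the same class.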
6.4 Evaluation

The proposed scheme is evaluated experimentally under full-system simulation using GEM5 [58], with Garnet [59] for detailed timing of the interconnection network. DSENT [60], configured with a model of an industry-standard 45nm CMOS process, is used for router power and area estimation. A canonical four-stage Virtual Cut-Through (VCT) switched router with credit-based flow control is assumed. A routing algorithm conforming to Equation (6.2) is used, where the adaptive routing subfunction provides output ports along all minimal paths to the destination of the packet; output ports with more credits are prioritized over those with fewer credits. Table 6.1 lists the key parameters of the simulation configuration.

Table 6.1: Full-system simulation configuration
  Network topology: 4x4 or 8x8 mesh
  Router: 4-stage VCT, 2GHz
  Input buffer: 10 flits per VC
  Link bandwidth: 128 bits/cycle
  Cores: Alpha EV5 core, 2GHz
  L1 cache (I & D): private, 2-way, 32kB, 2-cycle
  L2 cache: private, 8-way, 512kB each, 6-cycle
  Coherence protocol: directory MOESI with 3 message classes
  Memory controllers: 4, located one at each corner
  Memory latency: 128 cycles

All the schemes are compared on mesh networks, on which various deadlock-free schemes have been studied extensively in prior works [37, 24]. With a typical 128-bit link width, short packets of 16 bytes contain only a single flit, while long packets carrying 64 bytes of data plus additional control information have 5 flits. The depth of the buffer for each virtual channel is 10 flits, which can hold at most two long packets. A directory MOESI coherence protocol with three message classes is used in the simulation.

The following deadlock-free schemes with varying numbers of VCs are compared in our experiments:

1. XY_3VC: XY routing is used to avoid routing-induced deadlocks, and one VC per message class is assigned to avoid protocol-induced deadlocks. This scheme requires three VCs in total.

2. XY_adaptive_4VC: Compared to XY_3VC, this scheme adds one additional adaptive virtual channel shared by all message classes and uses XY_3VC as its escape path. It requires four VCs in total.

3. BR_3VC: The Bubble Ring scheme is used to allow fully adaptive routing with routing-induced deadlock freedom, but one VC per message class is still required to avoid protocol-induced deadlocks. This scheme requires three VCs in total.

4. BR_adaptive_4VC: Compared to BR_3VC, this scheme adds one additional adaptive virtual channel shared by all message classes and uses BR_3VC as its escape path. It requires four VCs in total.

5. BC_xVC: This group of schemes applies the Bubble Coloring (BC) scheme to avoid both routing- and protocol-induced deadlocks. The number of virtual channels is x (x = 1, 2, 3 or 4).

The multithreaded PARSEC benchmark suite [61] is used in the experiments. Each core is warmed up for a sufficiently long time (a minimum of 10 million cycles) and then run until the end of the parallel region. All PARSEC benchmarks use the simsmall input set.

6.4.1 Execution Time

Figure 6.5: Execution time for PARSEC (blackscholes, bodytrack, canneal, ferret, fluidanimate, swaptions, and geometric mean), normalized to xy_3vc, for xy_adaptive_4vc, br_3vc, br_adaptive_4vc, bc_1vc, bc_4vc and bc_1vc_hf.

Figure 6.5 shows the normalized execution time for six PARSEC workloads. The results are normalized to the execution time of the workloads using the XY_3VC scheme. For real workloads such as the PARSEC benchmarks, the load rates are relatively low, i.e., no more than 0.1 packets per node per cycle on average, which results in little performance difference between the various schemes at the same router frequency, since all the schemes have the same zero-load latency. The biggest difference in execution time among all the schemes for a single benchmark (i.e., ferret) is within 5%. On average, even the best scheme (i.e., BC_4VC) achieves less than a 2% execution time reduction over the worst scheme (i.e., XY_3VC). However, as discussed in Section 6.1, reducing the number of VCs shortens the router critical path, which provides the opportunity to increase the router frequency. The BC_1VC_HF scheme, which has the same core frequency (2GHz) as the other schemes but increases the router frequency slightly from 2GHz to 2.3GHz (i.e., by 15%), achieves a significant performance improvement of 12% execution time reduction on average. These results show the ability of the BC scheme with a minimal number of VCs to provide an increased opportunity for improving overall system performance.

6.4.2 Energy

Figure 6.6: Router energy consumption for the PARSEC benchmarks, normalized to xy_3vc.

Figure 6.6 shows the normalized router energy consumption for six PARSEC benchmarks with different deadlock-free schemes. The results are normalized to XY_3VC.
As shown in the figure, the total router energy is largely affected by the number of virtual channels: the more virtual channels are needed, the more buffers are used, which causes more static power consumption. Compared to the schemes with 4 VCs (e.g., XY_adaptive_4VC), BC_1VC can reduce router energy by up to 51.2%. Under the relatively low traffic load, all the schemes have similar total hop counts, since packets in all schemes are likely to travel minimal paths. Therefore, dynamic router energy shows no significant differences among schemes with the same router frequency, since the total amount of router activity (e.g., buffer writes, VA/SA arbitrations) largely depends on the average packet hop count (i.e., the average number of routers each packet is forwarded through). BC_1VC_HF has a higher router energy consumption than BC_1VC since dynamic power increases when the router frequency becomes higher. In sum, on the one hand, our BC scheme can save a large amount of router energy (shown in Figure 6.6) without significantly degrading system performance (shown in Figure 6.5). On the other hand, our BC scheme provides the opportunity to gain significant performance improvement by increasing router frequency with comparable energy consumption.

6.4.3 Area

Figure 6.7: Comparison of per-router area (×10^4 µm²).

Figure 6.7 shows the comparison of per-router area for the different schemes. The router area consumption can be divided into buffer and non-buffer (e.g., control logic) components. Although the different schemes have roughly the same non-buffer area, the different numbers of VCs among the schemes largely affect the router buffer area. As expected, the BC scheme with the minimal one virtual channel (i.e., BC_1VC) consumes the least router area. Compared to schemes with 4 VCs (e.g., XY_adaptive_4VC), BC_1VC can save router area by up to 58.3%. Considering that there is little performance degradation with a reduced number of VCs (shown in Figure 6.5), it is clear that BC_1VC is the most area-efficient among all the schemes.

6.5 Summary

In summary, the simulation results for real multithreaded applications show that the proposed Bubble Coloring scheme, with its minimal VC requirement, can achieve significant improvements in router energy/area efficiency. Compared to conventional XY routing schemes, BC schemes can achieve comparable performance with significant router energy/area savings. The BC scheme also provides the opportunity to significantly improve system performance by increasing router frequency with no more energy consumption than other schemes.

Chapter 7
Conclusions and Future Research

This dissertation has presented various techniques to efficiently share various on-chip resources. In particular, the research has the following contributions.

• This research addresses the inefficiencies in off-chip bandwidth partitioning and builds an analytical model to reveal the relationship between memory bandwidth partitioning and various system-level performance objectives. Based on the proposed model, four optimal partitioning schemes are derived to maximize four system-level performance objectives: weighted speedup, sum of IPCs, harmonic weighted speedup and fairness, respectively.

• This research identifies the associativity loss of replacement-based partitioning in large-scale CMPs.
A novel cache partitioning scheme, named futility scaling, is proposed to largely maintain the cache associativity even with a large number of partitions.

• This research introduces the concept of reuse streaks and identifies the streak effect on the performance of cache protection policies. A performance model is built to predict the hit rate and required cache size of a cache protection policy given an insertion ratio and a protecting distance. A low-cost profiler is proposed to track the average reuse streak length of a cache access stream at runtime. Based on the model and the runtime information about the average reuse streak length, a practical cache protection policy that provides predictable performance is proposed.

• This research proposes a virtual cut-through switched scheme, called Bubble Coloring, to avoid both routing- and protocol-induced deadlocks without the need for multiple virtual channels while still enabling fully adaptive routing on any topology.

7.1 Future Research

Coordinated Management of Multiple On-Chip Resources

With the emergence of multi-programmed workloads for CMPs, the QoS of each co-scheduled application on the CMP is increasingly gaining importance. As more and more applications are consolidated on a single chip to compete for both last-level cache capacity and off-chip memory bandwidth, how these two types of limited shared resources are partitioned has an increasing impact on overall system performance. Partitioning of last-level cache capacity and off-chip memory bandwidth has interacting consequences: the performance gain from increasing the allocation of one type of resource depends on the allocation of the other. For example, increasing the cache capacity allocation reduces the number of misses, while reducing the memory bandwidth allocation increases the miss latency (e.g., increased queuing delay at memory controllers). The final execution time is largely determined by the combined effect of the allocation of these two types of resources, i.e., the product of miss rate and miss latency. The analytical model proposed in Chapter 3 can be extended to establish the relationship between performance and both cache capacity and memory bandwidth. For example, the performance of application i (i.e., IPC_shared,i) can be expressed as follows:

IPC_shared,i = APC_shared,i / API_shared,i = memory_bandwidth_share_i / F_i(cache_capacity_share_i)          (7.1)

where APC_shared,i and API_shared,i are the memory accesses per cycle and memory accesses per instruction of application i, respectively, and the function F_i(j) denotes the API of application i when it has j amount of cache capacity. APC_shared,i indicates how much memory bandwidth an application occupies at runtime, while API_shared,i is determined by the last-level cache capacity share that application i owns (i.e., API_shared,i is a function, F_i(j), of the cache capacity share). With the proposed predictable cache protection policy (Chapter 5), F_i(j) can be provided to the allocation policy at runtime. With Equation (7.1), we can follow a methodology similar to that of Chapter 3 to formulate the dual-resource allocation problem as a constrained optimization problem and derive optimal cache and memory bandwidth partitionings in a coordinated fashion.
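As a purely illustrative sketch of how Equation (7.1) could drive such a coordinated allocation, the code below exhaustively searches discrete cache and bandwidth shares for two applications to maximize the sum of IPCs. The discretization, the hypothetical API curves and the chosen objective are assumptions of this sketch, not results or mechanisms from the dissertation.

#include <cstdio>

// F_i(j): memory accesses per instruction of application i with j cache units.
// These curves are made-up numbers used only to make the sketch runnable.
double apiCurve(int app, int cacheUnits) {
  static const double f[2][9] = {
      {0.10, 0.080, 0.065, 0.055, 0.048, 0.043, 0.040, 0.038, 0.037},   // app 0
      {0.20, 0.150, 0.110, 0.080, 0.060, 0.050, 0.045, 0.042, 0.040}};  // app 1
  return f[app][cacheUnits];
}

int main() {
  const int cacheUnits = 8;        // total LLC capacity, in allocation units
  const double totalApc = 0.50;    // total memory accesses per cycle the bandwidth sustains
  double best = 0.0, bestBw = 0.0;
  int bestCache = 0;

  for (int c0 = 0; c0 <= cacheUnits; ++c0) {      // cache split between the two apps
    for (int b = 1; b < 100; ++b) {               // bandwidth split, in percent
      double apc0 = totalApc * b / 100.0, apc1 = totalApc - apc0;
      double ipc0 = apc0 / apiCurve(0, c0);                    // Equation (7.1)
      double ipc1 = apc1 / apiCurve(1, cacheUnits - c0);
      if (ipc0 + ipc1 > best) { best = ipc0 + ipc1; bestCache = c0; bestBw = b / 100.0; }
    }
  }
  std::printf("best sum of IPCs %.2f with cache split %d/%d and bandwidth split %.2f/%.2f\n",
              best, bestCache, cacheUnits - bestCache, bestBw, 1.0 - bestBw);
  return 0;
}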
Meeting Service Level Objectives

Nowadays, various cloud services often have different service-level objectives (SLOs). For example, online data-intensive (OLDI) services, e.g., web search, need low tail (worst-case) latency, e.g., 99% of requests completing within 50ms. Offline batch data processing services, e.g., MapReduce, often need high throughput (i.e., long-term average performance). Meeting those SLOs is imperative since they are directly related to the user experience and thus affect service revenue. Although co-locating various types of service workloads on the same chip can increase resource utilization and reduce operational cost, it is becoming challenging to meet different SLOs for different services at the same time by properly sharing multiple on-chip resources.

In order to meet the SLOs of a service by controlling on-chip resource allocation, it is important to understand the relationship between the performance of a service and the usage of individual resources. The research in this thesis sheds some light on how memory bandwidth and cache capacity affect the performance of an application (in Chapters 3 and 5, respectively), which is crucial for designing an on-chip resource allocation scheme that meets SLOs.

Overall, a holistic resource management scheme is needed that can efficiently manage multiple interacting on-chip resources while still satisfying the different SLOs of various services. We leave this endeavor to future work.

Bibliography

[1] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Katherine Yelick. A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67, October 2009.
[2] Karthikeyan Sankaralingam and Remzi H. Arpaci-Dusseau. Get the parallelism out of my cloud. In Proceedings of the 2nd USENIX Conference on Hot Topics in Parallelism, HotPar '10, pages 8–8, Berkeley, CA. USENIX Association, 2010.
[3] Luiz André Barroso and Urs Hölzle. The case for energy-proportional computing. Computer, 40(12):33–37, December 2007.
[4] Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The datacenter as a computer: an introduction to the design of warehouse-scale machines, second edition. Synthesis Lectures on Computer Architecture, 8(3):1–154, 2013.
[5] Seongbeom Kim, Dhruba Chandra, and Yan Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, PACT '04, pages 111–122, Washington, DC, USA. IEEE Computer Society, 2004.
[6] Moinuddin K. Qureshi and Yale N. Patt. Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th International Symposium on Microarchitecture, MICRO-39, pages 423–432, Washington, DC, USA. IEEE Computer Society, 2006.
[7] Nauman Rafique, Won-Taek Lim, and Mithuna Thottethodi. Architectural support for operating system-driven CMP cache management. In Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, PACT '06, pages 2–12, Seattle, Washington, USA. ACM, 2006.
[8] Daniel Sanchez and Christos Kozyrakis. Vantage: scalable and efficient fine-grain cache partitioning. In Proceedings of the 38th International Symposium on Computer Architecture, ISCA '11, pages 57–68, San Jose, California, USA. ACM, 2011.
[9] Sarah Bird and Burton J. Smith. PACORA: performance aware convex optimization for resource allocation. In Proceedings of the 3rd USENIX Workshop on Hot Topics in Parallelism, HotPar '11, 2011.
[10] Kun Luo, Jayanth Gummaraju, and Manoj Franklin. Balancing throughput and fairness in SMT processors. In Proceedings of the 2001 International Symposium on Performance Analysis of Systems and Software, ISPASS '01, pages 164–171, 2001.
[11] Kyle J. Nesbit, Nidhi Aggarwal, James Laudon, and James E. Smith. Fair queuing memory systems. In Proceedings of the 39th International Symposium on Microarchitecture, MICRO-39, pages 208–222, Washington, DC, USA. IEEE Computer Society, 2006.
[12] Nauman Rafique, Won-Taek Lim, and Mithuna Thottethodi. Effective management of DRAM bandwidth in multicore processors. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, PACT '07, pages 245–258, Washington, DC, USA. IEEE Computer Society, 2007.
[13] Onur Mutlu and Thomas Moscibroda. Stall-time fair memory access scheduling for chip multiprocessors. In Proceedings of the 40th International Symposium on Microarchitecture, MICRO-40, pages 146–160, Washington, DC, USA. IEEE Computer Society, 2007.
[14] Onur Mutlu and Thomas Moscibroda. Parallelism-aware batch scheduling: enhancing both performance and fairness of shared DRAM systems. In Proceedings of the 35th International Symposium on Computer Architecture, ISCA '08, pages 63–74, Washington, DC, USA. IEEE Computer Society, 2008.
[15] Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter. Thread cluster memory scheduling: exploiting differences in memory access behavior. In Proceedings of the 43rd International Symposium on Microarchitecture, MICRO-43, pages 65–76, Washington, DC, USA. IEEE Computer Society, 2010.
[16] Yoongu Kim, Dongsu Han, O. Mutlu, and M. Harchol-Balter. ATLAS: a scalable and high-performance scheduling algorithm for multiple memory controllers. In Proceedings of the 16th International Symposium on High Performance Computer Architecture, HPCA '10, pages 1–12, 2010.
[17] J. Mukundan and J. F. Martinez. MORSE: multi-objective reconfigurable self-optimizing memory scheduler. In Proceedings of the 18th International Symposium on High Performance Computer Architecture, HPCA '12, pages 1–12, February 2012.
[18] Saugata Ghose, Hyodong Lee, and José F. Martínez. Improving memory scheduling via processor-side load criticality information. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 84–95, Tel-Aviv, Israel. ACM, 2013.
[19] R. Manikantan, Kaushik Rajan, and R. Govindarajan. Probabilistic shared cache management (PriSM). In Proceedings of the 39th International Symposium on Computer Architecture, ISCA '12, pages 428–439, Portland, Oregon. IEEE Computer Society, 2012.
[20] Aamer Jaleel, Kevin B. Theobald, Simon C. Steely Jr., and Joel Emer. High performance cache replacement using re-reference interval prediction (RRIP). In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pages 60–71, Saint-Malo, France. ACM, 2010.
[21] Moinuddin K. Qureshi, Aamer Jaleel, Yale N. Patt, Simon C. Steely, and Joel Emer. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA '07, pages 381–391, San Diego, California, USA. ACM, 2007.
[22] Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero, and Alexander V. Veidenbaum. Improving cache management policies using dynamic reuse distances. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pages 389–400, Vancouver, B.C., Canada. IEEE Computer Society, 2012.
[23] William J. Dally. Virtual-channel flow control. IEEE Transactions on Parallel and Distributed Systems, 3(2):194–205, 1992.
[24] William J. Dally and Hiromichi Aoki. Deadlock-free adaptive routing in multicomputer networks using virtual channels. IEEE Transactions on Parallel and Distributed Systems, 4(4):466–475, 1993.
[25] Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens. Memory access scheduling. In Proceedings of the 27th International Symposium on Computer Architecture, ISCA '00, pages 128–138, Vancouver, British Columbia, Canada. ACM, 2000.
[26] Jeffrey Stuecheli, Dimitris Kaseridis, David Daly, Hillery C. Hunter, and Lizy K. John. The virtual write queue: coordinating DRAM and last-level cache policies. In Proceedings of the 37th International Symposium on Computer Architecture, ISCA '10, pages 72–82, Saint-Malo, France. ACM, 2010.
[27] Dimitris Kaseridis, Jeffrey Stuecheli, and Lizy Kurian John. Minimalist open-page: a DRAM page-mode scheduling policy for the many-core era. In Proceedings of the 44th International Symposium on Microarchitecture, MICRO-44, pages 24–35, Porto Alegre, Brazil. ACM, 2011.
[28] Fang Liu, Xiaowei Jiang, and Yan Solihin. Understanding how off-chip memory bandwidth partitioning in chip multiprocessors affects system performance. In Proceedings of the 16th International Symposium on High Performance Computer Architecture, HPCA '10, pages 1–12, January 2010.
[29] Engin Ipek, Onur Mutlu, José F. Martínez, and Rich Caruana. Self-optimizing memory controllers: a reinforcement learning approach. In Proceedings of the 35th International Symposium on Computer Architecture, ISCA '08, pages 39–50, Washington, DC, USA. IEEE Computer Society, 2008.
[30] Derek Chiou, Prabhat Jain, Srinivas Devadas, and Larry Rudolph. Dynamic cache partitioning via columnization. In Proceedings of the Design Automation Conference, 2000.
[31] Keshavan Varadarajan, S. K. Nandy, Vishal Sharda, Amrutur Bharadwaj, Ravi Iyer, Srihari Makineni, and Donald Newell. Molecular caches: a caching structure for dynamic creation of application-specific heterogeneous cache regions. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-39, pages 433–442, Washington, DC, USA. IEEE Computer Society, 2006.
[32] Jiang Lin, Qingda Lu, Xiaoning Ding, Zhao Zhang, Xiaodong Zhang, and P. Sadayappan. Gaining insights into multicore cache partitioning: bridging the gap between simulation and real systems. In Proceedings of the IEEE 14th International Symposium on High Performance Computer Architecture, HPCA '08, pages 367–378, February 2008.
[33] Parthasarathy Ranganathan, Sarita Adve, and Norman P. Jouppi. Reconfigurable caches and their application to media processing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pages 214–224, Vancouver, British Columbia, Canada. ACM, 2000.
[34] Christopher J. Glass and Lionel M. Ni. The turn model for adaptive routing. In Proceedings of the 19th Annual International Symposium on Computer Architecture, ISCA '92, pages 278–287, Queensland, Australia. ACM, 1992.
[35] Ge-Ming Chiu. The odd-even turn model for adaptive routing. IEEE Transactions on Parallel and Distributed Systems, 11(7):729–738, 2000.
[36] Binzhang Fu, Yinhe Han, Jun Ma, Huawei Li, and Xiaowei Li. An abacus turn model for time/space-efficient reconfigurable routing. In Proceedings of the 38th International Symposium on Computer Architecture, ISCA '11, pages 259–270, San Jose, California, USA. ACM, 2011.
[37] José Duato. A necessary and sufficient condition for deadlock-free routing in cut-through and store-and-forward networks. IEEE Transactions on Parallel and Distributed Systems, 7(8):841–854, 1996.
[38] Yong Ho Song and Timothy Mark Pinkston. A progressive approach to handling message-dependent deadlock in parallel computer systems. IEEE Transactions on Parallel and Distributed Systems, 14(3):259–275, March 2003.
[39] Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David Webb. The Alpha 21364 network architecture. In Proceedings of the Ninth Symposium on High Performance Interconnects, HOTI '01, pages 113–117, Washington, DC, USA. IEEE Computer Society, 2001.
[40] H. Vandierendonck and A. Seznec. Fairness metrics for multi-threaded processors. Computer Architecture Letters, 10(1):4–7, 2011.
[41] Allan Snavely and Dean M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS IX, pages 234–244, Cambridge, Massachusetts, USA. ACM, 2000.
[42] Hans Kellerer, Ulrich Pferschy, and David Pisinger. Knapsack Problems. Springer, 1st edition, December 2010.
[43] Ruisheng Wang and Lizhong Chen. Futility scaling: high-associativity cache partitioning. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, pages 356–367, December 2014.
[44] André Seznec. A case for two-way skewed-associative caches. In Proceedings of the 20th Annual International Symposium on Computer Architecture, ISCA '93, pages 169–178, San Diego, California, USA. ACM, 1993.
[45] Daniel Sanchez and Christos Kozyrakis. The ZCache: decoupling ways and associativity. In Proceedings of the 43rd International Symposium on Microarchitecture, MICRO-43, pages 187–198, Washington, DC, USA. IEEE Computer Society, 2010.
[46] Laszlo A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, June 1966.
[47] Nathan Beckmann and Daniel Sanchez. Talus: a simple way to remove cliffs in cache performance. In Proceedings of the 21st International Symposium on High Performance Computer Architecture, HPCA-21, February 2015.
[48] Nathan Beckmann and Daniel Sanchez. Modeling cache performance beyond LRU. In Proceedings of the 22nd International Symposium on High Performance Computer Architecture, HPCA-22, March 2016.
[49] Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. Sniper: exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC '11, pages 52:1–52:12, Seattle, Washington. ACM, 2011.
[50] Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun, and Anand Karunanidhi. Pinpointing representative portions of large Intel Itanium programs with dynamic instrumentation. In Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-37, pages 81–92, Portland, Oregon. IEEE Computer Society, 2004.
[51] Zhenlin Wang, Kathryn S. McKinley, Arnold L. Rosenberg, and Charles C. Weems. Using the compiler to improve cache replacement decisions. In Proceedings of the 11th International Conference on Parallel Architectures and Compilation Techniques, PACT '02, Washington, DC, USA. IEEE Computer Society, 2002.
[52] Yi Xu, Bo Zhao, Youtao Zhang, and Jun Yang. Simple virtual channel allocation for high throughput and high frequency on-chip routers. In Proceedings of the 16th International Symposium on High Performance Computer Architecture, HPCA '10, pages 1–11, 2010.
[53] Li-Shiuan Peh and William J. Dally. A delay model and speculative architecture for pipelined routers. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture, HPCA '01, pages 255–266, Washington, DC, USA. IEEE Computer Society, 2001.
[54] V. Puente, C. Izu, R. Beivide, J. A. Gregorio, F. Vallejo, and J. M. Prellezo. The adaptive bubble router. Journal of Parallel and Distributed Computing, 61(9):1180–1208, 2001.
[55] Lizhong Chen, Ruisheng Wang, and Timothy Mark Pinkston. Critical bubble scheme: an efficient implementation of globally aware network flow control. In Proceedings of the 25th IEEE International Parallel and Distributed Processing Symposium, IPDPS '11, pages 592–603, 2011.
[56] José Duato and Timothy Mark Pinkston. A general theory for deadlock-free adaptive routing using a mixed set of resources. IEEE Transactions on Parallel and Distributed Systems, 12(12):1219–1235, 2001.
[57] Paul Gratz, Boris Grot, and Stephen W. Keckler. Regional congestion awareness for load balance in networks-on-chip. In Proceedings of the 14th International Symposium on High Performance Computer Architecture, HPCA '08, pages 203–214, 2008.
[58] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1–7, August 2011.
[59] Niket Agarwal, Tushar Krishna, Li-Shiuan Peh, and Niraj K. Jha. GARNET: a detailed on-chip network model inside a full-system simulator. In Proceedings of the 2009 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS '09, pages 33–42, 2009.
[60] Chen Sun, Chia-Hsin Owen Chen, George Kurian, Lan Wei, Jason Miller, Anant Agarwal, Li-Shiuan Peh, and Vladimir Stojanovic. DSENT – a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In Proceedings of the 6th ACM/IEEE International Symposium on Networks-on-Chip, NOCS '12, pages 201–210, Washington, DC, USA. IEEE Computer Society, 2012.
[61] Christian Bienia and Kai Li. PARSEC 2.0: a new benchmark suite for chip multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, MoBS '09, June 2009.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Design of low-power and resource-efficient on-chip networks
Hardware techniques for efficient communication in transactional systems
Performance improvement and power reduction techniques of on-chip networks
Dynamic packet fragmentation for increased virtual channel utilization and fault tolerance in on-chip routers
Communication mechanisms for processing-in-memory systems
Efficient memory coherence and consistency support for enabling data sharing in GPUs
Enabling energy efficient and secure execution of concurrent kernels on graphics processing units
Energy efficient design and provisioning of hardware resources in modern computing systems
Improving the efficiency of conflict detection and contention management in hardware transactional memory systems
Demand based techniques to improve the energy efficiency of the execution units and the register file in general purpose graphics processing units
An FPGA-friendly, mixed-computation inference accelerator for deep neural networks
A framework for runtime energy efficient mobile execution
Lifetime reliability studies for microprocessor chip architecture
Improving reliability, power and performance in hardware transactional memory
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Intelligent near-optimal resource allocation and sharing for self-reconfigurable robotic and other networks
Resource underutilization exploitation for power efficient and reliable throughput processor
Multi-level and energy-aware resource consolidation in a virtualized cloud computing system
Energy-efficient computing: Datacenters, mobile devices, and mobile clouds
Exploiting variable task granularities for scalable and efficient parallel graph analytics
Asset Metadata
Creator
Wang, Ruisheng
(author)
Core Title
Efficient techniques for sharing on-chip resources in CMPs
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Engineering
Publication Date
06/29/2017
Defense Date
05/09/2017
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
cache partitioning, deadlocks, memory bandwidth partitioning, network-on-chip, OAI-PMH Harvest, on-chip resources
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Pinkston, Timothy (committee chair), Annavaram, Murali (committee member), Nakano, Aiichiro (committee member)
Creator Email
ruishengwang@outlook.com, ruishenw@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-392389
Unique identifier
UC11265528
Identifier
etd-WangRuishe-5469.pdf (filename), usctheses-c40-392389 (legacy record id)
Legacy Identifier
etd-WangRuishe-5469.pdf
Dmrecord
392389
Document Type
Dissertation
Rights
Wang, Ruisheng
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA