Efficient Memory Coherence and Consistency Support for Enabling Data Sharing in GPUs

by

Abdulaziz Tabbakh

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (Electrical Engineering)

May 2018

Copyright 2018 Abdulaziz Tabbakh

Dedication

This dissertation is dedicated . . . To my father Salah, my mother Faten, and to my family Asma and Omar.

Acknowledgements

First and foremost, I would like to thank God for His blessings and bounties, and I would like to thank the many people who helped me throughout my PhD journey. I would like to express my deep gratitude to my PhD advisor, Prof. Murali Annavaram, for his continuous guidance, patience, and valuable insight. The freedom he gave me to explore different research areas is the reason behind the success and the confidence that I have right now. I would like to thank my committee members, Professor Xuehai Qian and Professor Shahram Ghandeharizadeh. I also would like to thank my qualification committee members, Professor Sandeep Gupta, Professor Michel Dubois, and Professor Wyatt Lloyd, for their valuable comments. I would like to acknowledge my lab mate Dr. Mohammad Abdel-Majeed, my dearest friend Dr. Abdulaziz Alaswad, and all the friends and colleagues whom I met during my PhD journey. I would like to thank King Fahd University of Petroleum & Minerals in Dhahran, Saudi Arabia, for providing the funding for my studies. Also, I would like to express my endless appreciation to my mother, Faten AlFahal, for her unconditional support, endless help, and continuous prayers. I cannot be what I am now without all the sacrifices she and my father made for me throughout each step in my life. I cannot forget my father, Salah Tabbakh, who passed away just before I started my PhD. He was and will always be with me in every step of my life. I am grateful to have a magnificent wife, Asma Habib, who always has faith in me. I am thankful to my little child Omar. I am also thankful to my sisters Alaa, Abrar, Rawan, Maryam, and Dania for their help and support.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction: Memory System Challenges in GPUs
1.1 Thesis Statement
1.2 GPU Background
1.2.1 GPU Architecture
1.2.2 Execution Model
1.2.3 Memory Hierarchy
1.2.4 Caches Hardware and Policies
1.2.5 CTA Scheduler
1.3 Coherence and Memory Consistency
1.3.1 Contributions

Chapter 2: Power Efficient Sharing-Aware GPU Data Management
2.1 Introduction
2.2 Data Sharing in GPUs
2.2.1 Inter-CTA Data Sharing
2.2.2 Inter-SM Data Sharing
2.3 Data Locality Aware CTA Scheduler
2.4 Sharing-Aware GPU Cache Management
2.4.1 The Need for Better Cache Management
2.4.2 Cache Management Policies
2.4.2.1 Tracking Data Sharing
2.4.2.2 Sharing-Aware Cache Allocation Policy
2.4.2.3 Sharing-Aware Replacement
2.5 Implementation and Discussion
2.5.1 Microarchitecture Implementation
2.5.2 Preventing Dead Blocks
2.5.3 L1 Cache Bypassing and Performance
2.6 Evaluation
2.6.1 CTA Scheduler Evaluation
2.6.2 Cache Management Scheme Evaluation
2.6.3 Bypass Evaluation
2.6.4 Kepler Simulations
2.7 Related Work
2.7.1 CTA Scheduling and Management
2.7.2 Cache Management and Bypassing in GPUs
2.8 Conclusion

Chapter 3: G-TSC: Timestamp Based Coherence for GPUs
3.1 Introduction
3.2 Background and Motivation
3.2.1 Time-based Coherence
3.2.1.1 Globally Synchronized Clock
3.2.1.2 Cache Inclusion
3.2.1.3 Lease-Induced Stall and Contention
3.3 G-TSC: GPU Cache Coherence Using Timestamp Ordering
3.3.1 Timestamp Ordering
3.3.2 Timestamps in GPUs
3.3.3 Principles of G-TSC
3.4 G-TSC Implementation
3.4.1 Private Cache Operation
3.4.1.1 Load
3.4.1.2 Store
3.4.2 Shared Cache Operation
3.4.2.1 Loads from L1
3.4.2.2 Stores from L1
3.4.3 DRAM Operation
3.4.4 Private Cache Operation After Response from Shared Cache
3.4.5 Example of G-TSC Operation
3.5 GPU-Related Considerations
3.5.1 Update Visibility
3.5.2 Request Combining in GPUs
3.5.3 Non-Inclusive Caches in GPUs
3.5.4 Timestamp Overflows
3.6 Evaluation and Discussion
3.6.1 Evaluation Setup
3.6.2 Performance Evaluation
3.6.3 Coherence Traffic
3.6.4 Energy
3.6.5 Characteristics of G-TSC
3.7 Related Work
3.8 Conclusion

Chapter 4: An Efficient Sequential Consistency Implementation with Dynamic Race Detection for GPUs
4.1 Introduction
4.2 Hardware-Assisted Dynamic Race Detection
4.2.1 Overview
4.2.2 Dynamic Race Detection Implementation
4.2.3 GPU Speculative Execution Details
4.2.4 An Example Dynamic Race Detection with Speculative Execution
4.3 Evaluation
4.3.1 Simulation Environment
4.3.2 Benchmarks
4.3.3 Dynamic Race Detection Scheme Configuration
4.3.4 Performance Evaluation
4.3.5 Race Detection Granularity and Signature Size
4.3.6 Cache Performance
4.3.7 Energy Consumption
4.3.8 G-TSC for Sequential Consistency and Dynamic Race Detection Scheme
4.4 Related Work
4.5 Conclusion

Reference List

Appendix A: Detailed Results of the Dynamic Race Detection Scheme Evaluation

List of Tables

2.1 GPGPU-Sim Configuration for Cache Management Scheme Evaluation
3.1 Contents of Requests and Responses Exchanged Between Private and Shared Caches
3.2 Absolute Execution Cycles of TC and Baseline (BL) in Millions
3.3 GPGPU-Sim Configuration for G-TSC Evaluation
3.4 List of Evaluated Benchmarks
4.1 GPGPU-Sim Configuration for Dynamic Race Detection Scheme Evaluation
4.2 Access Energy and Leakage Power for RDU, AVU, and Register Renaming Table
A.1 Individual Benchmarks Normalized Execution Time (in Cycles) (1)
A.2 Individual Benchmarks Normalized Execution Time (in Cycles) (2)
A.3 Individual Benchmarks Normalized Execution Time (in Cycles) (3)
A.4 Synchronization Points Classification Per Benchmark (Ideal Configuration)
A.5 Synchronization Points Classification Per Benchmark (No Speculation Configuration)
A.6 Synchronization Points Classification Per Benchmark (Byte-Level Configuration)
A.7 Synchronization Points Classification Per Benchmark (Word-Level Configuration)
A.8 Synchronization Points Classification Per Benchmark (Block-Level Configuration)
A.9 Synchronization Points Classification Per Benchmark (Half-Page-Level Configuration)
A.10 Coherence Misses Classification Per Benchmark (Ideal Configuration)
A.11 Coherence Misses Classification Per Benchmark (No Speculation Configuration)
A.12 Coherence Misses Classification Per Benchmark (Byte-Level Configuration)
A.13 Coherence Misses Classification Per Benchmark (Word-Level Configuration)
A.14 Coherence Misses Classification Per Benchmark (Block-Level Configuration)
A.15 Coherence Misses Classification Per Benchmark (Half-Page-Level Configuration)
A.16 Detailed Energy Consumption for Individual Benchmarks (1)
A.17 Detailed Energy Consumption for Individual Benchmarks (2)
A.18 Detailed Energy Consumption for Individual Benchmarks (3)
A.19 Detailed Energy Consumption for Individual Benchmarks (4)
A.20 Detailed Energy Consumption for Individual Benchmarks (5)
A.21 Detailed Energy Consumption for Individual Benchmarks (6)
A.22 Detailed Energy Consumption for Individual Benchmarks (7)
A.23 Detailed Energy Consumption for Individual Benchmarks (8)
A.24 Detailed Energy Consumption for Individual Benchmarks (9)
A.25 Detailed Energy Consumption for Individual Benchmarks (10)
A.26 Detailed Energy Consumption for Individual Benchmarks (11)
A.27 Detailed Energy Consumption for Individual Benchmarks (12)

List of Figures

1.1 SM Architecture in GPUs
1.2 CUDA Hierarchy in GPUs
1.3 Memory System Architecture in GPUs
1.4 Round-Robin CTA Scheduler
2.1 CTA Distance for Shared Data Blocks
2.2 Sample CUDA Codes from Three Different Benchmarks. Variables in bold are either passed as arguments or they are constants
2.3 The CDF of Shared L2 Data Blocks vs. Number of Sharers
2.4 Proposed CTA Scheduler
2.5 Code Sample of the Gaussian Benchmark
2.6 Execution Time Variation of CTAs for 70 Kernels in the Gaussian Benchmark
2.7 Cache Block Placement in the GPU Memory System
2.8 The Flowchart of the Proposed Allocation Mechanism in Caches
2.9 Proposed Replacement Candidate Selection Flowchart for the L1 Cache
2.10 Cache Freezing Issue Example
2.11 Hardware Extension Needed for Our Design
2.12 Effect of CTA Scheduler Queue Size
2.13 Execution Time, Power, and Energy Consumption of the Proposed CTA Scheduler and BCS With and Without the Cache Management Technique
2.14 Normalized L1 Miss Rate, L2 MPKI, and DRAM Traffic with the qx adpt Scheduler
2.15 Normalized Energy Consumption of the L2 Cache, NoC, and DRAM with the qx adpt Scheduler
2.16 Normalized DRAM Power, Energy, and Off-chip Traffic with the qx adpt Scheduler
2.17 Normalized Energy Consumption of the L2 Cache, NoC, and DRAM with the Kepler Configuration
3.1 The Finite State Machine of Both L1 and L2 Caches. The prefix Pr denotes the messages received from the SM, DRAM denotes the messages received from the DRAM, and Bus denotes the messages exchanged with the NoC
3.2 The Flowchart of the Load Request from the SM
3.3 The Flowchart of the Store Request from the SM
3.4 The Flowchart of the Read Request Sent to L2 from the L1 Cache
3.5 The Flowchart of the Write Request Sent to L2 from the L1 Cache
3.6 The Flowchart of DRAM Fill and Eviction
3.7 Flowcharts of Private Cache Operation
3.8 The Flowchart of the Fill Response from the LLC
3.9 G-TSC Operation Example. The contents of the caches of each SM are shown with the wts and rts of each block in parentheses
3.10 Example of the Update Visibility Challenge in GPUs
3.11 Example of the Multiple Requests Challenge in GPUs
3.12 Performance of GPU Coherence Protocols with Different Memory Models
3.13 Pipeline Stalls due to Memory Delay in G-TSC and TC Normalized to Stalls in the No-L1-Cache Configuration
3.14 Performance of G-TSC-RC with Different Lease Values
3.15 NoC Traffic of GPU Coherence Protocols with Different Memory Models
3.16 NoC Traffic of G-TSC with Different Lease Values
3.17 Total Energy Consumption of GPU Coherence Protocols with Different Memory Models
3.18 L1 Cache Energy (in Joules) of GPU Coherence Protocols with Different Memory Models
3.19 GPU Energy Consumption of G-TSC with Different Lease Values
4.1 Motivational Example for the Dynamic Race Detection Scheme for SC in GPUs
4.2 Simple Example of the Dynamic Race Detection Scheme
4.3 Proposed Architecture to Implement the Dynamic Race Detection Scheme in GPUs
4.4 Example of the Mechanism of the RDU
4.5 Example of Execution That Shows Which Updated Locations Are Kept in the RDU and Which Are Kept in the AVU
4.6 Example Code to Explain the Dynamic Race Detection
4.7 The Timeline of the First Scenario. The square brackets next to the epoch number show the contents of the AVU signature
4.8 The Timeline of the Second Scenario. The square brackets next to the epoch number show the contents of the AVU signature
4.9 Performance of the GPU with the Dynamic Race Detection Scheme
4.10 Percentage of Synchronization Points due to Race Detection and Synchronization Points due to Signature Saturation
4.11 Percentage of Coherence Misses in the Private Cache
4.12 Energy Consumption Breakdown for the Dynamic Race Detection Scheme
4.13 Performance Comparison Between G-TSC and the Dynamic Race Detection Scheme for Sequential Consistency

Abstract

Graphics Processing Units (GPUs) are designed primarily to execute multimedia and game rendering applications. These applications are characterized by streaming data with little to no data sharing between threads. Because of their high power efficiency, massive parallel computational capability, and high off-chip memory bandwidth, GPUs are now making inroads into executing general purpose applications that have significant, but somewhat irregular, parallelism. The improvements in programming interfaces such as CUDA and OpenCL accelerate the adoption of GPUs for general purpose applications. However, these new application usages do not align well with the underlying GPU architecture. In particular, some of the irregular applications do share data between threads, and they also exhibit inter-thread communication patterns that are not well supported in current GPU hardware. Unlike traditional graphics applications, which mostly deal with streaming data, the new class of applications also shows some level of temporal and spatial locality between threads executing in the same kernel or thread block. But GPUs have limited cache capacity and do not support efficient inter-thread communication through memory. As such, the programmer/compiler must find ad-hoc solutions to tackle these challenges. This thesis presents a set of unifying GPU memory system improvements that enable efficient data sharing between threads, as well as comprehensive coherence and consistency models to enable efficient inter-thread communication.

The first part of this thesis shows that there is significant data sharing across threads in a GPU while executing general purpose applications. However, due to poor thread scheduling, data sharing leads to the replication of data in multiple private caches across many streaming multiprocessor cores (SMs) in a GPU, which in turn reduces the effective cache size.
To tackle this challenge, this thesis presents an efficient data sharing mechanism that reduces redundant data copies in the memory system. It includes a sharing-aware thread block (also called Cooperative Thread Array (CTA)) scheduler that attempts to assign CTAs with data sharing to the same SM to reduce redundant storage of data in private L1 caches across SMs. The design is further enhanced with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies private and shared data. Private blocks are given higher priority to stay longer in the L1 cache, and shared blocks are given higher priority to stay longer in the L2 cache. The evaluation experiments show that the proposed design reduces the off-chip traffic by 19%, which translates into an average DRAM power reduction of 10% and a performance improvement of 7%.

The second part of the thesis focuses on supporting intuitive memory coherence and consistency models that programmers are familiar with in the CPU domain. The thesis presents GPU-centric Time Stamp Coherence (G-TSC), a novel cache coherence protocol for GPUs that is based on timestamp ordering. G-TSC conducts its coherence transactions in logical time rather than physical time and uses timestamp-based self-invalidation of cached data, which reduces the coherence traffic dramatically. The thesis demonstrates the challenges in adopting timestamp coherence for GPUs, which support massive thread parallelism and have unique microarchitecture features, and then presents a number of solutions that tackle these GPU-centric challenges. Evaluation of G-TSC shows that it outperforms time-based coherence by 38% with release consistency.

The third part of the thesis explores efficient approaches to enforce sequential consistency in GPUs. The main intuition behind this work is that a significant fraction of the coherence traffic can be curtailed by simply delaying the propagation of updated data values across SMs until the end of an epoch, where an epoch is broadly defined as the time between two data race occurrences. A data race occurs when two threads concurrently access data where at least one access is a write access. The thesis presents a simple bloom filter based signature generation mechanism that keeps track of write-sets from each SM in a signature and uses the signature to dynamically detect races. Data updates are propagated when a race is detected from the signatures, which in turn provides sequentially consistent execution. The evaluation of the proposed scheme shows that it can achieve sequential consistency with a performance overhead as low as 5% and an energy overhead as low as 2.7%.

Chapter 1
Introduction: Memory System Challenges in GPUs

Graphics processing units (GPUs) are parallel accelerators that are designed to execute graphics applications. With their ability to execute thousands of threads simultaneously, GPUs are able to break the teraflops performance barrier with high power efficiency [30, 56]. The outstanding growth in arithmetic throughput and memory bandwidth in GPUs, which surpasses their CPU counterparts, makes them attractive for executing a broad range of parallel applications that are not just graphics-oriented. Significant resources are being invested to broaden the use of GPUs for non-graphics applications such as scientific, medical, financial, and engineering applications [55, 56].
The continuous enhancement of CUDA, a general purpose parallel computing platform and application programming model created by Nvidia [18], along with OpenCL [57], has facilitated the adoption of GPUs to solve complex computational problems that go beyond graphics and multimedia applications in an efficient way.

Since GPUs have emerged as a high performance computing (HPC) platform for general purpose applications, there is also a need to evolve the microarchitecture of GPUs for their broader adoption. In particular, the memory system behavior of non-graphics applications has been shown to be significantly different from that of traditional graphics applications. For instance, prior cache efficiency evaluation studies [40, 37, 61, 71] show that both L1 and L2 caches suffer from high miss rates even though the access patterns of the cached data show high locality. Traditional graphics applications had limited data reuse across threads, as these applications exhibit streaming data accesses with limited inter-thread data reuse. As such, each streaming multiprocessor (SM) core has a relatively small private L1 cache. Since each SM can execute thousands of threads simultaneously, the available private L1 cache space per thread is limited to around 10 bytes. Due to the massive multithreading in GPUs, even a reasonably small per-thread working set will result in premature eviction of useful data and cache thrashing [60]. Hence, any available locality within each thread of a general purpose application is entirely hidden by the cache thrashing. Our own analysis shows that only 9% of L1 cache misses are compulsory misses. Given that only a small fraction of cache misses are compulsory misses, we can infer that general purpose applications running on GPUs show strong data locality among executing threads. In particular, 91% of the cache misses are either conflict or capacity misses that could have been reduced if the effective cache size were increased.

The second challenge in adapting GPUs to general purpose parallel applications is the limited support for inter-thread communication. Since graphics applications have limited inter-thread communication, GPUs have traditionally supported limited memory coherence and did not support a reasonable memory model that programmers can rely on to reason about parallel thread interactions. GPUs employ simple software-based coherence protocols that assume no data races and coarse-grained synchronizations. Currently, one common approach to implement coherence is to separate data producers and consumers into two different kernels and allow the data producer kernel to run to completion before allowing the data consumer kernel to execute. In particular, the application developer may simply invalidate the private caches at the beginning of a producer kernel execution and flush the dirty data back to some shared memory (typically the CPU host DRAM) at the end of the kernel execution. Then the modified data is copied back to the GPU memory before the data consumer kernel is launched. In this model, fine-grained synchronizations are assumed to be infrequent. If fine-grained synchronization is desired, one current solution is to simply disable the use of private caches and execute all memory operations at the shared L2 cache, or to use atomic operations that bypass the private L1 cache entirely [66, 25]. To summarize the challenge, the GPU memory system does not provide any mechanism to propagate updates during kernel execution.
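To make the kernel-boundary producer/consumer pattern described above concrete, the following host-side sketch launches a hypothetical producer kernel, waits for it to finish, and only then launches the consumer kernel. The kernel names, buffer sizes, and launch configuration are illustrative assumptions, not code from this thesis; only the standard CUDA runtime calls are used.

    #include <cuda_runtime.h>

    // Hypothetical kernels: the producer writes a buffer, the consumer reads it.
    __global__ void produce(int *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = i * 2;                 // producer writes
    }

    __global__ void consume(const int *buf, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = buf[i] + 1;            // consumer reads the produced values
    }

    int main() {
        const int n = 1 << 20;
        int *buf, *out;
        cudaMalloc(&buf, n * sizeof(int));
        cudaMalloc(&out, n * sizeof(int));

        // The producer kernel runs to completion before the consumer is launched.
        // The kernel boundary is the coarse-grained synchronization point that
        // makes the producer's updates visible to the consumer kernel.
        produce<<<(n + 255) / 256, 256>>>(buf, n);
        cudaDeviceSynchronize();

        consume<<<(n + 255) / 256, 256>>>(buf, out, n);
        cudaDeviceSynchronize();

        cudaFree(buf);
        cudaFree(out);
        return 0;
    }

Kernels launched into the same stream already execute in order, so the explicit cudaDeviceSynchronize() mainly documents the coarse-grained synchronization point this coherence pattern relies on.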
Since general purpose applications may require inter-thread communication during kernel execution, cache coherence in classical GPUs can be achieved by disabling the private L1 caches. Although this seems to be a feasible solution, it leads to significant performance loss and an increase in interconnection network traffic to access the shared L2 cache. Prior work has indicated that supporting hardware cache coherence on GPUs would enable efficient porting of a broad range of parallel applications. For example, a CUDA-Cut benchmark [73] implementation on a coherent GPU memory is expected to be 30% faster than its implementation on a GPU that disables the private L1 cache [67]. Prior work also suggests that GPU applications that require cache coherence can achieve up to 88% performance improvement over disabling the L1 cache when an ideal coherence mechanism is available [67]. Cache coherence is also a necessary building block for designing memory consistency models that allow programmers to reason about data propagation.

This thesis addresses the aforementioned issues, specifically the poor cache management and the lack of coherence and consistency models in GPUs. The first part of this thesis shows that there is significant data sharing across threads in a GPU while executing general purpose applications. However, due to poor thread scheduling, data sharing leads to the replication of data in multiple private caches across many streaming multiprocessor cores (SMs) in a GPU, which in turn reduces the effective cache size. To tackle this challenge, this thesis presents an efficient data sharing mechanism that reduces redundant data copies in the memory system. It includes a sharing-aware thread block (also called Cooperative Thread Array (CTA)) scheduler that attempts to assign CTAs with data sharing to the same SM to reduce redundant storage of data in private L1 caches across SMs. The design is further enhanced with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies private and shared data. Private blocks are given higher priority to stay longer in the L1 cache, and shared blocks are given higher priority to stay longer in the L2 cache.

The second part of the thesis focuses on supporting intuitive memory coherence and consistency models that programmers are familiar with in the CPU domain. The thesis presents GPU-centric Time Stamp Coherence (G-TSC), a novel cache coherence protocol for GPUs that is based on timestamp ordering. G-TSC conducts its coherence transactions in logical time rather than physical time and uses timestamp-based self-invalidation of cached data, which reduces the coherence traffic dramatically. The thesis demonstrates the challenges in adopting timestamp coherence for GPUs, which support massive thread parallelism and have unique microarchitecture features, and then presents a number of solutions that tackle these GPU-centric challenges.

The third part of the thesis explores efficient approaches to enforce sequential consistency in GPUs. The main intuition behind this work is that a significant fraction of the coherence traffic can be curtailed by simply delaying the propagation of updated data values across SMs until the end of an epoch, where an epoch is broadly defined as the time between two data race occurrences. A data race occurs when two threads concurrently access data where at least one access is a write access.
The thesis presents a simple bloom filter based signature generation mechanism that keeps track of write-sets from each SM in a signature and uses the signature to dynamically detect races. Data updates are propagated when a race is detected from the signatures, which in turn provides sequentially consistent execution.

1.1 Thesis Statement

The overarching goal of this thesis is to improve the memory system design of GPUs to enable efficient execution of general purpose applications on GPUs. It achieves this goal by first designing a new cache management scheme and a thread block scheduling algorithm that allow efficient sharing of data across thread blocks. Second, the thesis describes a novel cache coherence scheme based on self-invalidation of data using logical timestamps. Finally, the thesis presents a mechanism to propagate data updates in a lazy manner only when a data race is detected, which in turn ensures that sequential consistency can be achieved in GPUs.

1.2 GPU Background

1.2.1 GPU Architecture

Although GPUs designed by different manufacturers can differ in their detailed microarchitectural design, they all share many of the broader features that support their parallel execution paradigm. In this work, the Nvidia GTX480 Fermi architecture is used as the baseline, and as such we use Nvidia's terminology while describing the GPU hardware and software models, although much of the description generalizes to other GPU architectures such as the AMD and Intel GPU models. The GTX480 consists of 16 parallel processing cores called streaming multiprocessors (SMs), or shader cores, connected to six memory modules through an interconnection network. Each SM is equipped with 32 streaming processors (also called CUDA cores) that are able to execute integer, single precision, and double precision floating point operations, one per clock per thread. Thus a total of 512 streaming processors (32 x 16 SMs) are provisioned within each GPU. Along with the CUDA cores, each SM has 16 load/store units that are able to calculate the source and destination addresses of sixteen memory requests per clock, and 4 special function units (SFUs) that are able to execute transcendental instructions such as sine, cosine, reciprocal, and square root, one per thread per cycle. Figure 1.1 shows the overall architecture of the GTX480 SM.

Figure 1.1: SM Architecture in GPUs

1.2.2 Execution Model

GPUs are currently treated as a co-processor to the CPU, where the co-processor is able to concurrently execute several threads in parallel. To take advantage of the GPU, applications must be written in a specialized programming language such as CUDA or OpenCL. GPU applications are written as a series of kernels, which are embedded within the CUDA or OpenCL program. The CPU starts executing the main program, and when it encounters a kernel it offloads it to the GPU. Each kernel performs a parallel computational task on the GPU, and when the task is complete the GPU returns control back to the CPU, which may launch the next kernel at that time. A kernel can be treated as a function that is executed on the GPU with many different threads doing the same computation but with different data. GPU codes are known to be data-parallel and/or compute-intensive functions.

Each kernel is organized as a grid of thread blocks, called Cooperative Thread Arrays (CTAs) or work groups, and a CTA is the basic unit of work that can be assigned to a streaming multiprocessor (SM) core.
Threads in the same CTA can be synchronized: threads are suspended until all threads reach a synchronization instruction. On the other hand, threads in different CTAs cannot be synchronized. A CTA is identified by its CTA ID, which represents its number within a kernel. The CTA ID, which can be a 2- or 3-dimensional ID, can be used in memory addressing. Similarly, a thread is identified by its thread ID, which represents its number within a CTA. The thread ID, which can also be a 2- or 3-dimensional ID, can be used in memory addressing. All CTAs of the same kernel have the same number of threads (i.e., they have the same size and dimensions).

Much of the hardware resources needed to execute a CTA are statically assigned to the CTA at the time the CTA is launched for execution on an SM. For instance, all the registers and memory needed by a CTA are pre-assigned at the start of the CTA execution. As such, there is a maximum limit on the number of CTAs in a kernel and also a maximum limit on the number of threads per CTA. The maximum number of CTAs that can be assigned to an SM is limited by resource availability (e.g., the number of thread contexts, the size of shared memory, or the size of the register file). This number varies between kernels depending on how many resources are needed by each CTA.

Each CTA is further split into subgroups of the same number of threads, called warps, that are executed in lockstep fashion. Each warp is assigned a set of consecutive threads (threads with consecutive thread IDs). Figure 1.2 explains the relationship between kernels, CTAs, warps, and threads. As the figure shows, a GPU application may contain multiple kernels. Each kernel is organized into a three-dimensional array of CTAs; hence each CTA can be identified by a three-dimensional ID. Also, each CTA is organized into a three-dimensional array of threads, and each thread can be identified by a three-dimensional ID within the CTA. The dimensions of the kernel and CTA are set by the programmer and can be used in the kernel code for memory addressing or other computations. A warp is typically a collection of 32 threads, and all threads execute using the single instruction multiple thread (SIMT) execution model. In the SIMT model, all threads either execute the same instruction, but on different data items, or a subset of threads may be inactive during the execution. In other words, it is not possible for the 32 threads in a warp to execute two different instruction streams concurrently.

Finally, recall that a single SM has only 4 special function units (SFUs). Hence, when a warp needs to execute an SFU instruction, the warp needs 8 cycles to execute all of its 32 threads using the limited number of SFUs. Also note that each SM has a 128KB register file and can store the architectural context of at most 48 warps. Hence, the maximum number of threads per SM is 1536.

Figure 1.2: CUDA Hierarchy in GPUs

1.2.3 Memory Hierarchy

The GPU memory hierarchy consists of the register file, L1 caches (data, texture, constant, and instruction), the L2 cache, and off-chip GDDR DRAM [17, 55]. A 64KB on-chip configurable memory is attached to each SM and can be configured as 48KB shared memory/16KB L1 data cache or 16KB shared memory/48KB L1 data cache. Each SM is connected to an instruction cache, a constant cache, and a texture cache, which are all read-only caches. Constant and texture caches are managed by the programmer/compiler.
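The shared memory/L1 split described above is exposed to CUDA programmers through the runtime cache configuration API. The snippet below is a minimal illustration of how a program might request a preference; the kernel name and launch parameters are hypothetical, and only the documented cudaDeviceSetCacheConfig and cudaFuncSetCacheConfig calls are used.

    #include <cuda_runtime.h>

    __global__ void my_kernel(float *data) {        // hypothetical kernel
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] *= 2.0f;
    }

    int main() {
        // Device-wide preference: favor a larger L1 data cache (48KB L1 / 16KB shared).
        cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

        // Alternatively, express the preference for a single kernel only.
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

        float *d;
        cudaMalloc(&d, 1024 * sizeof(float));
        my_kernel<<<4, 256>>>(d);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }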
In order to conserve memory bandwidth, memory requests issued by the same warp are grouped, when possible, by a coalescing unit before they access the caches. Thus, if the 32 threads in a warp access consecutive memory words, a single warp load instruction may generate only a few coalesced memory requests to the L1 cache. L1 caches are private per SM and can be addressed by all warps that execute on that SM. The L2 cache is shared between all SMs and can be accessed by any thread in the executing kernel. The L2 cache is partitioned into multiple banks, and each bank is connected to a DRAM channel. The SMs are connected to the L2 cache banks by an interconnection network. Each L2 cache bank caches the blocks that are fetched from the DRAM channel connected to it [17, 55]. At both levels, miss status holding registers (MSHRs) record pending misses, similar to a CPU design. Figure 1.3 shows the organization of the GPU memory system, which is similar to modern GPUs from NVIDIA and AMD.

Figure 1.3: Memory System Architecture in GPUs

1.2.4 Caches Hardware and Policies

Allocation policies differ between the L1 and L2 caches: cache space is allocated upon a miss in the L2 cache (ON MISS) and upon a fill in the L1 cache (ON FILL). According to this policy, in the L2 cache the victim block is determined and evicted when an access results in a miss, while in the L1 cache the victim block is determined and evicted when the new block is filled from the lower level cache [8]. The GPU L2 cache uses a write-back with write-allocate policy, similar to a conventional CPU last level cache (LLC). The L2 cache evicts the victim block upon a miss in order to free and reserve space for the miss request. When an access misses in the L2 cache, the replacement policy selects a victim block and evicts it. This eviction may generate a writeback request if the evicted block is dirty. The emptied cache block is then marked RESERVED for the requested block until the request is served by main memory. Once the data is fetched from main memory, it is placed in the reserved space and marked VALID. It can be marked MODIFIED if it is updated by any of the executing threads.

The GPU L1 data cache allocates blocks upon receiving a fill response with new data from the lower level cache (the L2 cache). The L1 cache adopts a write-through with write-allocate policy for global data and a write-back policy for local data. Data that can be accessed by any thread in the kernel is called global data, while data that can be accessed by only one thread is called local data. All data is global by default, but the programmer/compiler can annotate some data as local. Both CUDA and OpenCL provide annotation features that allow a programmer to mark data as local or global. Typically, global data uses a write-through policy. Using write-through for global data allows other cores to observe the most recently updated data through the shared L2 cache, while there is no need to share updated local data since it can be accessed by a single thread only. These policies and data categorizations are somewhat unique to GPUs and are usually considered better than a simple write-back policy. GPU caches do not enforce inclusion. In fact, NVIDIA GPU caches are non-inclusive, non-exclusive, meaning that cache blocks brought into the L1 cache are also cached in the L2 cache, but evicting or replacing a block in L2 does not generate a recall request to the L1 caches [6, 14].
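As an illustration of the write policies just described, the following simulator-style C++ sketch handles a store at the L1 data cache: global stores are written through to L2, while local stores are only marked dirty and written back on eviction. The class and function names are hypothetical, and the sketch abstracts away tags, MSHRs, and coalescing; it is not the exact hardware behavior, only a model of the policy.

    #include <cstdint>

    enum class Space { Global, Local };

    struct CacheLine {
        uint64_t tag   = 0;
        bool     valid = false;
        bool     dirty = false;   // only meaningful for local (write-back) data
    };

    // Hypothetical L1 controller sketch: write-through for global data,
    // write-back for local data, write-allocate in both cases.
    struct L1DataCache {
        void writeToL2(uint64_t /*addr*/) { /* forward the store to shared L2 (stub) */ }
        CacheLine line;                          // single line stands in for the array
        CacheLine &allocate(uint64_t addr) {     // reserve/fill a line for addr (stub)
            line.tag = addr >> 7; line.valid = true; return line;
        }

        void handleStore(uint64_t addr, Space space) {
            CacheLine &l = allocate(addr);       // write-allocate
            if (space == Space::Global) {
                writeToL2(addr);                 // write-through: L2 sees the update now
                l.dirty = false;
            } else {
                l.dirty = true;                  // write-back: update stays in L1
            }
        }

        void evict(uint64_t addr) {
            if (line.valid && line.dirty)
                writeToL2(addr);                 // dirty local data is written back lazily
            line.valid = line.dirty = false;
        }
    };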
Figure 1.4: Round-Robin CTA Scheduler

1.2.5 CTA Scheduler

Recall that each SM may need to execute multiple CTAs, and hence scheduling CTAs appropriately is critical for performance. On NVIDIA GPUs [55], a GigaThread Engine is responsible for CTA scheduling. However, there are not many details available about how this engine works. In this thesis we assume a baseline round-robin (RR) CTA scheduler where CTAs are assigned to each SM in a round-robin manner. Each SM is assigned the maximum number of CTAs allowed by the available resources. Once a CTA finishes execution, another CTA is scheduled into the empty CTA slot until all CTAs have been assigned to SMs. After calculating the maximum number of CTAs per SM, the CTAs are assigned to SMs in a round-robin manner as shown in Figure 1.4. In RR, each SM has a pointer that points to the last CTA assigned to that SM. When a CTA context becomes available in any SM, its pointer jumps to the nearest unassigned CTA and assigns it to that SM. For example, CTA 1 is assigned to SM 1, CTA 2 is assigned to SM 2, and so on, until all cores are assigned a CTA. The same procedure is repeated until each core holds its maximum number of CTAs. After this initial round-robin allocation, a new CTA is allocated whenever an existing CTA finishes execution. In the figure, the SM that first finishes one of its already assigned CTAs will get CTA 4 to execute. This scheduling scheme guarantees a fair load distribution between cores and also maximizes the utilization of each core; a sketch of this baseline assignment loop is given at the end of this subsection.

Other scheduling policies used in GPUs include greedy-then-oldest (GTO) and the two-level scheduler [22]. These policies, along with round-robin (RR), are mainly used for warp scheduling. GTO issues instructions from a single active warp for as long as possible, without stalling, before selecting another ready warp. Once the selected warp is stalled (e.g., waiting for a long memory operation to complete), the scheduler picks the oldest ready warp and starts issuing instructions from that warp. The two-level warp scheduler places all warps waiting on long latency events, such as memory accesses, in a pending queue and keeps the active warps, which are either waiting on a short latency dependency or whose input operands are already available in the register file, in an active queue. The scheduler issues instructions from the active warp queue in a round-robin manner.
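Returning to the baseline CTA scheduler, the C++ sketch below summarizes the round-robin assignment described at the start of this subsection. The data structures, the maxCtasPerSm limit, and the interface are simplified assumptions for illustration, not the GigaThread Engine's actual implementation.

    #include <vector>

    // Simplified round-robin CTA scheduler sketch (baseline assumed in this thesis).
    struct Sm {
        int runningCtas = 0;   // CTAs currently resident on this SM
    };

    class RoundRobinCtaScheduler {
        std::vector<Sm> sms;
        int maxCtasPerSm;      // limited by registers, shared memory, thread contexts
        int nextCta = 0;       // next CTA id to hand out
        int totalCtas;

    public:
        RoundRobinCtaScheduler(int numSms, int maxPerSm, int numCtas)
            : sms(numSms), maxCtasPerSm(maxPerSm), totalCtas(numCtas) {}

        // Initial allocation: repeatedly walk over the SMs, giving each SM one CTA
        // per pass, until every SM holds its maximum or CTAs run out.
        void initialAssign() {
            bool assigned = true;
            while (assigned && nextCta < totalCtas) {
                assigned = false;
                for (auto &sm : sms) {
                    if (sm.runningCtas < maxCtasPerSm && nextCta < totalCtas) {
                        ++sm.runningCtas;     // launch CTA `nextCta` on this SM
                        ++nextCta;
                        assigned = true;
                    }
                }
            }
        }

        // When a CTA finishes, the freed slot immediately receives the next CTA id,
        // so the SM that frees a slot first gets the next CTA in program order.
        void onCtaFinished(int smId) {
            --sms[smId].runningCtas;
            if (nextCta < totalCtas) {
                ++sms[smId].runningCtas;      // launch CTA `nextCta` here
                ++nextCta;
            }
        }
    };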
1.3 Coherence and Memory Consistency

We provide basic background on memory coherence and consistency models. This background is necessary as we discuss GPU-centric coherence and consistency models in later parts of this thesis. Coherence is typically defined with the "single writer multiple reader" invariant: at any given moment in time, there is either a single writer or multiple readers for any given memory location [69]. The implementation of a cache coherence protocol typically involves three aspects: 1) propagating the new value to all sharers, either by invalidating or updating private copies; 2) acknowledging the global performance of store operations; and 3) maintaining write atomicity [67] when required (i.e., the value from the store operation is atomically seen by all threads at once). Some coherence protocols disregard some of these aspects partially or entirely [69].

While coherence deals with how values are propagated for a single memory location, it is generally not sufficient for reasoning about parallel thread interactions where multiple memory locations may be accessed. Memory consistency is defined as "a specification of the allowed behavior of multithreaded programs executing with shared memory" [69]. It dictates the valid ordering of memory operations to different locations. Hence, the memory consistency model specifies the possible values a dynamic load may return and the possible final state of memory [2, 69]. In this work, we consider the implementation of Sequential Consistency (SC) and Release Consistency (RC) on GPUs. Sequential consistency (SC) [43] requires that the memory operations of a program appear to be executed in some global sequence, as if the threads were multiplexed on a uniprocessor. SC restricts many architecture and compiler optimizations and usually leads to lower performance [2]. Release Consistency (RC) is a relaxed memory consistency model that allows reordering of memory operations to different addresses. RC also relaxes the write atomicity requirements. The programmer can affirm the order between memory operations using fences. In summary, SC and RC can be considered two extremes: SC is the most restrictive memory model and RC is a more relaxed one. There are models in between, such as Total Store Order (TSO) [2].

The GPU memory system implements a weak form of release consistency without coherence. It assumes no inter-thread communication during kernel execution and hence does not implement any mechanism to propagate updates between threads in different CTAs. However, the programmer can achieve coherence by disabling the private L1 cache, either fully or partially, and relying on the shared L2 cache. Modern GPUs allow programmers to annotate some accesses as coherent accesses, which are then executed at the shared L2 cache. They also allow programmers to bypass the L1 cache for all data accesses. Bypassing the L1 cache and performing memory operations at the shared L2 cache provides coherent execution of memory operations, but it increases the interconnection traffic and hurts performance. GPUs also do not provide any mechanism to synchronize all threads in a kernel. The GPU memory system provides fine-grained synchronization in the form of atomic operations that are executed at the shared L2 cache. Although these atomic operations are useful, they are limited and not sufficient for complex operations. Kernel boundaries are used for coarse synchronization. The private L1 caches spill dirty data to main memory at the end of kernel execution and flush clean data at kernel launch.
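Under this weak model, programmers who need intra-kernel communication typically combine L1-bypassing or volatile accesses with explicit memory fences. The CUDA sketch below shows the common flag-publication idiom using __threadfence(); it is an illustration of fence usage under the existing GPU memory model, not a mechanism proposed in this thesis, and it assumes both thread blocks are resident on the GPU at the same time (otherwise the spin loop could hang).

    #include <cuda_runtime.h>
    #include <cstdio>

    // Illustrative producer/consumer handshake between two thread blocks.
    // The volatile qualifiers keep the accesses from being served by stale
    // private-cache copies or registers.
    __global__ void handshake(volatile int *data, volatile int *flag, int *result) {
        if (blockIdx.x == 0 && threadIdx.x == 0) {
            *data = 42;           // (1) produce the value
            __threadfence();      // (2) make the data visible device-wide before the flag
            *flag = 1;            // (3) publish
        } else if (blockIdx.x == 1 && threadIdx.x == 0) {
            while (*flag == 0) {} // (4) spin until the producer publishes
            *result = *data;      // (5) observes 42, not a stale value
        }
    }

    int main() {
        int *data, *flag, *result;
        cudaMalloc(&data, sizeof(int));
        cudaMalloc(&flag, sizeof(int));
        cudaMalloc(&result, sizeof(int));
        cudaMemset(flag, 0, sizeof(int));

        handshake<<<2, 32>>>(data, flag, result);
        cudaDeviceSynchronize();

        int h = 0;
        cudaMemcpy(&h, result, sizeof(int), cudaMemcpyDeviceToHost);
        printf("consumer read %d\n", h);
        cudaFree(data); cudaFree(flag); cudaFree(result);
        return 0;
    }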
1.3.1 Contributions

This thesis makes the following contributions:

Power Efficient Sharing-Aware Data Management: We propose a sharing-aware CTA scheduler that attempts to assign CTAs with data sharing to the same SM to improve temporal and spatial locality. The approach then dynamically classifies private and shared data. Private blocks are given higher priority to stay longer in the L1 cache, and shared blocks are given higher priority to stay longer in the L2 cache. Essentially, the proposed approach increases the lifetime of shared blocks and private blocks in different cache levels in order to reduce the overall off-chip traffic and premature eviction, and to increase the effective cache size.

Timestamp-Based Coherence Protocol: We propose G-TSC, a novel cache coherence protocol for GPUs that is based on timestamp ordering. G-TSC conducts its coherence transactions in logical time. G-TSC can reduce coherence traffic and storage overhead in a similar manner as time-based coherence protocols, while eliminating many of the limitations of using a synchronized clock to maintain time.

An Efficient Sequential Consistency Implementation with Dynamic Race Detection: We propose an efficient, lightweight mechanism to implement sequential consistency in GPUs. The main intuition behind the proposed scheme is that a significant fraction of the coherence traffic can be curtailed by simply delaying the propagation of updated data values across SMs until a race is detected. A data race occurs when two threads concurrently access data where at least one access is a write access. The proposed scheme uses a simple bloom filter-based signature generation mechanism that keeps track of write-sets from each SM in a signature and uses the signature to dynamically detect races. It also employs a novel selective self-invalidation technique that invalidates the updated data only.
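To make the write-set signature idea concrete, the following C++ sketch shows one possible bloom-filter signature an SM could maintain: stores insert their block addresses, and a membership test against another SM's signature conservatively flags a potential race. The hash functions, signature size, block granularity, and interface are illustrative assumptions, not the exact hardware design evaluated later in the thesis.

    #include <bitset>
    #include <cstdint>

    // Minimal bloom-filter write-set signature sketch (illustrative parameters).
    class WriteSignature {
        static constexpr int kBits = 1024;          // signature size (assumption)
        std::bitset<kBits> bits;

        // Two cheap hashes over the cache-block address (assumption).
        static int h1(uint64_t block) { return static_cast<int>(block % kBits); }
        static int h2(uint64_t block) { return static_cast<int>((block * 2654435761u) % kBits); }

    public:
        // Record a store performed by this SM (tracked at 128-byte block granularity).
        void insertStore(uint64_t addr, int blockShift = 7) {
            uint64_t block = addr >> blockShift;
            bits.set(h1(block));
            bits.set(h2(block));
        }

        // Conservative membership test: may report false positives, never false negatives.
        bool mayContain(uint64_t addr, int blockShift = 7) const {
            uint64_t block = addr >> blockShift;
            return bits.test(h1(block)) && bits.test(h2(block));
        }

        void clear() { bits.reset(); }              // reset at the end of an epoch
    };

    // An access from one SM that hits in another SM's write signature is a potential
    // data race: updates must be propagated and a new epoch started.
    inline bool potentialRace(const WriteSignature &remoteWrites, uint64_t addr) {
        return remoteWrites.mayContain(addr);
    }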
Chapter 2
Power Efficient Sharing-Aware GPU Data Management

2.1 Introduction

As described in the previous chapter, general purpose computation on GPUs is increasingly popular, as GPUs are considered a power efficient approach for achieving high throughput. Current GPUs include several hardware and software features to support general purpose applications. Recall that they typically have a complex cache hierarchy consisting of private L1 caches per streaming multiprocessor core (SM), a shared L2 cache connected to all SMs through an interconnection network, and banked high bandwidth DRAM. These structures reduce the memory bandwidth consumption of applications with irregular memory access patterns. However, recent studies have shown that even with such a hierarchy, general purpose applications on GPUs experience significant memory access bottlenecks [61, 60]. Each SM within a GPU can execute thousands of threads simultaneously, which limits the available private L1 cache space per thread to around 10 bytes; for instance, a 16KB L1 cache per SM that can run up to 1536 threads in some recent GPUs. Due to the massive multithreading in GPUs, even a reasonably small per-thread working set will result in premature eviction of useful data and cache thrashing [60]. The effective use of caches is of great importance since it could reduce off-chip memory accesses and therefore improve power efficiency. However, the massive multithreading in GPUs makes capturing data locality in caches very difficult [60]. To understand the efficiency of the GPU memory hierarchy, we conducted experiments to classify the cache misses (details later). Our results show that only 19% of L2 cache misses are compulsory misses. This percentage drops to less than 9% in the L1 caches. Hence, it is clear from our initial experiments that there is in fact significant locality in general purpose applications running on GPUs, but this locality could not be exploited due to the very small cache allocation per thread. In principle, cache efficiency could be improved by pinning live cache blocks in the cache so that they can be reused before eviction, by using an adaptive cache insertion policy that does not pollute the cache with streaming data, or by cache bypassing (e.g., sending the fetched cache block to the next level in the memory hierarchy without inserting the data into the current cache level). While these techniques are intuitively effective, the real challenge is how to guide cache allocation and bypassing policies in the context of GPUs to fully derive the benefits.

In this chapter, we first make the motivational observation that GPU applications share data across neighboring cooperative thread arrays (CTAs). For load balancing purposes, GPUs spread these CTAs across multiple SMs, which results in data replication in the private L1 caches. We introduce a sharing-aware CTA scheduler design, which takes advantage of data sharing across CTAs and assigns sharers to one SM as much as possible while taking into account the load balance among SMs. This approach ensures that shared data does not get replicated across multiple L1 caches, which effectively increases the L1 cache size. To further reduce the impact of replication, we propose to place only a single copy of shared data in the L2 cache and let the shared data bypass the L1 cache, which in turn could increase the lifetime of active private blocks. We bring the two ideas under a unified sharing-aware cache management framework for GPUs. We propose a simple mechanism to classify private and shared data dynamically, and we propose a cache insertion and replacement policy that enables private and shared data to stay longer in the private L1 cache and the shared L2 cache, respectively. Using these two innovations, the sharing-aware cache management framework increases the effective size of caches, reduces premature block eviction, and increases the cache block lifetime. We implemented our proposed design in a cycle-accurate simulator and showed that our design can reduce the off-chip traffic by 19% and reduce DRAM power consumption by 10% while improving the overall performance by 7%.

2.2 Data Sharing in GPUs

2.2.1 Inter-CTA Data Sharing

In this section we present motivational data to quantify how much data is shared across CTAs within a kernel. In order to study inter-CTA data sharing, we study the relationship between the CTAs that share the data. To quantify this relation, a linearized CTA ID is used in this study. The linear CTA ID is calculated as CtaId = BlockId.x + (BlockId.y * GridDim.x) + (BlockId.z * GridDim.y * GridDim.x). Figure 2.1 shows inter-CTA data sharing. The Y-axis shows the fraction of the total data accessed by the kernel that is shared by at least two CTAs. The X-axis shows the distance between the two sharer CTAs, in terms of the linearized CTA ID difference. The data shows that around 70% of data sharing happens between neighboring CTAs (CTAs that have a linear distance of less than five). These results present the average computed over a wide range of GPU benchmarks (details presented later). Clearly, this data shows that sharing is prevalent across CTAs and most of the data sharing occurs within a short CTA distance. Our study also shows that around 60% of the data accessed is shared by multiple CTAs. This shows that data sharing is prevalent across CTAs. To understand the reasons behind data sharing, we analyzed the CUDA source code for several benchmarks. A closer look at the CUDA codes of the benchmarks shows that the reason for the vast amount of sharing is that many load addresses are calculated from parameterized data such as block IDs, thread IDs, constant parameters loaded by ld.param (such as block dimensions and grid dimensions), or other constants passed in during kernel initialization. Figure 2.2 shows prominent data access segments from three different benchmarks where the addresses of loads or stores are determined by constants, thread IDs, and CTA IDs only.
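Before walking through the individual code segments, the linearized CTA ID used throughout this analysis can be computed inside a kernel directly from CUDA's built-in block and grid variables. The device helper below is a small illustrative sketch and is not taken from the benchmarks discussed next.

    // Linearized CTA id: CtaId = blockIdx.x + blockIdx.y * gridDim.x
    //                          + blockIdx.z * gridDim.y * gridDim.x
    __device__ __forceinline__ unsigned linearCtaId() {
        return blockIdx.x
             + blockIdx.y * gridDim.x
             + blockIdx.z * gridDim.y * gridDim.x;
    }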
In the code segment from NN, the array (Layer4 Neurons GPU) is indexed using blockIdx:y only, which means that neighboring CTAs that have consecutive values of blockIdx:x will access the same memory location. In the Backprop benchmark, the array input cuda is indexed using index in which is computed using onlyblockIdx:y and hence all threads that belong to the same blockIdx:x will use the same index to access the array. In LUD, the varaible global row id used to index array m is computed in a similar way toindex in in Backprop. With cache block size of 128 bytes and data types of 32 or 64 bits (4 or 8 bytes), a single cache block can serve 32-64 memory request of consecutive addresses. Hence, some of the data sharing may also occur at the cache line granularity, even if the byte addresses may be slightly dierence. Data sharing analysis presented in [40] is compatible with our ndings about inter-CTAs data sharing. Figure 2.1: CTA Distance for Shared Data Blocks 2.2.2 Inter-SM Data Sharing While inter-CTA data sharing is a function of the kernel code, it does not necessarily mean that these CTAs also share data using the L1 cache. If these CTAs are allocated to dierent SMs then 18 ... for (int i=0; i<100; ++i ){ result+=Layer4_Neurons_GPU[i+(100*blockIdx.y)]* Layer4_Weights_GPU[weightBegin+i]; } ... Layer5_Neurons_GPU[blockIdx.x+(10*blockIdx.y)]=result; ... (a). NN ... int index_in = HEIGHT * blockIdx.y + 1; ... input_node[threadIdx.y] = input_cuda[index_in] ; ... hidden_partial_sum[blockIdx.y * hid + threadIdx.y] = weight_matrix[threadIdx.x][threadIdx.y]; ... (b). backprop ... int global_row_id = offset + (blockIdx.y+1)*BLOCK_SIZE; int global_col_id = offset + (blockIdx.x+1)*BLOCK_SIZE; peri_row[threadIdx.y][threadIdx.x] = m[(offset+threadIdx.y)*matrix_dim+global_col_id+threadIdx.x]; peri_col[threadIdx.y][threadIdx.x] = m[(global_row_id+threadIdx.y)*matrix_dim+offset+threadIdx.x]; ... m[(global_row_id+threadIdx.y)*matrix_dim+global_col_id+threadIdx.x] -= sum; ... (c). LUD Figure 2.2: Sample Cuda Codes from Three Dierent Benchmarks. Varibles in Bold are either passed as an arguments or they are constants inter-CTA data sharing transforms into inter-SM data sharing. Data is considered shared across dierent SMs when it is being accessed by threads that are executed on dierent SMs. While inter-CTA data sharing may be treated positively, inter-SM data sharing is the main culprit for creating redundant copies of data across multiple private caches. Figure 2.3 shows the CDF of the sharing behavior of cached blocked between SMs in a sample of the evaluated benchmarks. The gure shows how inter-CTA data sharing between neighboring CTA is manifested into inter-SM sharing between multiple SMs using round-robin (RR) CTA scheduler. Recall from the previous chapter description that RR scheduling simply assigns consecutive CTAs to consecutive SMs. Our analysis shows that 60% of data blocks are in fact shared across SMs. Recall that 70% of the data is shared across CTAs and unfortunately a vast majority of this sharing translates into inter-SM 19 data sharing, which in turn causes unnecessary data duplication in the private L1 cache of each SM. In fact, each shared data block is accessed on average by 2.41 SMs. Figure 2.3: The CDF of Shared L2 Data blocks vs. Number of Sharers 2.3 Data Locality Aware CTA Scheduler Even though sharing is prevalent among CTAs as shown in Section 2.2.1, the conventional round- robin CTA scheduler perturbs this locality by assigning consecutive CTAs to dierent cores. 
Hence, data locality among CTAs is aected. In order to exploit the inter-CTA data sharing, we propose a sharing-aware CTA scheduler. In our proposed CTA-scheduler, we consider the inter-CTA data sharing observations found in Section 2.2.1. The proposed scheduler splits CTAs into groups, each group has N consecutive CTAs as identied by the linearized CTA id, and assigns each group for execution on a specic SM. Figure 2.4 shows the proposed CTA-scheduler. Instead of round-robin allocation on a per- CTA basis, the allocation is done at a coarse granularity of CTA groups. The scheduler maintains the current and end pointer for each CTA group. The scheduler starts by issuing the maximum 20 allowed CTAs to each SM from the CTA group assigned to that SM. When one of the assigned CTAs completes its execution then the scheduler picks the next CTA from the same group. Hence, the current pointer of each SM is advanced by 1 after each CTA assignment until it reaches the edge of its assigned set of CTAs. When the CTA-pointer reaches the end of the CTA group, then a new CTA group is allocated for that SM.This CTA-assignment policy ensures that neighboring CTAs are assigned to the same SM and thereby enabling shared data to be brought into just one L1 cache and leverage data sharing behaviour observed in Section 2.2.1. The group size (N) must be chosen based on the trade-o analysis between the benets of the CTA scheduling scheme and the potential negative eects of load imbalance and resource under utilization. When the group size is too large then the execution time of slowest group will determine the overall kernel execution time. If the group size is too small then neighboring CTAs may span dierent groups and these discontinuities at the end of each group lead to data replication across dierent SMs. We propose to set the group size depending on kernel grid size (gridDim) since memory addresses are mostly based on the grid size specication. Following the CTA linearized ID used in our analysis in 2.2.1, we will use gridDim:x as the size of the CTA assignment group. Thus each group is as large as the X-dimensionality of the grid. With this group size choice, all CTAs in the group have the same blockIDx:y and their linearized IDs dier by 1. Figure 2.4: Proposed CTA scheduler 21 Next, we discuss the issues with load imbalance. If the number of CTA groups is not a multiple of the number of SMs in the GPU, some SMs will be assigned more groups than the others and hence there is a possibility that those SMs with more CTA groups assigned take more time to nish their execution and become the execution bottleneck. For example, assume a 16-SM GPU, and a kernel with 17 CTAs groups. According to the proposed CTA scheduler, all SMs will be assigned 1 CTA group except one SM that will get 2.The one SM with 2 CTA groups will determine the overall kernel execution time; note that the basic assumption in this argument is that the execution time of each CTA group is roughly the same. The 15 SMs that are assigned 1 CTA group will remain idle until the SM with 2 CTA groups nish it groups and then the kernel will nish. Hence, the execution time is a function of the number of CTA groups executed. The other possible source of workload imbalance is the unequal amount of work performed by dierent threads in a CTA. The number of instructions (memory, ALU and SFU instructions) per threads varies because of divergence. 
Figure 2.5 shows a code snippet from the codes of Gaussian benchmark as an example of load imbalance due to divergence. The thread that does not pass the rst two conditional statements does not execute any instructions (i.e. neither memory operations nor ALU/SFU instructions). The variableyidx is calculated from constant parameters (blockIdx.y, blockDim.y, threadIdx.y) and based on this value the instruction in line 5 will be executed or not. We can quantify the total amount of work (number of instructions) per CTA as the sum of all instructions executed by thread in that CTA. When we compare the amount of work across all CTAs, we notice that the number of instructions executed by dierent CTAs varies and thus their execution time. The execution time of the CTA is bounded by the slowest thread (the CTA is not considered done until all threads are done). The variation in execution time between CTAs can be caused by the variation in the number of instructions executed or the latency of the instruction executed. Figure 2.6 shows an example of the execution time variation between CTAs in dierent kernels of Gaussian benchmark. It shows the average, maximum, and minimum execution time of CTAs in 70 dierent kernels in Gaussian benchmark. There are multiple CTAs that take longer to 22 execute within each kernel. However, further analysis showed that these slowest executing CTAs are equally dispersed amongst all the CTA groups. As a result, even though there are variations in CTA execution time the execution time of a CTA group remained roughly the same. Hence, in general our approach of assigning a group of consecutive CTAs (delineated by their X-dimension of the input grid) does not lead to load imbalance. Figure 2.5: Code sample of (Gaussian) Benchmark Figure 2.6: Execution Time variation of CTAs for 70 Kernels in Gaussian Benchmark However, in order to avoid any workload imbalance issue that may arise in exceptional cases, we propose a modication to the sharing-aware CTA scheduler. The modied scheduler switches back to the conventional RR CTA scheduler towards the end of kernel execution. In particular, 23 the scheduler assigns CTA groups to SMs until the number of remaining CTA groups is less than the number of SMs. At that time it switches to the conventional RR CTA scheduler. The RR CTA scheduler used at the end of kernel execution reduces the impact of load imbalance. It normalizes the amount of work executed by all the SMs as it assigns fewer CTAs to an SM that takes longer to execute a specic CTA. Figure 2.4 shows the round-robin CTA scheduling for load balancing. The CTAs in dark boxes are scheduled using the load balancing round-robin scheduler. 2.4 Sharing-Aware GPU Cache Management The previous section described an approach to increase the probability that two CTAs that share data may be assigned to the same SM. But in cases where the neighboring CTAs span dierent CTA groups, the problem of data replication still persists. We still need a mechanism to improve L1 cache eciency by avoiding data replication. In this section, we will describe a sharing-aware cache management scheme that achieves this goal. GPU threads can access data from multiple memory spaces during their execution.In particu- lar, each thread has its own private local memory space, while all threads in a kernel may access the same global memory space. Both the local and global memory spaces may be cached in L1 and L2 caches. 
Our goal is to separate the local memory, which is by denition private data, from global memory, some of which may be shared across CTAs. 2.4.1 The Need for Better Cache Management GPU caches do not enforce inclusion. In fact, NVIDIA GPU caches adopt non-inclusive, non- exclusive caches meaning that cache blocks cached in L1 are also cached in L2 but eviction of a block in L2 cache does not necessary cause an eviction of all of its copies in L1 caches [14]. This scheme causes possible data redundancy between L1 and L2 cache. Figure 2.7 shows an example of redundancy. In this example, block A is only requested by SM3 and has two copies: one in 24 SM 's L1 cache and another one in L2 cache. The copy in L2 cache will not be accessed unless the copy in L1 is evicted. On the other hand, block B has ve copies: four in L1 caches (due to inter-CTA data sharing where CTAs are spread across SMs) and one copy in L2 cache. We can eliminate this duplication by either moving all the sharers to one SM or making all the SMs access a single copy of that block. Our proposed sharing-aware CTA scheduler already targets moving sharers to one SM. We need to further eliminate any replication when sharers are assigned to dierent SMs. Eliminating the duplicated copies increases the eective size of caches. In our example, if we are able to eliminate all the replicated copies, we would be able to increase the eective cache size by 3:5 (i.e. use only 2 cache lines to cache these data block instead of using 7 cache lines). Avoiding cache block duplication and managing block placement policies can help improve cache performance and increase the eective cache size. Since cache access latency (L1 and L2 caches) is much faster than main memory access latency, increasing the eective cache size, and hence the number of distinct data blocks cached, would reduce the average access latency, increase the cumulative cache hit rate, and therefore reduce the main memory trac. Figure 2.7: Cache Block Placement in GPU Memory System The data locality-aware CTA scheduler proposed in Section 2.3 could reduce the inter-SM data sharing but cannot eliminate it completely. This ineciency is because data blocks have various sharing patterns which cannot be fully captured by CTA scheduling, as conrmed in our experimental results (details will be discussed in Section 2.6). CTA scheduling alone only reduces 25 the number of shared blocks from 60% to 41.76% and reduces the average number of sharing cores to 2 cores per block. In order to achieve further performance improvements and power eciency, we propose cache management schemes that exploit the notion of shared and private data accesses. 2.4.2 Cache Management Policies We propose data sharing-aware cache management scheme guided by the principle of avoiding cache block duplication. To design such a cache management scheme, we rely on classifying private and shared cache blocks. We dene private and shared blocks as follows: Private blocks are dened as those blocks that are accessed by a thread or multiple threads that are executing on the same SM. Shared blocks are blocks accessed by multiple threads that are executing on dierent SMs. It is important to note that private blocks do not mean that they are accessed by a single thread. Even if a block is shared by multiple CTAs, if all the CTAs execute on a single SM then that block is classied as private under our denition. Figure 2.7 shows an example of shared and private cache blocks (block A is private while block B is shared). 
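As a minimal software illustration of these definitions (a sketch with hypothetical names, not part of the proposed hardware), a block remains private as long as the set of SMs that have accessed it contains exactly one SM:

    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    // Hypothetical model of the private/shared classification defined above:
    // a cache block is private while all of its accesses come from one SM,
    // and it becomes shared once a second SM touches it.
    struct SharingClassifier {
        std::unordered_map<uint64_t, std::unordered_set<int>> accessors; // block address -> SM IDs

        // Record an access by smId and report whether the block is still private.
        bool accessAndIsPrivate(uint64_t blockAddr, int smId) {
            auto &sms = accessors[blockAddr];
            sms.insert(smId);
            return sms.size() == 1;
        }
    };

The hardware scheme described later in Section 2.5.1 approximates this bookkeeping with a per-block owner-SM field and a single sharing-flag bit instead of tracking the full set of sharers.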
Our goal is to maximize the chance that private L1 caches keep private blocks while shared L2 caches keep shared blocks. Specically, private blocks are given higher priority to stay longer in L1 cache and shared blocks are given higher priority to stay longer in L2 cache. Moreover, at the cache block insertion time, we avoid polluting L1 caches with shared data in order to reduce the premature eviction of private data. We also reduce L2 cache pressure by bypassing local data that are accessed by only one thread. 2.4.2.1 Tracking Data Sharing As stated earlier, CUDA uses the notion of local and global memory to control data visibility across threads. The cache blocks in local memory space are known to be private to a single thread and will not be accessed by another SM. We consider local cache blocks as private as dened by the 26 CUDA programming model. Since local memory accesses use a dierent instruction pneumonic (ld.local) versus global memory access (ld.global), it is easy to classify the cache block accessed through ld.local as private. In contrast, global cache blocks can be accessed by all threads but if a global block is accessed by threads residing only on a single SM then that block is treated as private. Thus we need a mechanism to track how many SMs are sharing global cache blocks. Cache block sharing could be tracked at the L2 level at low cost. When an L2 block is accessed by an SM, we simply tag that block as being accessed by that SM. In the future, if a dierent SM accesses that block then that block can be marked as shared (detailed implementation described in the next section). Note that on an L2 cache miss, an MSHR entry is allocated for the L2 miss. The fact that an MSHR entry is allocated in L2 on a cache miss ensures that even when a L2 cache miss request is being serviced from memory, the L2 cache can still identify when more than one SM requests the same cache block. 2.4.2.2 Sharing-Aware Cache Allocation Policy Our cache block allocation and replacement policies ensure two important properties: P1. Only private data can cause private data evictions in L1 cache. P2. Only possible shared data can cause shared data evictions in L2 cache. Any block identied as shared block is only inserted in private L1 if the victim block in L1 is invalid or if the victim block itself is another shared block. Thus a shared block does not cause any private data evictions from L1 cache. In reality only rarely L1 cache blocks are invalid and hence vast majority of the shared blocks stay only in L2 cache. For private blocks, the priority is to place them in private L1 cache. They are only inserted in L2 if the victim block is invalid or if the victim block itself is another private block. Thus private data does not cause a shared block eviction from L2 cache. The precise allocation and replacement policies are shown in the ow chart in Figure 2.8. The ow chart is triggered on a load instruction executed by an SM. If the load uses ld.local instruction 27 then it is unambiguously classied as a private block. If the block is a hit in L1 then the data is provided and no further action is taken. If the block misses in L2 then the block is fetched from memory and it bypasses the L2 cache entirely, since this data is guaranteed to be private and accessed only by the requesting SM. In the rare case if the block hits in L2 then the block is fetched from L2 to L1 and no further action is taken. 
If the load uses ld.global instruction then that block may or may not be private depending on whether other SMs access the block. First, if the block is a hit in L1, data is provided to the SM and no further action is taken. If the block hits in L2, then the L2 cache checks whether the requesting SM is dierent than the SM that originally brought the cache block into L2. If so, that block is marked as shared. And the block is bypassed from the L1 cache of the requesting SM. If the block misses in L2 it is fetched from main memory. The block is placed in L2 cache and is tentatively marked as private and it is then delivered to L1. The assumption is that on L2 miss the cache block is treated as most likely private but because of the lack of complete information we do keep the block in L2. Note that in the ow chart above bypassing L2 cache means that no L2 cache resources (i.e. victim cacheline or MSHR entry) are allocated for the request. The request is sent to main memory and has the ID of the requesting SM. With this information, the response from memory is directly sent from DRAM to interconnection network without placing the data into L2 cache. Similarly, bypassing L1 cache means no L1 cache resources are allocated to the request. Instead the data is directly delivered to the destination register of the load instruction within the SM. 2.4.2.3 Sharing-Aware Replacement The ow chart above describes the cache allocation policy. We now describe the replacement policy. We modify the basic least recently used (LRU) in L1 and L2 cache to design a sharing- aware replacement scheme. For this purpose we divide L1 and L2 cache sets into two logical sub-sets. One sub-set holds all the private blocks and the other sub-set holds the data marked as 28 shared. Replacement policy in L2 cache favors replacing private blocks over shared ones. When a replacement request reaches L2 cache controller, the cache blocks in the designated set are checked. The least recently used private block is picked as a replacement candidate. If there are no private blocks (i.e. all cache blocks in the set are shared), the least recently used shared block is picked. This policy extends the lifetime of shared blocks in L2 cache by prioritizing the replacement of private blocks over shared ones. On the other hand, replacement policy in L1 caches favors replacing shared blocks over private ones. Upon receiving a replacement request, L1 cache controller checks cache blocks in the des- ignated set. The least recently used shared cache block is picked rst as a replacement candidate and if there are no shared blocks, then the least recently used private block is picked. Figure 2.9 shows the mechanism to choose the replacement candidate in L1 cache. It also includes the dead block detection mechanism described later in Section 2.5.2 (in the dotted box). L2 caches use similar mechanism except it does not include the dead block detection mechanism but L2 cache swaps private and shared block handling. 2.5 Implementation and Discussion 2.5.1 Microarchitecture Implementation To implement the sharing tracking mechanism, each block in L2 cache is augmented with a 4-bit owner SM eld (ownerId), indicating the ID of the SM that triggered the rst L2 miss and initiated the memory request. Also each L2 cache block keeps a 1-bit sharing ag (SF ) (initialized to 0, indicating not shared) to indicate whether the block is shared. On a L2 cache hit the requesting SM ID is checked against the ownerId eld to determine if a block is shared or private. 
The SF bit is set when they do not match. To support the replacement algorithm in L1 cache, each L1 cache block keeps a 1-bit ag (private ag (PF)) which is set if the block is deemed private. Information about data sharing is 29 Figure 2.8: The Flowchart of The Proposed Allocation Mechanism in Caches provided by L2 cache based on the block's SF bit and piggyback together to L1 in the response from L2. If the SF is set, the block is marked as shared. Figure 2.11 highlights the additional cache tag bits needed for our design. 2.5.2 Preventing Dead Blocks Prioritizing the replacement decision by sharing information may lead to dead blocks. Dead blocks are blocks that are no longer used by any current threads but are not replaced. In L2 cache, the modied replacement policy does not cause dead blocks. All global cache blocks are placed in L2 cache conservatively assuming they may be shared, therefore any cache block could become eviction candidate when it is the least recently used and all other blocks in that set are either reserved for incoming block or shared. 30 Figure 2.9: Proposed Replacement Candidate Selection Flowchart for L1 Cache In L1 cache, since the victim selection decision is made when the block is lled, a private block can become a dead block and may still not get evicted, particularly when the cache is not highly contended. In order to detect dead private blocks in L1 cache and evict them, we use the LRU pointer. If the LRU pointer points to a private block, the block is declared dead but the next shared block in LRU order is picked. Figure 2.10 shows an example for dead block detection and resolution in L1 cache. In the example, blocks B and D are private whereas blocks A and C are shared. The LRU pointer points to block D which is private so block D is declared dead but block C is replaced instead. When block E is being lled the block D is again the least recently used and it has already been declared dead by a prior access, but it was given one more opportunity to stay. Since that opportunity window has passed block D is replaced next. A 1-bit ag (dead 31 block ag (DBF)) is used to indicate dead data blocks in L1 cache in order to replace them in subsequent cache blocks ll operations. Figure 2.10: Cache Freezing Issue Example 2.5.3 L1 Cache Bypassing and Performance According to our policy, a shared cache block may bypass L1 cache and is only placed into L2 cache. This approach may seem counter-intuitive since shared blocks may take longer to access. In the following, we explain the reason why our approach improves performance. Better data placement on dierent cache levels increases the eective cache size which means that more cache blocks can be present in the cache at the same time. This approach allows more memory requests to be service by either L1 or L2 cache. Even though our scheme may increase the access latency of accesses that are serviced by L2 cache, it reduces the average access latency and the cumulative miss rate of L1 and L2 caches. Moreover, our analysis shows that more than 44% of memory requests to a shared cache block hits in the MSHR. It means that these accesses comes in a short window between requesting the block from lower level of memory system until the block is serviced. Such behavior makes it possible to provide similar performance without letting shared blocks occupy L1 cache space: the close-by requests could be served directly from MSHR and bypass L1 cache. 
On the other hand, our analysis also shows that nearly 97% of the future accesses to a shared block occur at least 1000 cycles after that block is filled. This observation means that even if we allocate L1 cache space for a shared block, by the time it is accessed again it is likely to have already been replaced during the long intervening window. We note that less than 3% of repeated accesses occur within 1000 cycles of the initial block fill.

Figure 2.11: Hardware Extension Needed for Our Design

2.6 Evaluation

We use GPGPU-Sim v3.2.2 [8] to model and evaluate our data-sharing-aware CTA scheduler and cache management scheme. We use GPUWattch [46] to estimate and compare the power and energy consumption of our design. The simulation configuration is shown in Table 2.1. Note that the sizes of all L1 caches are per core. The baseline machine has exactly the same configuration except that it uses an LRU replacement policy for both the L1 data cache and the L2 unified cache. In addition, it uses a conventional round-robin CTA scheduler to assign CTAs to cores. We selected 20 benchmarks from the Parboil [70], Rodinia [13], and ISPASS-2009 [8] benchmark suites. We measure cache performance by the number of total misses with respect to the number of unique blocks in L2 cache, the number of memory requests serviced by off-chip DRAM, and the execution time of the benchmarks on the GPU. We measure the power consumption of the L2 cache, the interconnection network, and DRAM.

Table 2.1: GPGPU-Sim Configuration for Cache Management Scheme Evaluation.

    Simulation Configuration
    Number of Cores                  15
    Core Configuration               32 SIMT lanes, 1.4GHz, GTO warp scheduler
    L1 Data Cache                    16KB, 4-way assoc, 128B block
    L2 Unified Cache                 768KB total, 128KB/channel, 8-way assoc, 128B block
    Instruction Cache                2KB, 4-way assoc, 128B block, LRU
    Texture Cache                    12KB, 24-way assoc, 128B block, LRU
    Constant Cache                   8KB, 2-way assoc, 64B block, LRU
    Registers/Core                   32768
    Interconnection Configuration    2D mesh, 1.4GHz, 32B channel width
    DRAM Model                       FR-FCFS (32 queue/channel), 6 MCs, Channel BW=8B/cycle, GDDR5
    Timing                           tCL=12, tRP=12, tRC=40, RAS=28, tRCD=12, tRRD=6

2.6.1 CTA Scheduler Evaluation

This section shows the performance of our proposed CTA scheduler. In order to provide a thorough evaluation, we evaluate three different CTA group sizes: 8 (q8), 16 (q16), and gridDim.x (qx). These schedulers are evaluated with and without the workload balancing optimization in which round-robin scheduling is used towards the end of kernel execution. Figure 2.12 shows the normalized execution time and the average number of cores accessing a cache block. In the figure, q8, q16, and qx represent the schedulers with group sizes of 8, 16, and gridDim.x without the round-robin scheduling optimization, whereas q8_adpt, q16_adpt, and qx_adpt represent the schedulers with group sizes of 8, 16, and gridDim.x with the round-robin scheduling optimization at the end, respectively.

Figure 2.12: Effect of CTA Scheduler Queue Size

Figure 2.12 shows that the best execution time is achieved by qx_adpt, while the lowest number of sharers per cache block is achieved by qx, with qx_adpt a close second. The figure also shows that switching to round-robin scheduling when only a few CTA groups are left to execute does improve performance by slightly improving load balance, at the cost of some redundant copying of data. We also implemented the Block CTA Scheduler (BCS) proposed by Lee et al. [45]. BCS schedules a sequential set of CTAs for each SM during kernel initiation and then it schedules CTAs in pairs.
It delays CTA scheduling until there are two available CTA contexts available in the SM and then schedules two CTAs at a time. BCS is similar to round-robin scheduler except that it schedules CTAs at a granularity of 2 CTAs at a time. The execution time and number of sharers data is shown under the label BCS in the gure. Figure 2.13 shows a more detailed comparison between our CTA scheduler and BCS. The gure shows the average improvements along multiple dimensions over all the 20 benchmarks normalized to our baseline GPU. Across a wide range of metrics, such as performance, power, energy, interconnect (ICNT) power, sharing-aware CTA scheduler outperforms BCS. Overall, sharing-aware CTA scheduler achieves better cache block locality and outperforms BCS by 7%. In fact for a majority of the presented metrics BCS is no better than the baseline GPU which uses RR scheduling without any sharing-aware cache management. We also incorporated our cache allocation and replacement policies on top of BCS. This data is presented under the label BCS+mgt. Our allocation and replacement policies do in fact improve the power and performance of baseline BCS scheduler. Thus our cache allocation and replacement policies can also be applied independently on top of other CTA schedulers to improve performance. 35 Figure 2.13: Execution Time, Power and Energy consumption between the proposed CTA sched- uler and BCS with and without cache management technique 2.6.2 Cache Management Scheme Evaluation Our evaluation shows that our new cache management scheme with CTA scheduling group size of (gridDim.x) is able to enhance the performance of both L1 and L2 caches. We evaluate the performance of L1 cache in terms of miss rate and the performance of L2 cache in terms of the misses-per-thousand-instructions (MPKI). Figure 2.14 shows the normalized values of both L1 miss rate and L2 MPKI along with the DRAM trac, normalized to the baseline GPU. The miss rate of L1 data caches has dropped by 13% over baseline. MSHR hits increased by 11% for shared block L1 cache misses which indicates that more warps/threads access the same block within a small time window while the miss is being serviced. MSHR hits reduce the interconnection network bandwidth since the SM does not send multiple requests from dierent CTAs for the same block/address to the lower level cache. L2 cache performance also improved as the number of misses per thousand instruction (MPKI) in L2 cache dropped by 47% compared to the baseline GPU. Also the non-compulsory misses (capacity and con ict misses) in L2 cache are reduced from 81% to 76%. The improved cache performance reduces the o-chip trac by 19% reducing the DRAM energy by 17%. The overall energy of GPU is reduced by 6% and the combined energy 36 Figure 2.14: Normalized L1 Miss Rate, L2 MPKI, and DRAM Trac with qx adpt Scheduler consumed by L2 cache, interconnection network, and DRAM is reduced by 14%. The overall execution time of evaluated benchmarks has improved by 7% over the baseline machine. Figure 2.16 shows the breakdown of the energy consumed by L2 cache, interconnection network and DRAM normalized to the baseline GPU energy consumption. Based on our results, these three components accounted for roughly 35% of the total energy consumed. The gure shows that our scheme is able to reduce the average energy consumed by these components by 10%. 2.6.3 Bypass Evaluation We compare our proposed cache management technique with cache bypassing technique [38]. 
In one prior work[38], cache performance is dynamically monitored to identify memory accesses that show streaming behavior and bypass them. It monitors cache miss rate over sampling periods and then it disable the cache if miss rate is larger than a specic threshold. The same mechanism is applied to L1 and L2 caches. Our analysis shows that our cache management and bypassing policy outperforms the policy proposed in prior work [38] by 17.98%. Although bypassing streaming data may speedup data 37 Figure 2.15: Normalized Energy Consumption of L2 cache, NoC, and DRAM with qx adpt Sched- uler accesses since memory requests are not delayed due to resource constraints (e.g. MSHR full), it does increases the DRAM trac since all memory requests are serviced by DRAM. Our simulations also show that bypassing stream data increases DRAM trac by 2.34% and hence increase its energy by 1.67%. The overall energy is increased by 0.82%. 2.6.4 Kepler Simulations Newer GPU architectures like Kepler and Maxwell have larger L2 cache (2MB in Kepler and Maxwell). To study the impact of larger L2 caches we simulated our design on a Kepler-like conguration. Simulation results show that our design can save up to 8% of DRAM energy, 18% of interconnect energy, 29% of L2 cache energy while enhancing the overall performance by 3%. The DRAM trac is reduced by 10%. Even with the increase in L2 cache size the proposed scheme still sustains most of the bets, since the working set of many workloads still does not t the larger L2 cache. Hence, as long as the dataset grows faster than cache size we expect that the benets of the proposed scheme will continue to grow. Figure 2.17 shows the energy consumption of the L2 cache, interconnection network and the DRAM breakdown compared to the baseline 38 Figure 2.16: Normalized DRAM Power, Energy, and O-chip Trac with qx adpt Scheduler consumption of these components. The shown values are normalized to the the energy of the baseline Kepler GPU conguration. Figure 2.17: Normalized Energy Consumption of L2 cache, NoC, and DRAM with Kepler Con- guration 39 2.7 Related Work 2.7.1 CTA Scheduling and Management To improve the performance of GPUs and their memory subsystems, various warp schedulers have been proposed to reduce memory latency. Cache-conscious warp scheduler (CCWS) [61] and cooperative thread array aware warp scheduler (OWL) [34] are examples of these schedulers. OWL prioritizes warps from a set of CTAs in order to increase the data locality and avoid bank con icts. In CCWS, the warp scheduler controls the number of warps that are allowed to be scheduled to improve the performance of L1 cache. Victim tag array is used to collect the lost locality score which indicates the severity of inter-warp contention. DAWS [62] is a modication to CCWS where it uses cache foot-print prediction. When warps lose locality due to contention, the scheduler suspends some warps. Kayiran et al. [35] proposed controlling the TLP by changing the number of CTAs assigned to a core. When the application shows memory-intensive behavior, the number of CTAs is lowered in order to reduce cache, memory, and network contention. On the other hand, when the application is in a computationally intensive phase, the number of CTAs is increased to exploit more TLP. They also proposed CTA pausing where the warps belonging to the most recently assigned CTA are deprioritized when the optimal number of CTAs per core is changed at runtime. CTA-Core assignment policy is not changed in any of these designs. Lee et al. 
[45] proposed a CTA and warp schedulers that aim to exploit inter-CTA locality. The block CTA scheduler (BCS) assigns a block of sequential CTAs to the same core. After that, it uses another scheduler (lazy CTA scheduler) that determines the optimal number of CTAs per core to boost the performance and it assigns CTAs to SMs in pairs. 2.7.2 Cache Management and Bypassing in GPUs Many papers have discussed and proposed techniques for cache management in GPUs. Chen et al. [15] adopted cache bypassing and selective insertion of cache blocks in order to extend the 40 lifetime of a cache line. Their insertion and bypassing decisions are based on block re-reference prediction for each block. Coordinated bypassing and warp throttling (CBWT) management scheme is proposed in [14]. In this scheme, cache bypassing is used to protect cache hot lines and alleviate cache contention. Each line has a protection distance (PD) to indicate how many accesses this line is protected from and it is used to trigger bypassing. The bypass policy is coordinated with warp throttling to control parallelism and contention. MRPB [32] uses FIFO buers to reorder memory requests before they are sent to L1 caches. It also employs cache bypassing, which is triggered when stalls are caused by lack of resources, to reduce intra-warp contention in L1 caches. Although CBWT combines bypassing and warp throttling to preserve locality, it can work together with MRPB to further improve performance and energy eciency. Li et al. [47] proposed a cache bypassing scheme on top of CCWS called Priority-based Cache Allocation (PCAL), which activates more warps when the NoC is underutilized. Those extra warps are given lower priority in the cache to eliminate cache contention. Keshtegar et al. [37] proposed a cache communication mechanism to reduce average memory access time. The main purpose of this mechanism, implemented in Logical Management Unit (LMU), is to forward the missed requests from L1 cache to neighboring L1 caches before sending that request to lower level cache. Although LMU reduces the trac in the interconnection network, it increases number of accesses in the L1 caches since the source of L1 trac now comes from both the SM itself and the LMU. Sharing tracker proposed by Tarjan et al. [71] oers an eective latency-tolerance mechanism to share cache blocks between multiple private caches. Sharing tracker uses a statistical method to track copies of cache blocks in private caches and tries to service memory requests from private caches before sending the request to next level in the memory hierarchy. The outcome of the tracker can lead to false negative or false positive results since it relies on partial information. The outcome of the sharing tracker is not always accurate; it may return a false positive result since it uses partial tags, and it may also return false negative since it does not keep track of all copies of a cache block (it keeps track of only one copy). 41 2.8 Conclusion To conclude this chapter, we observe that neighboring CTAs within GPU applications share considerable amount of data. Unfortunately, the default GPU scheduling policy reduces the eective L1 cache size by unnecessary data duplication which in turn increases the data movement and power consumption. Thus data sharing across threads, which is usually viewed favorably, becomes a liability. Based on this observation, we propose a holistic GPU cache management scheme to improve eective GPU cache and power eciency. 
The central goal is to reduce data movements and increase eective cache space. We propose a sharing-aware CTA scheduler that attempts to assign CTAs with data sharing to the same SM to improve temporal and spatial locality. We then augment the scheduler with a sharing-aware cache management scheme. The scheme dynamically classies private and shared data and proposes to prioritize storing private data in L1 cache and shared data in L2 cache. Our experimental results show that the proposed scheme reduces the o-chip trac by 19% which translates to an average DRAM power reduction of 10% and performance improvement of 7%. 42 Chapter 3 G-TSC: Timestamp Based Coherence for GPUs 3.1 Introduction The previous chapter focused on data sharing across CTAs. But the shared data is mostly read-only data where the presence of multiple private copies is not going to aect the execution outcome for a program. Hence, the absence of ecient cache coherence support in GPUs does not hinder the porting of such applications to GPUs. As the GPU programming languages, such as OpenCL [57] and NVIDIA CUDA [18], enhance the capabilities of GPUs these devices are already used extensively for executing general purpose applications with regular parallelism. Prior study has argued that GPUs can also accelerate applications with irregular parallelism [28]. But porting an irregular parallel application to GPUs is currently hobbled by the lack of ecient hardware cache coherence support. If hardware cache coherence is provided on GPUs, it would enable ecient porting of a broad range of parallel applications. Cache coherence can be used as a building block to design memory consistency models and enable a programmer to reason about possible memory ordering when threads interact. At the architecture level, most of the GPUs currently achieve cache coherence by disabling private caches and relying on the lower-level shared cache. For instance, private caches are dis- abled in NVIDIA GPUs while AMD GPUs support coherent instructions that perform memory 43 operations at the shared L2 cache and allow the software to ush the private cache at anytime [9]. Obviously, such approaches provide coherence but at the cost of performance loss stemming from disabling caches. With an ideal coherence mechanism, GPU applications that requires cache coherence can achieve up to 88% performance improvement over disabling L1 cache [67]. However adapting traditional CPU coherence protocols, both snooping-based or directory- based protocols [21] for GPUs is fraught with a variety of challenges. As has been shown in prior work [67], conventional invalidation-based coherence protocols designed for multiprocessors (e.g. directory-based or snoopy protocol) are ill-suited for GPUs. They incur extensive coherence trac and large storage overhead. The trac overhead incurred by the invalidation-based protocols is due to unnecessary rells for write-once data which is a common access pattern in GPUs. Additionally, invalidation-based protocols incur the recall trac, when all L1 copies need to be invalidated upon L2 invalidation, directory eviction or false sharing. The recall trac issue is aggravated by the cache inclusion requirement of these protocols since the size of the shared L2 cache must match the aggregate private L1 cache size [17, 55, 67]. Selective coherence, which disable caching specic data in private caches, requires additional hardware support and code modications [36]. 
The storage overhead of the invalidation-based protocols is mostly due to the need to track outgoing in- ight coherence transactions and incoming coherence requests. If we reserve sucient storage to handle the worse case scenario,an on-chip buer as large as 28% of the total GPU L2 cache is needed [21]. Throttling the network trac by some ow-control mechanism is necessary to reducing the storage requirement [48]. Although inclusive cache coherence protocol is attractive choice for low-complexity coherence protocol, it aects the cache performance and the interconnection trac bandwidth. On the other hand, non-inclusive implementation of these protocols requires large directories even though they can reduce the recall trac [7]. To tackle these ineciencies with traditional CPU-centric coherence protocols, Temporal Co- herence (TC) has been proposed for GPUs [67]. TC relies on self-invalidation of expired blocks in the private cache to eliminate coherence trac due to invalidation requests. TC is inspired by 44 Library Cache Coherence (LCC) [65], a time-based hardware coherence protocol that uses global synchronized counters to track the validity of cache blocks at dierent levels in the cache hierarchy and delays updates to unexpired blocks until all private copies are self-invalidated. Unfortunately, TC suers from several drawbacks. As we will describe in detail shortly, rst, the use of global synchronized counters in TC to implement coherence raises an issue about the scalability. With the rapid growth in chip size and the increase in clock speed, the global counters can suer from clock skewness and wiring delay that may aect the correctness of the protocol [78]. Second, delayed updates due to unexpired cached copies result in execution stalls that do not happen in conventional cache coherence protocol. When an update is delayed, all subsequent reads are delayed until the update is performed. Preserving all cache blocks that are unexpired in L2 cache may cause unnecessary cache stalls due to higher hardware resource contention. Third, in TC, the performance can be sensitive to the lease period; a suitable lease period is not always easy to select/predict. Tardis is a new CPU coherence protocol based on timestamp ordering [77]. It uses a combi- nation of physical time and logical time to order memory operations. The key dierence between Tardis and TC is that Tardis enforces global memory order by logical time rather than physical time. The timestamp based approach can largely eliminate the drawbacks of TC. While Tardis was explored in the context of CPU its applicability to a GPU's unique architecture and execution model are unknown. In this chapter, we propose G-TSC, a timestamp-based cache coherence protocol for GPUs, inspired by the Tardis. We analyze the unique challenges in adopting the logical timestamp or- dering approach to the highly threaded GPUs and then present and evaluate solutions. These challenges include controlling the accessibility of the updated data within a streaming multipro- cessor (SM), managing the replicated requests from warps in the same SM, and relaxation of the cache inclusion requirement in order to increase the eective cache size. 45 We show how to resolve these challenges in the presence of a large number of concurrent threads in a single SM that can generate a huge number of memory requests in a short time window, and in the absence of the write buers which are traditionally used to facilitate these interactions in CPUs. 
We specify the complete operations of G-TSC based on a general GPU memory hierarchy. We consider the implementation of both Release Consistency (RC) and Sequential Consistency (SC) based on G-TSC. We implemented G-TSC in GPGU-Sim [8] and used twelve benchmarks in the evaluation. When using G-TSC to keep coherence between private caches and the shared cache, G-TSC outperforms TC by 38% with release consistency. Moreover, even G-TSC with sequential con- sistency outperforms TC with release consistency by 26% for benchmarks that require coherence. The memory trac is reduced by 20% for memory intensive benchmarks. 3.2 Background and Motivation 3.2.1 Time-based Coherence Temporal coherence (TC) [67] is a time-based cache coherence protocol designed for GPUs. TC uses time-based self-invalidation to reduce the coherence trac. Like other time-based coherence protocols [65], TC assumes that single chip systems can implement globally synchronized counters. In TC, each cache block in private caches is assigned a lease, which indicates the time period that the block can be accessed in the private cache. At the end of the lease expiration the block is considered invalid, even if it is not evicted. A set of synchronized counters are used to count the lease period. A read access to a cache block in L1 cache checks both the tag and expiration time of its lease. A valid tag match but expired lease is considered as a coherence miss, because the block is already self-invalidated. L2 cache keeps track of the expiration time of each cache block. When L2 cache receives a read request, it updates the expiration time of the block's lease, so that 46 the new request could access it. A write request is sent directly to the L2 cache where it can be performed only when the leases of all private copies of the block expire. TC also implements a version that relaxes the write atomicity (TC-Weak) which eliminates write stall and postpones any possible stall to explicit memory synchronization operation (memory fence). Write acknowledgment in TC-Weak returns the time at which the write will become visible to all other SMs. These times are tracked by Global Write Completion Time (GWCT) counters for each warp. A memory fence operation uses GWCT to stall warps until all previous writes by that warps are globally visible. While TC solves some of the challenges in providing coherence to GPUs it suers from several implementation related challenges. 3.2.1.1 Globally Synchronized Clock TC uses globally synchronized counters to drive coherence decisions (e.g. self-invalidation) and avoid coherence trac. Each private cache and shared cache partition maintain its own synchro- nized counter and all counters are clocked by an independent clock. Relying on synchronized counters in all private and shared caches to make coherence decisions raises scalability concerns. With the growth in GPU chip size and increase of the clock speed, the signal used to clock the synchronized counters can suer from clock skew and may also lead to extra power consumption for the synchronized clock tree. The clock skew can be aggravated by the increase of clock speed and die area [78], which will in turn aect the correctness of the protocol. 3.2.1.2 Cache Inclusion Current GPUs do not enforce cache inclusion. TC relies on L2 cache to maintain the lease term of each private L1 cache copy. This approach forces L2 to be inclusive cache. In the absence of cache inclusion one approach to maintain the lease information is to maintain the lease terms in memory. 
But adding lease information to memory at the granularity of a cache block is prohibitively expensive in terms of area. One option to reduce the area cost is to maintain lease expiry information at a coarser granularity, say at the page level, rather than at the cache block granularity. However, a coarse-grained lease counter must be updated to the latest lease expiry time of any cache block within that larger region. Hence, the lease validity times may be unnecessarily extended for all cache blocks in that coarse-grained region. The consequence is that when the original block is fetched back, the counter (which may have been modified by later evictions) can stall writes to that cache block for an unnecessarily long period. To avoid these drawbacks TC assumes an inclusive cache, which reduces the effective cache size and can eventually affect cache performance. It is also incompatible with common assumptions about GPU caches [6, 14], because inclusion is normally not enforced.

3.2.1.3 Lease-Induced Stall and Contention

In TC, when the lease of a cache block has not expired, writes to that block in L2 need to be delayed until the lease expires. When a write is delayed, all subsequent reads are delayed until the write is performed. The waiting reads then increase the occupancy of the input queue of the shared cache. Delayed eviction in L2 caches (due to the inclusion requirement discussed above) can cause a similar problem. A cache block with an unexpired lease forces the replacement policy to choose a different victim cache line. If all cache lines in a set have unexpired leases, then the replacement process also stalls. Stalls in the L2 cache can affect the capability of the GPU to exploit memory-level parallelism, which is critical to hiding memory latency.

3.3 G-TSC: GPU Cache Coherence Using Timestamp Ordering

3.3.1 Timestamp Ordering

The fundamental reason that TC suffers from the various drawbacks is that writes need to wait for unexpired leases. We argue that it is possible to achieve all the benefits of TC without introducing stalls or weakening the semantics. The key to achieving these desirable properties is timestamp ordering. Timestamp ordering is a combination of timestamps and physical time used to define the order of memory operations. It is formulated as

$Op_1 \rightarrow Op_2 \Rightarrow (Op_1 <_{ts} Op_2) \text{ or } (Op_1 =_{ts} Op_2 \text{ and } Op_1 <_{time} Op_2)$

where $Op_1$ and $Op_2$ are memory operations (load or store), $\rightarrow$ indicates the order of memory operations, $<_{time}$ means that the operation on the left happened before the operation on the right in physical time, and $<_{ts}$ means that the operation on the left has a smaller timestamp than the operation on the right. When the timestamps of two memory operations are the same, physical time is used to order them. This is different from the time-based ordering used by TC, which always uses physical time to order global memory operations:

$Op_1 \rightarrow_{mem} Op_2 \Rightarrow Op_1 \rightarrow_{time} Op_2$

where $\rightarrow_{mem}$ indicates global memory ordering while $\rightarrow_{time}$ indicates the order in which the operations execute in physical time. In timestamp ordering, physical time is only used to order memory operations from the same thread. The key property of timestamp ordering is the ability to logically schedule an operation in the future by assigning it a larger timestamp. This largely eliminates the lease-induced stalls in TC, as a write can be performed long before the read lease expires while logically still happening after the read.
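As a minimal illustration of this ordering rule (a software sketch with hypothetical type names, not the hardware implementation), the comparison below orders two operations first by their logical timestamps and uses physical time only to break ties:

    #include <cstdint>

    // Hypothetical representation of a memory operation under timestamp ordering:
    // 'ts' is the logical timestamp assigned by the protocol and 'time' is the
    // physical time at which the operation was issued.
    struct MemOp {
        uint64_t ts;
        uint64_t time;
    };

    // Returns true when op1 is globally ordered before op2 according to the rule
    // above: smaller timestamp first, physical time only breaks ties.
    inline bool orderedBefore(const MemOp &op1, const MemOp &op2) {
        return (op1.ts < op2.ts) || (op1.ts == op2.ts && op1.time < op2.time);
    }

Under this rule, a store that is performed early in physical time can still be logically ordered after an outstanding read simply by receiving a larger timestamp, which is what removes the lease-induced write stalls described above.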
Tardis [77] is a previously proposed coherence protocol for CPUs that uses timestamp ordering. In this work, we build on Tardis and design timestamp coherence for GPUs, called G-TSC. 49 3.3.2 Timestamps in GPUs In G-TSC, each cache block (C) in the private and shared caches is associated with two times- tamps: a read timestamp (C:rts) and a write timestamp (C:wts). The timestamps are kept as logical counters. C.wts represents the timestamp of the store operation that produced the data in C. C.rts represents the timestamp through which the data could be correctly read from C, and after this time stamp expiry, the data could be modied by another thread. Conceptually, the period between C.wts and C.rts is a read-only period in which the data in C is guaranteed to be valid for the local threads in the SM. We call this period as the lease. Each private cache keeps a warp timestamp table (warp ts), where warp i's timestamp is recorded as warp ts i . The timestamp of each warp represents the conceptual timestamp of the last memory operation performed by that warp.The shared cache keeps a memory timestamp (mem ts). mem ts keeps track of the maximum rts of all cache blocks evicted from the shared cache. Memory operations can be conceptually ordered using timestamps. It is denoted as OP ts which can be LD ts (for load) or ST ts (for store). All mem ts and warp ts are initially set to 1. wts and rts are set to (mem ts) and (mem ts +lease) when the data is fetched from DRAM. 3.3.3 Principles of G-TSC G-TSC constructs a concurrent system with timestamp ordering such that the load value and write order are consistent with the timestamp order. For example, consider a load[A] and a store[A] (that produces a value 1), assuming that the initial value at A is 0. If load ts = 10 and store ts = 8, then the load must return 1, because it logically happens after the store according to the timestamp, even if according to physical time, the load is issued from a warp earlier. If load ts = 8 and store ts = 10, then the load must return 0. In essence, G-TSC attempts to 50 (a) The FSM Actions in L1 Cache. (b) The FSM Actions in the shared L2 Cache. Figure 3.1: The Finite State Machine of both L1 and L2 Caches. The prex Pr denotes the messages received from the SM, DRAM denotes the messages received from the DRAM and Bus denotes the messages exchanged with the NoC. assign the timestamp to each memory operation, so that the returned values are consistent with the assignments. Without con icting memory operations from dierent warps, each warp monotonically in- creases its own warp ts and assigns it to each memory operation issued. However, this "default" assignment may not t into the current state of the system. In order to satisfy coherence, the protocol continuously adjusts the assignment to memory operations (LD ts andST ts ) andwarp ts as we describe is details in the next section. 3.4 G-TSC Implementation In this section, we discuss the implementation of G-TSC. Our protocol is specied by: 1) The operations in private L1 after receiving the requests from the SM; 2) The operations in shared L2 cache. 3) The operations in private L1 after receiving the response from shared L2; 3.4.1 Private Cache Operation Figure 3.1a shows the nite state machine of the L1 cache and its transitions. We will explain these states and transitions in the following sections. 
Note that PrRd and PrWr are generated by the SM (similar to processor read and processor write in traditional CPU coherence transition 51 Figure 3.2: The Flowchart of the Load Request From SM diagrams), BusRd and BusWr are generated by the L1 cache, and BusFill, BusWrAck, and BusRnw are generated by the L2 cache (and delivered through the interconnection network). 3.4.1.1 Load Figure 3.2 shows the owchart of a load request processing in L1 cache. When a load address has a tag match in cache then the cache line where the tag match occurred is represented by C, and C.wts and C.rts represent respectively, the write timestamp of the data in that cache line, and the read timestamp assigned when that cache line was fetched previously. Note that the read timestamp is typically assigned by the L2 cache when it supplies to the data to the L1 cache. The load access is then represented by a tuple <C;C:DATA;C:wts;C:rts>. An access to a cache block in L1 cache results in a hit if it fullls two conditions: 1) pass the tag check, and 2) the warp ts i is less than or equal to C:rts, where warp ts i is the timestamp of the warp that generated the load operation. An access that fullls both conditions results in a 52 Figure 3.3: The Flowchart of the Store Request From SM hit and it may update the warp ts i to Max(warp ts i ;C:wts). If the access fails to fulll any of these conditions, a read request <BusRd;BusRd:wts;BusRd:warp ts> is sent to L2 cache. The value of BusRd:wts is set to 0 if the requests failed in the tag check, otherwise it is set to C:wts if there is a tag match but its lease has expired. The value of BusRd:warp ts is set to warp ts i . 3.4.1.2 Store Figure 3.3 shows the owchart of a store request generated by the SM and issued to the L1 cache. Since L1 cache is a write-though cache, all store requests (PrWr) are processed in the L2 cache. First, if the address hits in the L1 cache the L1 cache block data is updated, but all further accesses to that cache line from the SM are blocked (further elaboration in section 3.5.1). After that, a write request <BusWr;BusWr:warp ts;BusWr:Data> is sent to the L2 cache where BusWr:warp ts is set to warp ts i and BusWr:Data holds the store data. 53 Figure 3.4: The Flowchart of the Read Request Sent to L2 from L1 Cache 3.4.2 Shared Cache Operation Figure 3.1b shows the nite state machine of the L2 cache and its transitions. We will explain these states and transitions in the following sections. Note that BusRd and BusWr are generated by L1 cache and received through the interconnection network, DRAMFill is generated by the DRAM, DRAMRd is generated by the L2 cache and sent to the DRAM, and BusFill, BusRnw, and BusWrAck are generated by the L2 cache and sent to L1 cache through the interconnection network. 3.4.2.1 Loads from L1 The owchart of processing a read request in shared cache is shown in gure 3.4. If the read address hits in L2 cache block, then the wts in the request (BusRd:wts) is checked against the wts in the cache block and if they match then a renewal response is sent back the requester with an updated rts. This is the case when data has not been updated in L2 after the last write that is seen by the private L1. But the lease in L1 has expired and it simply needs to be renewed. 54 Figure 3.5: The Flowchart of the Write Request Sent to L2 from L1 Cache If the BusRd:wts does not match wts in the cache block, it implies that the data is in fact updated by another SM after the requesting SM's lease has expired. 
Hence, a ll response is sent to the requester including the new data, the wts of the new data, and updated rts as shown in the ow chart. 3.4.2.2 Stores from L1 The processing of a write request (BusWr) sent to L2 cache from L1 cache is described in gure 3.5. The wts of the new data is calculated based on the stored value of rts and the received value ofwarp ts as shown in the ow chart. After calculating the value of wts, the new value of the rts is also calculated and both timestamps are sent back to the requester with the acknowledgment response. 55 Figure 3.6: The Flowchart of DRAM Fill and Eviction Finally when the L2 receives a request (either BusRd or BusWr) for a cache block that is not present in the cache, both load and store will trigger a read request (DRAMRd) sent to the GDDR DRAM. 3.4.3 DRAM Operation Figure 3.6 shows how the shared cache handles the DRAM lls and block evictions. When a block is lled from DRAM, C:wts and C:rts are set based on mem ts and mem ts +lease respectively. On the other hand, when a cache block is evicted from L2, mem ts needs to record the evicted block's expiration time, so that when later the block is fetched back, L2 could assign timestamps to future stores correctly. Upon eviction, the value of mem ts is set to Max(mem ts o ;C e :rts) where mem ts o is the original value of mem ts and C e :rts is the rts of the evicted cache block. Thus all evicted blocks share a single time stamp, which is the latest time stamp of the evicted block. As we mentioned, even though all evicted blocks share the same mem ts, it is not an issue for G-TSC, because the timestamp ordering could always logically order stores to a point in future without stall. 56 (a) The Flowchart of Re- newal Response from L2. (b) The Flowchart of Write Acknowledgment from L2. Figure 3.7: Flowcharts of Private Cache Operation. Figure 3.8: The Flowchart of Fill Response from LLC 57 3.4.4 Private Cache Operation After Response from Shared Cache Figures 3.7 and 3.8 show how the private cache handles the responses from shared cache. The private cache receives a renewal response <BusRnw;BusRnw:rts> when it already has the up- dated version of the data. In this case, it extends the current lease of the block to the rts value in the response. A write acknowledgment <BusWrAck, BusWrAck.rts, BusWrAck.wts> means that a store operation is completed and a new values of wts and rts has been assigned. Hence, the private cache needs to update its local information and unlock the block so other warps can access it. A ll response <BusFill, BusFill.wts, BusFill.rts, BusFill.Data> can either ll a new block or update an existing block with new data. The private cache should probe the tag array to get the older version of the block, or allocate a new cache block for the incoming block by using the replacement algorithm. The data, rts and wts are copied from the response to the cache block allocated (Figure 3.8). 3.4.5 Example of G-TSCOperation We will explain the operations of G-TSC with an example. Assume two warps are being executed in two dierent SMs where the rst one reads some memory location [X], writes to another memory location [Y ] and then reads the [X] again. The other warp reads [Y ], writes to [X], and then reads [Y ] again. The sequence of instructions for both warps are shown in gure 3.9a. For the sake of this example, we will assume that there is only one warp executed in each SM. The execution sequence for all instructions is shown in gure 3.9b. 
3.4.5 Example of G-TSC Operation

Figure 3.9: G-TSC Operation Example. The contents of the caches of each SM are shown, with the wts and rts of each block in parentheses.

We will explain the operation of G-TSC with an example. Assume two warps are being executed in two different SMs: the first reads some memory location [X], writes to another memory location [Y], and then reads [X] again. The other warp reads [Y], writes to [X], and then reads [Y] again. The sequence of instructions for both warps is shown in Figure 3.9a. For the sake of this example, we assume that there is only one warp executing in each SM. The execution sequence for all instructions is shown in Figure 3.9b.

The read operation (A1) that tries to read location [X] misses in the L1 cache, and hence the read request is sent to the lower-level cache (step 1). The request contains the address (addr = X), the warp timestamp (warp_ts = 1), and the write timestamp (wts = 0), which is set to zero since the block is not present in the L1 cache. The block is fetched from main memory and placed in the L2 cache (step 2), and is then sent to the L1 cache with a lease period of [1,6], as shown in step 3. Instruction (B1), which reads address [Y], follows the same steps, as shown in steps 4, 5, and 6. We assume a longer lease period for Y for the sake of explanation; the protocol works with any lease value. When SM0 executes the write instruction (A2), the write operation should be performed at the shared cache. Hence the write is sent to the L2 cache with the warp timestamp (warp_ts = 1) in step 7. Based on the information in the L2 cache, the system knows that the block is valid in some private cache until timestamp 11 (the SM1 cache in this case), so it assigns a write timestamp after that lease period (ST_ts = 12) in step 8 and sends an acknowledgment to the L1 cache with the new lease period (wts = 12, rts = 22) in step 9. The timestamp of warp 1, which issued the write operation, is adjusted to 12 to match the timestamp of the write operation (step 9). Instruction (B2) follows the same steps, shown in steps 10, 11, and 12. After that, SM0 tries to execute instruction (A3) and read X. Even though X is present in the cache, the timestamp of the reading warp (warp_ts = 12) is beyond the lease of address X ([1,6]) (step 13). So a renewal request is sent to the L2 cache containing the write timestamp of X (wts = 1) along with the timestamp of the reading warp (warp_ts = 12) (step 13). The L2 cache checks the write timestamp in the renewal request against the actual write timestamp of the address in the cache (wts = 7). Since they do not match, it is clear that a new write has occurred after the last value of X was seen by SM0. The L2 cache then sets the new lease of X to 15, which is greater than the timestamp of the reading warp (step 14), thereby giving the reading warp an opportunity to read the data. The new data and extended lease period are sent to the requester's L1 cache (step 15). When instruction (B3) tries to read Y in SM1, it hits in the cache (step 16). Note that the timestamp of the reading warp (warp_ts = 7) falls within the lease period ([1,11]), hence the read is performed. Based on the timestamp ordering, the order of the executed instructions in this example is (A1→B1→B2→B3→A2→A3). Table 3.1 shows the contents of the different messages exchanged in G-TSC.

Table 3.1: Contents of Requests and Responses Exchanged Between Private and Shared Caches.

  Message Type                       rts   wts   warp_ts   data
  Read/Renewal Request (BusRd)             ✓     ✓
  Write Request (BusWr)                          ✓         ✓
  Fill Response (BusFill)            ✓     ✓               ✓
  Renewal Response (BusRnw)          ✓
  Write Acknowledgment (BusWrAck)    ✓     ✓

3.5 GPU-Related Considerations

The above state transition description shows how logical-time-based coherence can be applied within the context of GPUs. In this section we discuss GPU-specific considerations that need to be addressed for achieving good performance.

3.5.1 Update Visibility

The L1 cache in GPUs is shared between thousands of threads, and to ensure correctness, an updated data block should not be accessible by other threads until the store is completed and acknowledged.
With timestamp ordering, a store operation is not completed until its timestamp is determined. Figure 3.10 illustrates this issue with an example. In this example, we show how poor management of the updated data can affect the correctness of the coherence protocol and cause coherence violations. Initially, the cache block A has a lease period [1,5] ([A.wts, A.rts]). In step (2), warp 1 attempts to write A. According to the validity information available in the private cache, the timestamp of the store operation (ST_ts) is set to 6, and the warp timestamp (warp_ts) and write timestamp (wts) are updated accordingly. The write is sent to L2, and L1 waits for the acknowledgment. Before the acknowledgment, which will contain the lease that L2 assigns to the new data, both A.wts and A.rts are set to 6. At this point, L1 only knows that the start of the lease will be at least 6. In step (3), warp 2 with warp_ts = 1 reads A, and its own warp_ts is updated to 6, meaning that the timestamp of the load operation is set to 6. In step (4), the acknowledgment from L2 for the store operation from warp 1 arrives, and the assigned lease is [11,20]. The start of the lease is greater than 6, and the lease of A in L1 is updated to [11,20]. At this point, we can see that the timestamp of the read from warp 2 in step (3) is 6, which is less than the lease of warp 1's write in the global order ([11,20]). It means that the write is performed at logical time 11, but warp 2 already observed it at an earlier logical time, 6. Essentially, a read observes a value that will be produced at a later logical time, which is a violation of coherence.

Intuitively, there are two ways to resolve this problem: 1) delay all accesses to the updated data until the store operation is globally performed and acknowledged; or 2) keep the old copy along with the new one and allow accessing the old copy until the store is globally performed. For 1), an MSHR entry is allocated for read requests as if they were read misses, and they are granted access to the data as soon as the store is acknowledged. At this point, the timestamp is determined and the warp_ts of the reading warp is updated accordingly. For 2), a hardware structure is needed to hold the old data while the store is pending. Moreover, we also need to ensure that the writing warp can only read the new data that it generates after the write has been globally performed. Note that this particular problem is not an issue for Simultaneous Multithreading (SMT) [72] processors with a conventional coherence protocol and write atomicity. This is because, before the write is globally performed, the new value is in the processor's write buffer and the old value is in the L1 cache. The other threads in the same processor can bypass the write buffer and directly obtain the old data from the L1 cache, which ensures write atomicity. If write atomicity is not supported, the threads in the same processor could read the new values from the write buffer. However, a conventional protocol never allows a read to observe a new value before it is produced (as opposed to the example in Figure 3.10).
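Returning to the GPU setting, a minimal sketch of the first option (treating reads to a block with a pending store like misses that wait in an MSHR-like entry) is shown below. The structure and function names are assumptions for illustration, not the thesis hardware.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

// Hedged sketch of option 1: while a store to a block is outstanding, reads to
// that block are parked instead of being allowed to observe the new data early.
struct PendingLine {
    std::deque<uint32_t> waiting_warps;   // warps whose loads are deferred
};

struct L1UpdateVisibility {
    // keyed by block address; an entry exists iff a store to it is unacknowledged
    std::unordered_map<uint64_t, PendingLine> pending;

    // Called when a PrWr is forwarded to L2: lock the block.
    void on_store_issued(uint64_t block_addr) { pending[block_addr]; }

    // Called on a PrRd that tag-matches a locked block: defer it like a miss.
    bool try_load(uint64_t block_addr, uint32_t warp_id) {
        auto it = pending.find(block_addr);
        if (it == pending.end()) return true;           // no pending store: proceed
        it->second.waiting_warps.push_back(warp_id);    // park in the MSHR-like entry
        return false;                                   // the load is stalled for now
    }

    // Called on BusWrAck: the store's timestamp is known, so waiters may resume
    // and pick up the acknowledged wts as their new warp timestamp.
    std::deque<uint32_t> on_write_ack(uint64_t block_addr) {
        std::deque<uint32_t> waiters;
        auto it = pending.find(block_addr);
        if (it != pending.end()) {
            waiters.swap(it->second.waiting_warps);
            pending.erase(it);
        }
        return waiters;
    }
};
```

When the acknowledgment arrives, the released warps adopt the newly assigned wts, so their loads are correctly ordered after the store in logical time.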
Using a write buffer in GPUs increases the hardware complexity of the load/store unit and has a high area overhead. A single store instruction generates 2-4 memory write operations on average. With 48-64 concurrent warps executing the same code, those warps are expected to hit the same store instruction within a small time window, meaning that the write buffer would need to deal with ~200 outstanding write requests per store instruction. In this work, we evaluated both approaches. Different from TC, we found that option 1 gives the better trade-off in GPUs. The performance overhead of delaying accesses to an updated block is negligible, so we do not need to pay the hardware cost of keeping multiple copies.

Figure 3.10: Example of Update Visibility Challenge in GPUs

3.5.2 Request Combining in GPUs

The second challenge is the validity of the data serviced by the L2 cache when it is requested by multiple threads. When multiple read requests from different warps with different warp_ts in the same SM try to access a cache block that is not present in the L1 cache, these requests can either all be forwarded to the L2 cache, or only the first request is forwarded, with the hope that the other warp_ts values will fall within the lease and be able to access the data. The two options represent a trade-off between coherence traffic and performance. Forwarding all requests to the L2 cache increases the traffic but ensures that the requests are serviced as soon as the responses return from the L2 cache. Forwarding only the first request and keeping the remaining requests in the MSHR preserves bandwidth but may increase the latency of some requests if the allocated lease window cannot cover their warp_ts, incurring additional renewals. This issue is significant in GPUs, since the NoC bandwidth is one of the performance bottlenecks, as shown in [8]. The choice between forwarding all requests and keeping them in the MSHR has a significant impact on performance, since the latency of the NoC increases with the memory traffic generated by the SMs [40]. Forwarding all requests to the L2 cache can increase the number of memory requests sent by the SMs by 12% to 35% on average.

Figure 3.11: Example of Multiple Requests Challenge in GPUs

Consider the example in Figure 3.11. In step (2), a read request is sent to L2 with the warp_ts of warp 1. In step (3), warps 2 and 3 try to read the same block. Assuming we only send one request, they do not generate extra messages from L1 to L2. Later, in step (4), the response gives L1 the lease window [1,5], and warp 1's request is removed from the MSHR. Unfortunately, the lease is not sufficient for the other two requests, so we need to send a renewal request for them and they remain in L1's MSHR; see step (5). In our approach we chose to keep the requests in the MSHR and then send a renewal request in case the lease term expires before a waiting request can read the data. Even when extra renewal requests are sent, we still end up saving some bandwidth, because a renewal request generally has a smaller data response packet: the response from L2 contains only the renewed lease information when no store has been performed in the interim.
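A sketch of this combining policy (only the first miss to a block sends a BusRd, later warps wait in the MSHR, and at most one renewal is sent for warps whose timestamps the returned lease does not cover) follows. The structures and names are assumptions for illustration, not the GPGPU-Sim MSHR implementation.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hedged sketch of request combining with lease-aware release on fill.
struct MshrEntry {
    std::vector<uint64_t> waiting_warp_ts;  // warp_ts of each merged request
    bool request_sent = false;
};

struct L1Mshr {
    std::unordered_map<uint64_t, MshrEntry> entries;  // keyed by block address

    // Returns true if a BusRd must be sent; false if the miss was merged.
    bool on_miss(uint64_t block_addr, uint64_t warp_ts) {
        MshrEntry& e = entries[block_addr];
        e.waiting_warp_ts.push_back(warp_ts);
        if (e.request_sent) return false;   // combined with an earlier miss
        e.request_sent = true;
        return true;
    }

    // On BusFill with lease end 'rts': release covered warps; return true if a
    // single renewal BusRd is still needed for the uncovered ones.
    bool on_fill(uint64_t block_addr, uint64_t rts) {
        auto it = entries.find(block_addr);
        if (it == entries.end()) return false;
        std::vector<uint64_t> still_waiting;
        for (uint64_t ts : it->second.waiting_warp_ts)
            if (ts > rts) still_waiting.push_back(ts);  // lease too short for this warp
        if (still_waiting.empty()) { entries.erase(it); return false; }
        it->second.waiting_warp_ts = std::move(still_waiting);
        return true;                                    // caller sends one renewal request
    }
};
```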
3.5.3 Non-Inclusive Caches in GPUs

As discussed in Section 3.2.1.2, TC has to force inclusion and incur delayed eviction. In timestamp ordering, it is possible to maintain only one timestamp in memory for the evicted blocks without introducing unnecessary stalls, since timestamp ordering makes it possible to logically schedule an operation to happen in the future by assigning a larger timestamp. Therefore, even if the timestamp in memory has been increased by other evictions, a conflicting store can execute without a stall by being assigned a timestamp greater than the single coarse-grain timestamp stored in memory. Using timestamp ordering, we can support a non-inclusive policy, which is compatible with current GPUs, while avoiding delayed eviction.

3.5.4 Timestamp Overflows

The experiments based on our benchmarks show that a 16-bit timestamp is enough for all executions to make timestamp counter wrap-around sufficiently rare. Note that the L1 cache is flushed after each kernel and all timestamps are reset. In case of timestamp wrap-around, the timestamp overflow mechanism can be handled at the L2 cache. The overflow can occur due to a lease extension or due to assigning a timestamp to a new store operation (these are the only operations that increase the timestamp). The timestamps at the L1 caches are a reflection of the timestamps assigned by the L2 cache (the L1 cache does not increment the timestamps by itself). When a timestamp update causes an overflow, the L2 cache bank sends a reset signal to all L2 cache banks and then resets its timestamps. Upon resetting the timestamps in an L2 cache bank, the write timestamp of all blocks is set to 1, the read timestamp is set to the lease value, and the memory timestamp is set to 1. Since the L2 cache has the up-to-date data of all blocks, there is no need to flush the cache. After resetting all timestamps, the L2 cache responds to every request that has a large timestamp value with a fill response along with the data, even if the request is for a renewal. It also includes a timestamp reset signal with the response to inform the L1 cache that the timestamps have been reset. When the L1 cache receives a response with a reset message, it flushes its blocks, resets the warp timestamps, and then accesses the new data. For comparison, TC uses a 32-bit local timestamp for each L1 cacheline, a 32-bit global timestamp for each L2 cacheline, a 32-bit entry per warp in the GWCT table, and a 32-bit counter for each L1 and L2 cache.

3.6 Evaluation and Discussion

3.6.1 Evaluation Setup

We implemented G-TSC in GPGPU-Sim version 3.2.2 [8]. We used GPUWattch [46] to estimate the power and energy consumption. The simulated GPU has 16 SMs, each with 48KB of shared memory and a 16KB L1 cache. Each SM can run at most 48 warps with 32 threads/warp. The L2 cache is divided into 8 partitions of 128KB each (1MB overall). The configuration of the GPU used in the evaluation is summarized in Table 3.3. In our evaluation, we used two sets of benchmarks: the first set requires cache coherence for correctness, and the other does not. The second set of benchmarks is used to show the impact of the coherence protocol on them due to the protocol overheads. The benchmarks used in this work are listed in Table 3.4. The performance of G-TSC is compared against TC. We implemented TC on the GPGPU-Sim simulator, and all the results presented in this work are based on our implementation of TC on GPGPU-Sim. To validate that our implementation of TC closely matches the original implementation, we also ran TC on the same benchmarks with the same configuration settings using the original simulator used in the TC paper [67]. Table 3.2 shows the execution time of TC on our G-TSC simulation infrastructure (column four) and the execution time of TC on the original simulator (column five). As can be seen, the two simulators provide very similar execution times.
The few differences that are present may be attributed to the fact that the original TC used Ruby [49] to implement its cache and memory system, while we enhanced the default memory system implemented in GPGPU-Sim for implementing the G-TSC memory system. We also simulated the baseline (BL) configuration, which essentially turns off the private cache to provide coherence, both on the original TC simulator and on our G-TSC simulator. Table 3.2 shows the execution time of BL on our G-TSC simulation infrastructure (column two) and the execution time of BL on the original TC simulator (column three). The baseline execution times differ in the two models. We believe that the difference stems from how the two simulators implement the no-L1-cache design. G-TSC implements BL by essentially sending all requests directly to the L2 cache over the NoC and assumes that there are no L1 cache tags to be checked or L1 cache MSHRs to be updated. Hence, any relative performance improvements over the baseline model reported in the original TC paper and in our work may be different. From here on we report all results relative to our baseline implementation on our simulation infrastructure. We implemented G-TSC and TC with both the SC and RC memory models.

Table 3.2: Absolute Execution Cycles of TC and Baseline (BL) in Millions.

  Benchmark   BL (G-TSC simulator)   BL [67]   TC (G-TSC simulator)   TC [67]
  BH          0.55                   1.26      0.84                   1.03
  CC          1.47                   2.99      1.77                   1.75
  DLP         1.63                   5.53      1.63                   1.44
  VPR         0.85                   1.98      0.90                   0.77
  STN         2.00                   4.66      1.74                   1.62
  BFS         0.79                   1.95      2.32                   1.87
  CCP         13.50                  13.59     13.50                  13.47
  GE          2.22                   4.89      2.49                   3.51
  HS          0.22                   0.22      0.23                   0.23
  KM          28.74                  30.89     30.78                  34.17
  BP          0.84                   1.61      0.69                   0.58
  SGM         6.08                   5.74      6.14                   5.91

Table 3.3: GPGPU-Sim Configuration for G-TSC Evaluation.

  GPU SM Count        16
  Core Config.        48 warps/SM, 32 threads/warp, 1.4 GHz
  Warp Scheduler      Loose Round Robin
  Shared Memory       48KB
  L1 Cache            16KB, 128B line, 4-way, 32 MSHR
  L2 Cache Bank       128KB, 128B line, 8-way, 32 MSHR, 700 MHz
  Memory Partitions   8
  DRAM Clock          1.4GHz
  DRAM Queue          32, FR-FCFS

Table 3.4: List of Evaluated Benchmarks.

  Name                                        Abbr.
  Coherence Benchmarks
  Barnes Hut [11]                             BH
  Cuda Cuts [73]                              CC
  Dynamic Load Balancing [12]                 DLB
  Versatile Place and Route [67]              VPR
  Stencil (Wave Propagation) [67]             STN
  Breadth-First Search [24]                   BFS
  Non-Coherence Benchmarks
  Gaussian Elimination [13]                   GE
  Hotspot [13]                                HS
  Kmeans [13]                                 KM
  Back Propagation [13]                       BP
  SGEMM Matrix Operations [70]                SGM
  Distance-Cutoff Coulombic Potential [70]    CCP

3.6.2 Performance Evaluation

Figure 3.12 shows the performance (execution cycles) of G-TSC and TC with RC and SC, normalized to the performance of a coherent GPU with the L1 cache disabled (therefore enforcing coherence through the shared L2 cache). There are two sets of benchmarks. The first set, shown in the left cluster, are benchmarks that require coherence and will not function correctly without it. The benchmarks in the right cluster do not require coherence. Hence, we show one additional performance bar (the leftmost bar, titled Baseline W/L1) using a baseline with the L1 cache enabled, since these benchmarks do not need coherence and can take advantage of the L1 cache in the baseline. Higher bars in Figure 3.12 indicate better performance. Our results show that the performance difference between RC and SC with G-TSC is smaller than the difference between RC and SC for TC. G-TSC does not incur much stall time due to unexpired leases, as opposed to TC. Hence, the difference between SC and RC with G-TSC is small, sometimes even negligible (e.g., BH, BFS, and most of the applications that do not require coherence, as shown in the right cluster).
For G-TSC, benchmarks that require coherence obtain a 12% speedup with RC compared to SC. The overall average speedup is around 9% across all benchmarks.

Figure 3.12: Performance of GPU Coherence Protocols with Different Memory Models

Interestingly, for one benchmark (CC), SC is better than RC in G-TSC. G-TSC-SC outperforms G-TSC-RC sometimes (e.g., CC) because the NoC traffic is limited by the fact that in SC only one outstanding memory request per warp is allowed. While RC can eliminate certain warp stalls, it generates more coherence messages and allows more requests into the NoC, which happens to have a larger negative impact on performance in CC. As a result, the average network latency goes down and the memory requests can be serviced faster in SC. In CC, we indeed confirm that the average network latency per request in G-TSC-SC is 29% lower than in G-TSC-RC due to a 14% reduction in the memory request generation rate. Previous work [8] observed similar behavior. G-TSC is able to achieve about a 38% speedup over TC with RC, and about an 84% speedup over TC with SC. G-TSC with SC outperforms TC with RC by 26% for benchmarks that require coherence for correctness. These significant performance improvements are mainly due to G-TSC's ability to avoid warp stalling caused by delayed writes and evictions. G-TSC also avoids the stalls caused by GWCT in TC before executing fence instructions. These stalls aggravate the performance penalty in SC, since each warp is allowed to have at most one outstanding memory request. Benchmarks like CCP, HS, and KM (which do not require coherence) do not exhibit a significant difference in performance between G-TSC and TC or between SC and RC. These are compute-intensive benchmarks, so the stalls imposed by the coherence protocol or the consistency model requirements are overlapped with the execution of other non-memory instructions. Figure 3.13 plots the pipeline stalls due to memory delays, normalized to the baseline with no L1 cache. The results show that TC encounters around 45% more stalls than G-TSC for the first set of benchmarks and more than 1.4× the stalls for the second set of benchmarks. The performance of a GPU with the L1 cache enabled is also presented in Figure 3.12 to show the performance overhead of G-TSC for benchmarks that do not need coherence. We report the performance of the second group of benchmarks only, since the presence of an L1 cache with no coherence (which is the case here) affects the correctness of the first group of benchmarks. The figure shows that the G-TSC overhead is around 11% with respect to the non-coherent GPU. It also shows that G-TSC can perform as well as the non-coherent GPU in most cases (e.g., CCP, GE, HS, and SGM). Figure 3.14 shows the performance of G-TSC with different lease periods under RC. G-TSC shows little sensitivity to lease value variation. This insensitivity arises because the lease period in G-TSC is not related to physical time; it represents logical time. Intuitively, only very small and very large lease values may impact G-TSC. Small lease values can affect performance because of excessive renewal requests; they may also aggravate the multiple-reader issue discussed in Section 3.5.2. Large leases cause the timestamps to roll faster and reduce the chance that multiple warps can access a cache block during its lease before renewal. But for the range of lease periods that we explored (8-20 cycles), G-TSC performance is unchanged.
Figure 3.13: Pipeline Stalls due to Memory Delay in G-TSC and TC Normalized to Stalls in the No-L1-Cache Configuration

3.6.3 Coherence Traffic

Coherence traffic in G-TSC and TC is mainly due to lease renewal requests in the L1 cache or fetching new data to replace old data. Since G-TSC conducts its coherence transactions in logical time, it is able to reduce the coherence traffic compared to TC, which conducts its coherence transactions in physical time. Since logical time in G-TSC rolls slower than physical time, more load operations are able to access a cache block during its lease period in the L1 cache. This reduces the number of renewal requests. Another optimization for NoC bandwidth usage is that a renewal response in G-TSC does not require sending the data again. Figure 3.15 shows the NoC traffic for G-TSC and TC with the RC and SC memory models, normalized to the NoC traffic in a coherent GPU with no L1 cache. We see that G-TSC is able to reduce the traffic by 20% over TC with RC and by 15.7% with SC for the first set of benchmarks. Note that the NoC traffic is almost the same for RC and SC in both G-TSC and TC for the second set of benchmarks; these benchmarks do not generate coherence traffic to begin with.

Figure 3.14: Performance of G-TSC-RC with Different Lease Values

Figure 3.16 shows the NoC traffic of G-TSC with different lease values under RC. We see again that G-TSC is insensitive to different lease values. We can observe that G-TSC with lease value 10 shows a slightly better traffic reduction for GE and BP from the second set of benchmarks.

3.6.4 Energy

G-TSC is able to reduce the total energy of the GPU since it enhances performance and reduces the NoC traffic. Figure 3.17 shows the normalized overall energy consumption of the evaluated benchmarks. G-TSC consumed 11% less energy than TC with RC for the first set of benchmarks. RC consumes more energy than SC for some benchmarks, like CC and BH, even though their performance is better. The reason for this behavior is that in SC implementations, the cores remain idle and do not consume much energy (except static energy).

Figure 3.15: NoC Traffic of GPU Coherence Protocols with Different Memory Models

We studied the energy savings of individual components of the GPU, mainly the energy consumed by the L2 cache, the main memory (DRAM and memory controller), and the interconnection network. G-TSC reduces the energy consumed by the L2 cache by 2%, the NoC by 4%, and the other GPU components by 5%. It also saves 1% more energy for the L2 cache, 3% for the NoC, and 5% for the other GPU components over TC. The total energy saving is 11% over the baseline, and 9% over TC for the first set of benchmarks. The results in Figure 3.17 include the energy of the L1 cache. We also present the L1 cache energy consumption in Figure 3.18. The figure shows that TC consumed slightly less energy than G-TSC. The lease value has a minimal impact on the energy consumption of G-TSC. Figure 3.19 shows that the variation in energy consumption due to different lease values is again negligible. Some benchmarks are more sensitive to the lease value, such as GE from the second set of benchmarks. The effect of the lease value is only noticeable with very large or very small lease values.

Figure 3.16: NoC Traffic of G-TSC with Different Lease Values

We see that in GPUs, SC may not always be a bad choice, because it may offer better performance for certain benchmarks (as discussed before) and incur less energy due to the reduced NoC traffic.
With TC, the majority of applications show a big gap between RC and SC. However, G-TSC reduces this gap and makes it much smaller. This makes supporting SC in GPUs feasible, and some recent works have come to the same conclusion [27].

Figure 3.17: Total Energy Consumption of GPU Coherence Protocols with Different Memory Models

Figure 3.18: L1 Cache Energy (in joules) of GPU Coherence Protocols with Different Memory Models

Figure 3.19: GPU Energy Consumption of G-TSC with Different Lease Values

3.6.5 Characteristics of G-TSC

Implementing cache coherence in logical time in G-TSC, rather than in physical time as in TC, introduces some advantages. Kernels that have more load instructions than store instructions do not incur cache misses due to lease expiration, since their timestamps roll slower. Our experiments show that the number of misses due to lease expiration drops by around 48%. This observation allows more accesses to hit in the L1 cache, which indeed translates into a relatively longer lease in physical time. However, multiple results show that G-TSC is insensitive to small variations in lease values. This allows an implementation with relatively small lease values, which limits the speed of timestamp rollover.

3.7 Related Work

The use of timestamps in coherence protocols has been studied in multiple hardware and software protocols. Lamport [42] is one of the earliest efforts to use logical timestamps to order operations in distributed systems and avoid relying on synchronized physical clocks. De Supinski et al. [19] evaluated the performance of the late delta cache coherence protocol, a highly concurrent directory-based coherence protocol that exploits the notion of logical time to provide support for sequential consistency and atomicity for CPUs. Min et al. [51] proposed a software-assisted cache coherence scheme that uses a combination of compile-time marking of references and a hardware-based local incoherence detection scheme. Nandy et al. [53] proposed one of the first hardware coherence protocols that uses timestamps. TSO-CC [20] proposed a hardware coherence protocol based on timestamps. It supports the total-store-ordering (TSO) memory consistency model and requires broadcasting and frequently self-invalidating cache lines in private caches. TC-Release++ [75] is a timestamp-based coherence protocol for RC that is inspired by TC and addresses the scalability issues of efficiently supporting cache coherence in large-scale systems. TC-Release++ eliminates the expensive memory stalls and provides an optimized lifetime prediction mechanism for CMPs. The previous protocols tightly couple timestamps with physical time. Tardis [77] is a timestamp coherence protocol that is based on logical time rather than physical time. Tardis is designed for CMPs and implements SC. G-TSC builds on top of Tardis and focuses on a GPU implementation. G-TSC optimizes the protocol requirements to fit highly multi-threaded GPU cores. An improved version of Tardis (called Tardis 2.0) [76] implements the TSO consistency model and proposes optimized lease policies. Similar to Tardis, Martin et al. [48] proposed a timestamp snooping scheme where processor and memory nodes perform coherence transactions in logical order. The network assigns a logical timestamp to each transaction and then broadcasts it to all processor and memory nodes without regard for order.
Self-invalidation in private caches has been explored in the context of cache coherence. Dynamic self-invalidation (DSI) [44] reduces cache coherence overhead and reduces invalidation messages by speculatively identifying which blocks to invalidate when they are brought into the cache, but deferring the actual invalidation to a future time. DSI still requires explicit messages to the directory to acknowledge self-invalidation. DSI can reduce the traffic by using tear-off blocks that are self-invalidated at synchronization instructions. A similar idea is proposed in [64], which extends the tear-off blocks to all cache blocks in order to entirely eliminate coherence directories. Last-Touch Predictors (LTP) [41] trigger speculative self-invalidation of memory blocks in distributed shared memory.

3.8 Conclusion

This chapter proposed G-TSC, a timestamp-based GPU cache coherence scheme that reduces coherence traffic. Different from previous work on time-based coherence for GPUs, G-TSC conducts its coherence transactions in logical time. We implemented G-TSC in GPGPU-Sim and used 12 benchmarks in the evaluation. When using G-TSC to keep coherence between the private L1 caches and the shared L2 cache, G-TSC outperforms TC by 38% with release consistency. Moreover, even G-TSC with sequential consistency outperforms TC with release consistency by 26% for benchmarks that require coherence for correctness. For the same benchmarks, the memory traffic is reduced by 20%.

Chapter 4

An Efficient Sequential Consistency Implementation with Dynamic Race Detection for GPUs

4.1 Introduction

As described in prior chapters, GPUs are highly multithreaded processors designed for graphics processing. They are optimized for data-parallel execution. Due to their high throughput and power efficiency, GPUs have emerged as a processing platform of choice for a wider range of general-purpose applications. GPUs, driven by graphics workload characteristics, mostly implement a relaxed memory consistency model aided by a simple software-driven coherence protocol that uses coarse-grain synchronization at GPU kernel execution boundaries [66, 25]. The memory model implemented in GPUs assumes data-race-free kernel execution. Hence, it assumes there is no data sharing between threads during kernel execution. The GPU memory model allows instructions to be executed in any order between membars. Membars are synchronization instructions used to order the memory operations executed by a single thread. The GPU memory model does not guarantee that the updates by different warps are observed in a certain order during kernel execution. It guarantees that the updates are made visible to all threads by the end of the kernel execution [66, 68, 4]. These protocols invalidate the private caches at the start of a kernel and drain the updated data to main memory by the end of the kernel [66, 5]. Atomic operations, which are used for fine-grain synchronization, are executed at the shared L2 cache (i.e., they bypass the private L1 caches) [66, 5]. Such a protocol can ensure sequential consistency for data-race-free (SC for DRF) programs [3], given that there is no data race within the kernel execution. As GPUs broaden their appeal to general-purpose applications, there is an equal need to explore innovative, more complex memory models [66]. The sequential consistency (SC) memory model is one of the strictest memory models, and it is also easier for programmers to understand and to reason about load operation outcomes.
However, system designers avoid implementing SC due to its strict ordering restrictions that limit performance optimizations. Recent efforts have tried to design coherent systems that support SC only under a narrow set of assumptions [3]. For instance, SC with data-race-free (DRF) protocols relies on programmers to write data-race-free programs and relies mainly on cache self-invalidation (SI) at possible race accesses. DRF semantics allow hardware designers and compilers the flexibility to reorder memory operations at lower hardware cost while maintaining the sequential consistency of the system. DRF semantics require programmers to declare any possible race condition as a synchronization, even if it does not necessarily turn into an actual race at runtime. In fact, many race conditions do not turn into actual races during execution [74, 63].

In this chapter we propose a novel approach to implementing sequential consistency in GPUs with minimal performance loss. In the previous chapter we described the G-TSC coherence protocol and described how we can achieve sequentially consistent execution by relying on G-TSC. Even though G-TSC allows for an efficient coherence implementation, it still suffers from unnecessary data movement overheads while implementing sequentially consistent execution, as shown in the following example. Let us consider the execution of the two scenarios shown in Figure 4.1 on a GPU that implements SC on top of the G-TSC protocol presented in the previous chapter. The figure shows how the execution of instructions from two different SMs is interleaved. In the scenario shown in Figure 4.1a, instructions A1, B1, and B2 fetch the memory locations A and B into the private caches of SM0 and SM1 with a lease of [1,10]. The store operation in B3 will have a timestamp of 11, the warp_ts in SM1 will be bumped up to 11, and the lease of the new value at [B] is [11,20]. Hence, the load operation at instruction A2 will cause the warp_ts at SM0 to advance to 11. Because of the warp_ts adjustments due to A2 and B3, the cached copies of [A] are no longer accessible, since the warp_ts values are outside their lease. As a result, the accesses to [A] in A3 and B4 will be delayed until a lease extension is granted by the shared L2 cache, although the cached data has not changed. These renewal requests to access [A] are unnecessary, since the cached data is not updated. The second scenario, shown in Figure 4.1b, illustrates another possible optimization to implement SC. Similar to the previous scenario, instructions A1 and B1 fetch the addresses [A] and [B] into the private L1 caches of SM0 and SM1, respectively. The store operation in instruction B2 sets the lease for the new value at [B] to [11,20]. Then, the load operation in A2 adjusts the warp_ts at SM0 to 11. The store operation in instruction B3 sets the lease for the new value at [A] to [11,20]. Because the warp_ts at SM0 is beyond the validity of the cached data at [A], the load operation at A3 is forced to fetch the new value of [A]. However, the load operation at instruction A3 could return the old cached value fetched by A1 without violating the sequential consistency ordering rules. These two simple scenarios show that we can improve the implementation of SC on GPUs if we can guide the self-invalidation to invalidate only the updated blocks and delay the propagation of updated data when that delay does not cause a sequential consistency order violation. This observation forms the foundation for the work described in this chapter.
The above observation can be realized by employing lazy update propagation, where the propagation of updated data is delayed until it becomes necessary. The question, therefore, is when the updated data needs to be communicated to safely implement SC. We argue that one elegant point at which to propagate updated values is when a race is detected between different SMs; a race is defined as two competing memory accesses to the same location originating from different SMs, where at least one access is a store operation. By implementing an efficient race detection mechanism to guide the lazy update propagation and a smart self-invalidation technique to invalidate only the updated data, we can have a better implementation of SC on GPUs.

Figure 4.1: Motivational Example for Dynamic Race Detection Scheme for SC in GPUs.

In this chapter, we propose a cost-effective hardware-assisted dynamic race detection mechanism that detects races at runtime to implement SC. We use a signature-based mechanism to track the write set of each SM and use that signature to detect races across multiple SMs. The scheme essentially takes into consideration the fact that dozens of SMs contend for a shared variable only occasionally, and as such, keeping track of the complete write set information is expensive and unnecessary. We also propose an epoch-based cache access validation mechanism that can be used to replace the mass cache self-invalidation approaches used in the CPU and GPU domains [66, 16, 44, 52, 63]. The main assumption of our work is that if we are able to detect races at runtime and enforce coherence at that time, we can achieve memory ordering by applying a synchronization fence at the reader. The proposed scheme detects only actual race conditions and avoids the potential race conditions that do not actually occur. Even though there have been approaches to detect races by monitoring contending accesses to variables in the CPU domain [52, 54, 58, 59, 63], none of these works target implementing SC for GPUs. Our contributions in this work are:

- The design of an SC protocol for GPUs that detects races at runtime without programmer annotations. It also eliminates explicit invalidation messages and the hardware directory overhead needed to track data sharing information.
- A signature-based, hardware-assisted dynamic race detection mechanism that can detect races at runtime.
- A simple yet efficient signature-based access validation mechanism that can be used to determine whether the data cached in a private cache has been modified or not. This mechanism can replace mass cache self-invalidation techniques that invalidate the whole cache, as well as selective self-invalidation techniques that need some sort of cacheline classification.
- A simple mechanism that allows the load/store unit in the SM to execute memory operations speculatively and then verify their correctness.

Our evaluation shows that our dynamic race detection approach can achieve SC with a performance overhead as low as 5%.

4.2 Hardware-Assisted Dynamic Race Detection

4.2.1 Overview

Figure 4.2: Simple Example of Dynamic Race Detection Scheme

Hardware-assisted race detection has been used in prior work for tasks such as parallel program debugging and to implement specific memory consistency models [59, 63]. But to the best of our knowledge, prior works have not used race detection to implement memory models for GPUs, which is what we propose in this work. Memory model implementation requires a mechanism to propagate updated data from the producer to the consumer.
Some coherence protocols, which are responsible for update propagation, employ some variation of eager update propagation, where the updated data is propagated to the consumer as early as possible. On the other hand, lazy update propagation delays the update propagation until it becomes necessary to propagate the updates. Multiple updates can be accumulated and then propagated all at once. Lazy update propagation needs to be guided to determine when the update propagation becomes necessary. DRF semantics suggest that update propagation becomes necessary when a race is detected between two different processors. A race is defined as two conflicting accesses from two different processors that are executed one after the other without another race in between [3]. Two accesses are considered conflicting if they access the same location and at least one of them is a write [3]. According to DRF semantics, a system can implement SC if it is able to detect races and then apply the updates upon race detection.

Our approach consists of a hardware-assisted dynamic race detector which detects races efficiently. Once a race is detected, the racy read is classified as a synchronization acquire operation. Then a synchronization fence is applied to automatically enable the desired memory ordering semantics. Also, a store operation is classified as a synchronization release operation, and a synchronization fence is applied at the store to enforce the desired memory ordering. The dynamic race detection scheme does not require exclusive ownership before writing, which helps eliminate invalidation requests. A write set of all the updates that occur during the execution is accumulated, and the updates are propagated lazily once a synchronization operation is executed.

We will explain the operation of our scheme first using an example with only two SMs, shown in Figure 4.2. In our execution model, each SM's private L1 cache is a write-through cache which writes data to a shared L2 cache. The execution of any kernel on an SM consists of epochs bounded by synchronization points. These synchronization points are created either by a racy load (synchronization acquire), by a store (synchronization release), or due to other events (which will be discussed later). Each SM has its own epoch, which is independent from other SMs. The set of pending updates is propagated to the SM at synchronization point creation. Splitting the execution into epochs is used to check the validity of the cached data and verify that the cached data has not been changed (which is explained later). The bold arrow in the figure represents time; the actions above it belong to SM0, while the ones under it belong to SM1. SM1 issues a read request to address [A], which triggers a cache miss, and the block is fetched and cached in the private cache. Then SM0 issues a store to address [A] followed by a store to address [B]. Both stores from SM0 create synchronization points for SM0 and update its epoch from 0 to 1 and then 2. Each store address from SM0 is used to build a signature for SM1. The signature contains all the addresses whose values are not yet visible to SM1. Note that even though address [A] is present in the private cache of SM1, no invalidation request is sent to SM1. Then SM1 issues a read request to address [B]. Since [B] is not in the cache, SM1 reads it from L2. This read from L2 triggers a race that is detected by the race detection unit (RDU).
This read operation is treated as a read acquire operation and hence creates a synchronization point that marks the beginning of execution epoch 2 of SM1. The SM is then made aware of the "unseen" updates. The execution of SM0 continues with two store operations to addresses [C] and [D]. Both stores create synchronization points for SM0, and their addresses are added to the unseen updates set. Although a store operation may not be involved in a race, it must create a synchronization point for the executing SM. The SM must be aware of all the previous updates before the store is performed, so that any following load executed after the store does not return a value overwritten by a recent store operation from another SM. SM1 issues a read request to address [A], which hits in its private cache, but the SM knows that address [A] has been updated and the cached copy is stale. Hence the access results in a coherence miss and the new version of the block is fetched. Thus a separate access validation unit (AVU) is used to check whether a cache access in the current epoch is valid or not, based on the updates seen in the previous epoch. The read request to address [D] by SM1 triggers a race and creates a synchronization point that marks the beginning of a new execution epoch (epoch 3). SM1 then issues two read operations to address [C]. The first one misses in the private cache and fetches the data from memory. The second one hits in the cache even though the SM knows that the block has been updated. This is because the block was fetched in the current epoch, so the SM assumes that there are no updates to that address. In the next subsection we discuss our implementation of the race detection and its usage.

4.2.2 Dynamic Race Detection Implementation

As evident from the example, dynamic race detection is the key to enforcing SC in our proposal. Race detection as described above relies on detecting conflicts between reads and writes. As such, our dynamic race detection scheme requires two structures: the first structure keeps a record of the addresses in the unseen updates set for each SM, namely all writes made by all other SMs that have not been seen by the current SM, and the other structure keeps a record of the addresses of the seen updates (propagated updates), which is used to validate the cached data (i.e., check whether it has been updated since it was fetched). These structures are called the Race Detection Unit (RDU) and the Access Validation Unit (AVU), respectively. The additional hardware components are shown in Figure 4.3.

Race Detection Unit (RDU)

Figure 4.3: Proposed Architecture to Implement Dynamic Race Detection Scheme in GPUs

The RDU is a centralized structure that keeps track of the addresses of the unseen updates for each SM. It detects races by comparing the incoming address in the race-check requests originating from an SM, which are essentially a subset of the loads issued by the SM as we explain later, with the unseen update set of the requesting SM. Since the unseen update set represents the addresses of the write operations from all other SMs, the RDU is able to detect any RAW race by checking the presence of the incoming load or store address from the current SM in the unseen update set. Keeping a plain list of all the addresses in the unseen update set and navigating through that list for each race-check is a lengthy process and also requires significant storage. Hence, we propose to use a signature-based structure instead of keeping the whole list of addresses.
The signatures are statistical structures based on Bloom filters [10] that can give a false positive result but never a false negative. Thus each per-SM unseen update signature captures the entire write set from all other SMs conservatively. As explained above, the signature captures all write addresses with certainty, but it may also have hash collisions with additional addresses that are not part of the unseen update set. This conservative approach guarantees program correctness at the expense of some additional addresses that may be falsely claimed as part of the unseen update set. The RDU receives two types of messages: write requests and race-check requests. Since all private caches are write-through caches, all writes by all SMs are sent to the shared L2 cache and to the RDU as well. The RDU then uses the address in the write request to update the signatures of all SMs except for the writer (the SM that sent the request). The write operation also creates a synchronization point for the writing SM that marks the end of the current epoch and the start of a new epoch. Race-check requests are triggered by some read operations in the private caches, which enables the RDU to check for possible races (read-after-write accesses). Although a race-check could be performed for every read operation, this approach is expensive and may affect performance, since a read is not considered executed until the race-check result is known. For that reason, the proposed scheme tries to reduce the number of race-checks as much as possible. Based on the fact that a read operation can return an old value if it can be ordered before any write operation that overrides that value, the proposed scheme may skip the race-check for a read operation that hits in the private cache. However, allowing a block to hit in the private cache indefinitely may result in a livelock situation. Hence, a race-check is also triggered when a block hits in the private cache multiple times in a row (4 consecutive hits in our baseline). On the other hand, a read operation that misses in the private cache may read data produced by a recent write operation (a racy write); hence, it needs to perform a race-check. If a racy read returns a value generated by a recent racy write, all the updates that were executed before the racy write must be propagated as well. Since the signature does not preserve the order of the inserted updates, it is necessary to observe all the accumulated updates. The race-check request checks the membership of an address in the SM's signature. It initiates the race-detection mechanism if the requested address hits in the signature. The race-detection mechanism creates a synchronization point in the SM's execution timeline. The creation of a synchronization point, triggered by either a write request or a race detection, marks the end of the current epoch, copies the signature of the requesting SM from the RDU to the access validation unit (which we explain next), resets the signature in the RDU to collect a new set of updates in the new epoch, and increments the epoch number of the requesting SM by 1. Requests from each SM are sent to the RDU and processed in their arrival order, which matches the program order and is not allowed to be reordered. The RDU can force a synchronization point when the number of accumulated unseen updates reaches some threshold, even if no race is detected.
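The per-SM signatures and the RDU actions just described might be organized roughly as follows. This is a hedged sketch: the signature width, the hash functions, and all structure names are illustrative assumptions, not the hardware design evaluated in this chapter.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

// Illustrative Bloom-filter signature of "unseen" write addresses for one SM.
class Signature {
    static constexpr size_t kBits = 1024;
    std::bitset<kBits> bits_;
    static size_t h1(uint64_t a) { return (a * 0x9E3779B97F4A7C15ull) % kBits; }
    static size_t h2(uint64_t a) { return (a ^ (a >> 17)) % kBits; }
public:
    void insert(uint64_t addr)            { bits_.set(h1(addr)); bits_.set(h2(addr)); }
    bool may_contain(uint64_t addr) const { return bits_[h1(addr)] && bits_[h2(addr)]; }
    void clear()                          { bits_.reset(); }   // all bits back to zero
};

class RaceDetectionUnit {
    std::vector<Signature> unseen_;   // one unseen-update signature per SM
public:
    explicit RaceDetectionUnit(size_t num_sms) : unseen_(num_sms) {}

    // A write from 'writer' becomes an unseen update for every other SM.
    // The write also creates a synchronization point for 'writer' itself
    // (its own signature is handed to its AVU and cleared), omitted for brevity.
    void on_write(size_t writer, uint64_t addr) {
        for (size_t sm = 0; sm < unseen_.size(); ++sm)
            if (sm != writer) unseen_[sm].insert(addr);
    }

    // A race-check hit creates a synchronization point: the signature is handed
    // to the SM's AVU and the RDU starts accumulating a fresh epoch. A sync
    // point can also be forced once a signature accumulates too many updates.
    bool race_check(size_t sm, uint64_t addr, Signature& to_avu) {
        if (!unseen_[sm].may_contain(addr)) return false;
        to_avu = unseen_[sm];
        unseen_[sm].clear();
        return true;      // caller also increments the SM's epoch number
    }
};
```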
Such forced synchronization points keep the probability of false positives in the Bloom filter signature at an acceptable level. As explained above, a write from one SM does not necessarily invalidate the private cache lines in other SMs. As such, an SM may continue to hit in its private cache multiple times, even on stale data, instead of detecting a race and observing the new data. Note that these hits to stale data do not violate the memory ordering rules; these read hits can be ordered so that they appear to have executed before the write to that address in some global order. However, to reduce the staleness, we allow a cache line to hit a certain number of times before it needs to be checked with the RDU for a possible race. In our evaluation, we set the number of allowed hits before a possible race-check to four, based on the observation that a block is accessed at most four times before it is evicted in the evaluated benchmarks.

Access Validation Unit (AVU)

There are multiple ways to propagate a set of updates at once. Self-invalidation is one of the most common ways to do so. There are two types of self-invalidation: bulk self-invalidation and selective self-invalidation. Bulk self-invalidation flushes the whole private cache, which forces the SM to go to the L2 cache for all subsequent accesses. Although this approach is simple, it is not efficient and may worsen cache performance. Selective self-invalidation is another approach, but it needs some guidance to determine which blocks to invalidate. In our scheme, we adopt selective invalidation. The structure that is responsible for selective self-invalidation is called the access validation unit (AVU). The AVU is a per-SM structure that keeps track of the seen updates. It uses this information to validate the accesses to a cached block. As mentioned in the race-detection mechanism, when a synchronization point is created in a given SM, the signature in the RDU is copied to the AVU associated with that SM, which then holds the information about all the addresses updated up to that synchronization point (i.e., during the previous epoch). The AVU can, in principle, hold multiple signatures from a set of prior epochs.
If the block is evicted or replaced in the private cache, then it needs to be fetched again. Access to a block that is fetched in the previous epoch and still in the private cache: is checked with the signature in the AVU and access is granted unless it also hits in the AVU. A hit in AVU implies that in the previous epoch at least one other SM has written to that address. A hit in AVU is essentially considered as a miss and the updated version is fetched from shared L2 cache. Access to a block that is fetched two or more epochs ago and still in the private cache: is considered a miss and is re-fetched again from the L2 cache. Even though the block may not be updated since it is fetched but AVU cannot guarantees about that. Some of these misses can be avoided if AVU is congured to keep multiple signatures for multiple epochs. 91 However, multiple signatures AVUs encounter higher energy consumption and longer access laten cy that may aect the L1 cache access latency. Access to a block that is not in the cache: is sent to the lower level cache to fetch the block. The AVU is not used to validate the access because fetching the block from L2 cache guarantees an up-to-date version. (a) (b) Figure 4.4: Example of the Mechanism of the RDU. We will use a simple example to explain the operation of the RDU. Assume that SM0 and SM1 updates memory locations A and B, respectively as shown in gure 4.4a. Both these memory locations are sent to RDU. Memory location A is inserted into the signatures of SM1, SM2 and SM3 but not SM0 since SM0 initiated the store to location A. Similarly, for memory location B SM0, SM2 and SM3 signatures are updated. Note that SM2 and SM3 private caches are not aware of the updates to memory locationA andB until they tried to execute their reads as shown in gure 4.4b. The read operation in step (4.4b) is sent to RDU which detects a race due to RAW. As a result of the race detection, the signatures in RDU are cleared and their contents are sent to the AVU of that SM (clearing a signature is as simple as resetting all bits to zeros). Figure 4.5 shows an example of a kernel execution under dynamic race detection. S 0 and S 1 are synchronization points. Updates happened in epoch 2 (after S 1 ) are accumulated at the RDU 92 Figure 4.5: Example of execution of that shows what updated locations are kept in the RDU and which are kept in the AVU while the updates in the previous epoch (epoch 1) have been transferred to the AVU in the L1 cache when S 1 is created in order to verify that no access is granted to any block that is updated during epoch 1. Any block that is fetched before S 0 (i.e. in epoch 0 or earlier) is cannot be accessed since the signature of the updates before S 0 is lost. Hence, any block that is fetched before S 0 is always considered a miss. AVU can be extended to keep multiple signatures but our analysis shows that the number of hits due to this extension is very small and performance gain does not justify the additional hardware complexity and energy overhead. 4.2.3 GPU Speculative Execution Details Since GPUs do not have a mechanism to roll back execution state (which is also our baseline assumption), we cannot issue multiple requests to L2 cache at the same time because there is no guarantees that memory operations are executed in order (because of hits and misses in dierent memory system). If an outstanding memory operation resulted in a synchronization point creation, then all subsequent operations need to be validated against the new signature. 
4.2.3 GPU Speculative Execution Details

Since GPUs do not have a mechanism to roll back execution state (which is also our baseline assumption), we cannot issue multiple requests to the L2 cache at the same time, because there is no guarantee that memory operations are executed in order (due to hits and misses in different parts of the memory system). If an outstanding memory operation results in the creation of a synchronization point, then all subsequent operations need to be validated against the new signature. For this reason, memory operations are stalled when a read operation misses in the L1 cache or a write operation is issued. To clarify this issue, consider a load operation (ld1) that misses in the L1 cache, followed by another load operation (ld2) that hits in the L1 cache and passes the AVU check. If ld2 is allowed to complete while ld1 is still outstanding, the read-read order honored by SC can be violated. Moreover, if ld1 causes a race and the memory location of ld2 hits in the new signature, then ld2 violates the SC ordering and needs to be re-executed to observe the new value. Hence, our baseline mechanism may lead to unnecessary stalls to enforce SC. We evaluated the performance of the proposed scheme under an ideal configuration, where the race-check returns its result without any stall overhead and the creation of synchronization points is done instantly. The performance overhead was around 7% compared to a baseline GPU that does not provide any SC guarantees. We also evaluated the performance under the normal configuration, where we considered the delay for race-checks and synchronization point creation. We found that the performance overhead increased to 20%, and hence the difference between the ideal and normal configurations was around 13%. While 20% may be justifiable for providing SC in some scenarios, we explored options for further reducing the overhead due to stalls. We realized that the performance can be enhanced if we are able to reduce or eliminate the stalls in the load/store unit spent waiting for a race-check result to come from the RDU or for a synchronization point creation to complete.

To reduce or eliminate the unnecessary stalls in the memory system upon read misses or writes, we propose to execute memory operations even under a miss, without waiting for a confirmation from the RDU, and then validate them when the missed access is resolved. If the missed access does not trigger a race at the RDU, then all speculative accesses can be converted from speculative to non-speculative (final). On the other hand, if the missed access triggers a race in the RDU, then the speculative accesses need to be validated against the updated new signature from the RDU. Any memory operation that fails the validation (i.e., its address hits in the new signature) must be re-executed. This speculative execution allows multiple outstanding misses to go to the lower-level cache without stalling the pipeline. The main constraint is that the RDU must process misses from the same SM in their program order (i.e., the same execution order as in the LDST unit) to fulfill the sequential consistency constraints. This ordering can be ensured by the RDU input queue.

Performing speculative execution in GPUs has been investigated recently. Menon et al. [50] and Kim et al. [39] proposed frameworks that enable GPUs to execute instructions speculatively. Even though we could adapt one of these frameworks to work with the dynamic race detection memory system, we propose a much simpler scheme for speculative execution of just the loads, while still honoring the memory ordering demands. Our proposed speculative execution framework relies on the register file underutilization shown in prior works [1, 23, 33, 31]. We exploit this underutilization to implement speculation. In our approach, we propose to use the unused registers in the register file to hold the speculative values of the speculative instructions.
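One way the spare-register bookkeeping could be organized is sketched below. It is a hypothetical illustration of the idea (hold speculative results in otherwise unused physical registers, and release them on validation or squash), not the exact renaming table described next; all names and sizes are assumptions.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: spare physical registers hold speculative results until
// the outstanding race-checks resolve.
class SpecRegisterPool {
    std::vector<uint16_t> free_regs_;                   // unused physical registers
    std::unordered_map<uint32_t, uint16_t> renamed_;    // (warp, reg) key -> spare register
public:
    explicit SpecRegisterPool(std::vector<uint16_t> spare)
        : free_regs_(std::move(spare)) {}

    static uint32_t key(uint32_t warp, uint32_t reg) { return (warp << 8) | reg; }

    // Returns false if no spare register is available (the warp must stall).
    bool hold_speculative(uint32_t warp, uint32_t reg) {
        if (free_regs_.empty()) return false;
        renamed_[key(warp, reg)] = free_regs_.back();
        free_regs_.pop_back();
        return true;
    }

    // Called when validation succeeds (the value is copied to its home register)
    // or when the instruction is squashed for replay: the spare register is freed.
    void release(uint32_t warp, uint32_t reg) {
        auto it = renamed_.find(key(warp, reg));
        if (it == renamed_.end()) return;
        free_regs_.push_back(it->second);
        renamed_.erase(it);
    }
};
```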
The speculative instructions include the speculative memory operations that hit in the L1 cache under an outstanding miss, as well as any instruction (ALU, SFU, or LDST) that depends on them. The speculative execution framework requires a simple renaming table that associates each renamed register with an actual physical register, so that the speculative values can be copied to the actual physical registers upon validation. This approach has been detailed in prior work [1, 23, 33, 31, 39].

In conventional GPUs, there are a total of 1024 physical wide registers in the per-SM register file. Each register is typically 1024 bits wide, storing the value of a register for all threads in a warp (32-bit register × 32 threads/warp). The registers are allocated to warps in a sequential manner based on how many registers are requested by the launched kernel, and the remaining unassigned registers are available for speculative execution. Each warp is allowed to use up to 63 registers [39, 31]. Hence the register renaming table has at most 3024 entries (48 warps × 63 registers/warp), and each entry is a 10-bit physical register ID. The register renaming table also keeps track of speculative and temporary non-speculative results. Speculative results are the results of speculative hits that have not yet been validated, together with the results of all dependent operations (operations that use a speculative result as an input). The results of operations that execute after a speculative instruction but do not depend on a speculative result are considered temporary non-speculative results.

The AVU keeps track of the memory address and the PC of each speculative read operation, along with the number of outstanding memory operations at the time the speculative read is performed. The outstanding-operation count of each entry is decremented by one when a race-check response is received from the RDU. If a race is detected and the signature in the AVU is updated, the recorded addresses are also validated against the new signature. When the number of outstanding memory operations of an entry reaches zero and its address has been validated against all previous signatures, the speculative read operation is declared correct; all dependent speculative results, as well as the temporary non-speculative results in the register renaming table, are declared final, and their reserved registers (the registers used to hold speculative values) are freed. On the other hand, if the entry at the top of the queue has zero outstanding memory operations but does not pass validation against one of the new signatures, then its PC is resent to the SM for re-execution. Only the failed memory instruction and its dependent instructions are re-executed; the temporary non-speculative values are committed without re-execution.
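The bookkeeping just described can be captured by two small structures: a renaming table indexed by (warp, architectural register) and an in-order queue of speculative loads. The C++ sketch below is a rough model under those assumptions; names such as SpecTracker and onRaceCheckResponse, and the placeholder bodies of replayFrom and commitDependents, are illustrative and not part of the actual design.

    // Rough model of the speculative-load bookkeeping (illustrative names and sizes).
    #include <cstdint>
    #include <deque>
    #include <vector>

    enum class RegState : uint8_t { Final, Speculative, TempNonSpeculative };

    struct RenameEntry {
        uint16_t physReg;   // 10 bits suffice for 1024 physical registers
        RegState state;
    };

    struct SpeculativeLoad {
        uint64_t blockAddr;      // checked against any new AVU signature
        uint32_t pc;             // resent to the SM if validation fails
        int      outstandingOps; // pending race-check responses when the load issued
        bool     failed = false; // set if the address hits in a later signature
    };

    struct SpecTracker {
        // Up to 48 warps x 63 architectural registers = 3024 entries.
        std::vector<std::vector<RenameEntry>> renameTable;
        std::deque<SpeculativeLoad> pending;   // kept in per-SM program order

        // Called for every race-check acknowledgement that returns from the RDU.
        // If the response carried a new signature, the caller first marks every
        // pending load whose address hits in that signature as failed.
        void onRaceCheckResponse() {
            for (SpeculativeLoad &ld : pending) --ld.outstandingOps;
            while (!pending.empty() && pending.front().outstandingOps <= 0) {
                if (pending.front().failed) replayFrom(pending.front().pc);
                else                        commitDependents();
                pending.pop_front();
            }
        }

        void replayFrom(uint32_t pc) { (void)pc; /* squash dependents, resend pc for re-execution */ }
        void commitDependents()      { /* copy speculative values to the real registers, free copies */ }
    };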
4.2.4 An Example of Dynamic Race Detection with Speculative Execution

The speculative execution of the dynamic race detection memory system is best described with an example. Consider two memory loads (ld1 and ld2) followed by an ALU operation (op1) that depends on the value returned by ld2. Assume that ld1 misses in the L1 cache; the request is sent to the L2 cache to fetch the data, and a race-check request with the accessed memory location is also sent to the RDU to check for a possible race. The SM proceeds with the execution of the next instruction, which is ld2. Assume that ld2 hits in the L1 cache and the value is read. Meanwhile, the execution of ld2 is considered speculative even though it passed the validation check in the AVU: if ld1 triggers a race detection in the RDU and the AVU receives a new signature, then ld2 needs to be re-validated against the new signature in the AVU. Execution continues with op1 as well, and its result is also considered speculative, since it is based on the result of a speculative instruction. The status of ld2 and op1 changes once the race-check response for ld1 is resolved. If the RDU reports that no race was detected, then the execution of both ld2 and op1 is converted to non-speculative without any further action. If a race is detected, ld2 is re-validated against the new signature in the AVU, and if it fails the validation, both ld2 and op1 are squashed and re-executed. The re-execution of ld2 and op1 is no longer considered speculative.

Figure 4.6 shows the code executed on two different SMs. According to the SC rules, the combination of values read by instructions A3 and B3 cannot be (A=0, B=0), but all other combinations are acceptable (i.e., [A=1, B=1], [A=0, B=1], or [A=1, B=0]). We will show that the dynamic race detection mechanism always guarantees one of the allowed combinations and never allows the forbidden combination. Consider two different execution scenarios: in the first, both SMs run at a similar pace; in the second, SM0 executes at a much faster pace. Assume all variables are set to 0 at the beginning of the execution.

Figure 4.6: Example Code to Explain the Dynamic Race Detection.

First Scenario: When both SMs execute at a similar pace, the instructions are executed in an interleaved fashion. Both SMs load the cache blocks at addresses A and B and place them in their private caches; these blocks are fetched in execution epoch 0 for both SMs. Then SM0 updates the value of [A] to 1 (instruction A2). Since the L1 cache is a write-through cache, the store is sent to the L2 cache and its address is sent to the RDU. The RDU inserts address A into the signature of SM1 and sends the current signature of SM0, together with an acknowledgement, to SM0. Upon receiving the RDU acknowledgment, SM0 creates a synchronization point, which involves replacing the signature in its AVU with the signature in the acknowledgement and incrementing the epoch number to 1. Since there are no updates unseen by SM0, the new signature in its AVU is empty. The same procedure is followed by SM1 when it executes instruction B2, which updates the value of [B]; the new signature received with the acknowledgment from the RDU contains the address of A. After that, SM0 tries to read [B] (instruction A3). It finds the block in its private cache, but since the block was fetched in the previous epoch, the access needs to be validated by the AVU. The address of [B] does not hit in the AVU signature, which means that SM0 has not observed any update to the cached data, so the access is granted. SM1 then tries to access block [A] in its private cache (instruction B3). Since the cached block was fetched in the previous epoch, the access needs to be validated against the signature in the AVU. The validation fails, because the address of block [A] is present in the AVU signature, meaning that the cached block at address [A] has been updated by another SM and an updated version needs to be fetched. The new data is fetched from the shared L2 cache and placed in the private cache.
A race-check request is sent to the RDU along with the miss request, to check whether there is another update to address [A] beyond the one just observed. This race check is necessary to ensure that, if there is another update to the same address, the updates that occurred between the synchronization point and that later update are observed before the later update itself is observed. After that, SM0 updates the value of address [C] (instruction A4). The store operation sends the address of [C] to the RDU and creates a synchronization point, similar to the one created when SM0 performed instruction A2. After the completion of instruction A4, the AVU signature of SM0 contains the address of [B], which means that the update performed by SM1 on block [B] has now been observed by SM0. The read operation to address [C] by SM1 (instruction B4) results in a miss in the private cache, so the request is sent to the shared L2 cache and a race-check request is sent to the RDU. The race check detects a race at address [C], which creates a synchronization point in SM1's execution. The RDU signature of SM1 is sent to the AVU in SM1 and the epoch count is incremented to 2. The memory operations in this example are ordered as {A1, B1} → A2 → A3 → B2 → B3 → A4 → B4, where A1 and B1 can be ordered in any way (concurrent instructions). Figure 4.7 shows the timeline of the execution of this scenario.

Figure 4.7: The Timeline of the First Scenario. The Square Brackets Next to the Epoch Number Show the Contents of the AVU Signature.

Second Scenario: Consider the scenario in which SM0 executes its code at a much higher rate than SM1, such that SM0 executes its whole code before SM1 executes any of its instructions. We again assume that instructions A1 and B1 are executed at the beginning of the execution, so that the blocks at addresses A and B are cached in the private caches. The store to address [A] by SM0 (instruction A2) creates a synchronization point as explained in the first scenario. The read operation to address [B] (instruction A3) is able to access the privately cached copy, since the block was fetched in the previous epoch and the access passes the AVU validation procedure. The store to address [C] (instruction A4) creates a synchronization point and updates the signature in SM0's AVU; since there are no unseen updates, the AVU signature in SM0 is empty. On the other hand, when SM1 executes the store to address [B], the store creates a synchronization point and the AVU signature is updated. The new AVU signature contains the addresses of [A] and [C]. Hence, when SM1 tries to access the cached copy of block [A] (instruction B3), the access fails the AVU validation and the updated data is fetched from the shared L2 cache. The read operation to address [C] (instruction B4) misses in the private cache; the request is sent to the shared L2 cache to fetch the block and a race-check request is sent to the RDU as well. The race check comes back negative. The memory operations in this example are ordered as {A1, B1} → A2 → A3 → A4 → B2 → B3 → B4, where A1 and B1 can be ordered in any way (concurrent instructions). Figure 4.8 shows the timeline of the execution of this scenario. A similar result is obtained when SM1 executes its code at a higher rate than SM0.

Figure 4.8: The Timeline of the Second Scenario. The Square Brackets Next to the Epoch Number Show the Contents of the AVU Signature.

As we can observe from the scenarios discussed above, the values read by instructions A3 and B3 are always compatible with the values allowed by the sequential consistency rules; the dynamic race detection scheme never allows these two instructions to observe the forbidden pair of values.
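Since Figure 4.6 itself is not reproduced in this transcript, the following sketch restates the example as ordinary C++ code, reconstructed from the walkthrough above. The variable names, the use of plain volatile globals, and the exact form of the warm-up instructions A1 and B1 are assumptions made for illustration only.

    // Reconstruction of the Figure 4.6 example (illustrative; not the original listing).
    // A, B, and C are shared memory locations, all initialized to 0. Under SC, the
    // outcome a3_result == 0 && b3_result == 0 is forbidden; the other three
    // combinations of the two loaded values are allowed.
    volatile int A = 0, B = 0, C = 0;
    int a3_result = 0, b3_result = 0, b4_result = 0;

    void sm0_code() {
        int wA = A, wB = B;          // A1: warm the private cache with blocks A and B
        (void)wA; (void)wB;
        A = 1;                       // A2: store; written through to L2, address sent to the RDU
        a3_result = B;               // A3: load B (validated against the AVU signature)
        C = 1;                       // A4: store; creates another synchronization point
    }

    void sm1_code() {
        int wA = A, wB = B;          // B1: warm the private cache with blocks A and B
        (void)wA; (void)wB;
        B = 1;                       // B2: store
        b3_result = A;               // B3: load A (fails AVU validation in both scenarios)
        b4_result = C;               // B4: load C (misses; triggers a race check at the RDU)
    }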
4.3 Evaluation

4.3.1 Simulation Environment

We used GPGPU-Sim [8] v3.2.2 to build and simulate the dynamic race detection scheme, with the NVIDIA Fermi architecture (GTX 480) as the baseline architecture, and GPUWattch [46] to estimate the power consumption of the system. The simulated GPU includes 16 streaming multiprocessors (SMs), each with 16KB of L1 cache. The warp scheduling policy is loose round-robin (LRR), which gives a fair share to each warp. The number of registers in the register file of each SM is 32K, and the maximum number of threads per SM is 1536. The simulator parameters are summarized in Table 4.1.

Table 4.1: GPGPU-Sim Configuration for Dynamic Race Detection Scheme Evaluation.
  Number of Cores: 16
  Core Configuration: 32 SIMT lanes, 1.4 GHz, LRR warp scheduler
  Threads/SM: 1536
  L1 Data Cache: 16KB, 4-way assoc., 128B block
  Shared Memory: 48KB
  L2 Unified Cache: 768KB total, 128KB/channel, 8-way assoc., 128B block
  Registers/Core: 32768
  Interconnection Configuration: 2D mesh, 1.4 GHz, 32B channel width
  GDDR Clock: 1.4 GHz
  DRAM Model: FR-FCFS (32 queues/channel), 6 MCs, channel BW = 8B/cycle
  GDDR5 Timing: tCL=12, tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=6

The baseline configuration is a GPU system with a release consistency model. This memory model guarantees neither the propagation of updates during kernel execution nor the order in which updates are observed; it only guarantees that updates are propagated at the end of kernel execution. It reflects the memory consistency model described in Section 4.1. All results reported in the following sections are normalized to the results of the baseline configuration.

4.3.2 Benchmarks

We evaluated our scheme with 30 benchmarks from the Parboil [70], Rodinia [13], and ISPASS-2009 [8] benchmark suites. We also evaluated the benchmarks used by Singh et al. [67], since these benchmarks require a coherent memory system to execute correctly. We further classified the benchmarks into memory-intensive and non-memory-intensive benchmarks based on the percentage of memory instructions executed; memory-intensive benchmarks have 20% or more memory instructions. We evaluated 16 memory-intensive benchmarks and 14 non-memory-intensive benchmarks.

4.3.3 Dynamic Race Detection Scheme Configuration

Since both the RDU and the AVU are signature-based structures, we used Bloom filters [10] as signatures. A Bloom filter is a probabilistic structure used to check element membership in a set. It consists of a vector of m bits, initially set to 0, and k independent hash functions (h_1, h_2, ..., h_k), each with output range 1, ..., m. To add an element e to the filter, the bits at h_1(e), h_2(e), ..., h_k(e) are set to 1. To check the membership of element e, the bits at h_1(e), h_2(e), ..., h_k(e) are checked; if any of them is not set, then element e is not in the set. The structure and operation of the Bloom filter can produce false positives, and we want to keep the false-positive probability low in order to avoid detecting false races. If the number of inserted elements is n, the probability of a false positive is

  P_fp = (1 - (1 - 1/m)^(kn))^k.

To keep the probability of a false positive at about 2%, we set k to 6 and set n to 128, 256, 512, 1024, and 2048 for m equal to 1Kb, 2Kb, 4Kb, 8Kb, and 16Kb, respectively.
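Plugging the chosen configurations into this formula confirms the roughly 2% target. The short C++ sketch below (written only for this illustration) evaluates P_fp for each (m, n) pair with k = 6; since every pair keeps n/m = 1/8, the result is about 0.022 in each case.

    // Evaluates the Bloom-filter false-positive probability for the signature
    // configurations used in the evaluation (k = 6 hash functions).
    #include <cmath>
    #include <cstdio>

    double falsePositiveRate(double m, double n, double k) {
        return std::pow(1.0 - std::pow(1.0 - 1.0 / m, k * n), k);
    }

    int main() {
        const double k = 6.0;
        const double n[] = {128, 256, 512, 1024, 2048};     // elements inserted before forcing a sync point
        const double m[] = {1024, 2048, 4096, 8192, 16384}; // signature sizes: 1Kb .. 16Kb
        for (int i = 0; i < 5; ++i)
            std::printf("m = %5.0f bits, n = %4.0f -> P_fp = %.3f\n",
                        m[i], n[i], falsePositiveRate(m[i], n[i], k));
        return 0;
    }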
Since the RDU and AVU can work independently from the cache system, we simulate different race tracking and detection granularities: byte level, word level (4 bytes), cache-block level (128 bytes), and half-page level (512 bytes). We chose these granularities because the byte is the smallest writable unit and gives the most accurate tracking without aliasing; the word is the unit accessible by a single thread; the cache block is the unit used in the caches and the unit used to transfer data between cache levels; and the half page shows how coarse-grained race tracking affects the performance of the scheme and what the effect of aliasing is.

4.3.4 Performance Evaluation

We first evaluate the ideal performance of the GPU system with the proposed scheme, assuming there is no race detection overhead (i.e., the race check and the race-detection mechanism complete instantaneously). Our estimation shows that a perfect dynamic race detection scheme can be implemented with a performance overhead of around 8% across all evaluated benchmarks. We then evaluated the proposed scheme of Section 4.2.2 without the speculative execution of memory instructions. This configuration shows the maximum performance overhead of the scheme. We found that the performance overhead of this configuration is between 19.6% and 20.7% on average for all benchmarks. We also found that memory-intensive benchmarks suffer a 27.8% to 29.7% performance overhead, compared to at worst only 2.5% under the ideal configuration.

Figure 4.9: Performance of GPU with Dynamic Race Detection Scheme.

Figure 4.9 shows the normalized execution time of the proposed scheme under different configurations and signature sizes. The Ideal configuration represents the implementation with no overhead for the race check and race-detection mechanism; No Spec is the configuration without speculative execution of memory instructions; and Byte-level, Word-level, Block-level, and Half-page-level are the implementations in which races are tracked at byte, word, block, and half-page granularity, respectively. The labels 1K sig, 2K sig, 4K sig, 8K sig, and 16K sig indicate the signature sizes.

The figure shows that as the signature size increases, the performance overhead drops. This is expected behaviour, since a larger signature has a lower false-positive probability and therefore allows more updates to accumulate in the RDU before a synchronization is forced; as discussed in Section 4.3.3, more elements can be inserted into a larger signature before its false-positive probability grows. The figure also shows that tracking and detecting races at cache-block granularity gives the best performance. However, a larger signature does not always guarantee better performance at half-page granularity: a 16Kb signature performs slightly worse than a 2Kb signature. This can be attributed to an increase in race aliasing (i.e., a race being detected between two different addresses that belong to the same half page).

The performance overhead of the proposed scheme with block-level and half-page-level race tracking is lower than the performance overhead of the ideal configuration. This is due to an advantage that the implementation with speculative memory instructions has over the ideal implementation.
In speculative execution, a memory access can be validated across multiple synchronization epochs without re-fetching. To illustrate this, assume a read operation to memory location [F] that triggers a race detection in the RDU, followed by another read operation to memory location [A], where [A] is present in the cache and was fetched in the previous synchronization epoch. In the ideal configuration, the race triggered by accessing location [F] is detected instantaneously and the race-detection mechanism is activated before the next access to memory location [A]; since the synchronization epoch is incremented by one, the access to [A] automatically results in a miss, because the information about the validity of [A] is lost by the race-detection mechanism. With speculative execution of memory instructions, the access to [A] is executed speculatively and then validated when the race-check result for memory location [F] arrives. In the ideal configuration the access results in a miss and a re-fetch of memory location [A], whereas with speculation the access is executed speculatively and then validated, which saves interconnection network bandwidth and improves performance.

Figure 4.9 also shows that tracking races at cache-block granularity provides a better trade-off than either the fine-grained byte-level and word-level tracking or the coarse-grained half-page-level tracking. Fine-grained race tracking is more accurate and eliminates the possibility of aliasing, but it causes the signature to saturate faster, which increases the probability of false-positive race detection and the likelihood of forced synchronization (i.e., triggering the race-detection mechanism in order to avoid an increase in the false-positive probability). Coarse-grained race tracking allows more writes per signature but in turn increases the chance of aliasing: any two memory locations that belong to the same half page can trigger the race-detection mechanism even though there is no actual race.

4.3.5 Race Detection Granularity and Signature Size

We measured the number of synchronization points created due to race detection and the number of forced synchronization points created due to signature saturation. The RDU forces the creation of a synchronization point when the false-positive probability exceeds a certain value (just above 2% in our evaluation). Figure 4.10 shows the percentage of both types of synchronization points for all simulated configurations and all signature sizes. Although byte-level race detection provides the most accurate race detection, it suffers from a huge number of forced synchronization points, reaching 95%. This is because a single write operation can insert up to 128 entries into the signature (the size of a cache block is 128 bytes). Tracking races at block level gives a reasonable trade-off between the two extremes, since forced synchronization points can be limited to 20% of all synchronization points. Even though tracking races at a larger granularity such as the half page introduces the lowest percentage of forced synchronization points, the amount of race aliasing (a race falsely reported because two addresses belong to the same region) is higher: half-page race tracking detects between 8% and 21% more races than block-level race tracking. Using even larger granularities may aggravate the problem of race aliasing.

Figure 4.10: Percentage of Synchronization Points Created by Race Detection and of Synchronization Points Due to Signature Saturation.
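The effect of granularity on signature pressure can be seen with a small calculation. The sketch below (written only for this illustration, using the 128-byte block size and the 16-SM configuration of Table 4.1) counts how many signature entries a single fully coalesced 128-byte store generates at each tracking granularity, and how many RDU insertions it causes when that address set is added to the signatures of the other 15 SMs.

    // Signature insertions caused by one fully coalesced 128-byte store.
    #include <cstdio>

    int main() {
        const int blockBytes = 128;   // cache block size (Table 4.1)
        const int otherSMs   = 15;    // 16 SMs minus the writer
        struct { const char *name; int granBytes; } gran[] = {
            {"byte", 1}, {"word", 4}, {"block", 128}, {"half-page", 512}
        };
        for (auto &g : gran) {
            int entriesPerStore = (blockBytes + g.granBytes - 1) / g.granBytes;  // at least 1
            std::printf("%-9s : %3d entries per store, up to %4d RDU insertions\n",
                        g.name, entriesPerStore, entriesPerStore * otherSMs);
        }
        // byte-level  : 128 entries and up to 1920 RDU insertions per store
        // block-level :   1 entry   and up to   15 RDU insertions per store
        return 0;
    }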
4.3.6 Cache Performance

Figure 4.11: Percentage of Coherence Misses in Private Cache.

We also evaluated the performance of the private caches under the dynamic race detection scheme. Figure 4.11 shows the percentage of additional coherence misses with respect to the total misses in the private caches. We classified these misses into two classes: expired misses and AVU-detected misses. Expired misses are accesses that are declared misses because the cached data was fetched more than two epochs ago, whereas AVU-detected misses are accesses to cached data that was fetched in the previous epoch but failed the AVU validation.

The proposed scheme introduces between 22.1% and 32.1% additional misses across all configurations and signature sizes. Block-level tracking with a 16Kb signature has the lowest rate of coherence misses at 22%, while byte-level tracking with a 1Kb signature introduces the most coherence misses at 32%. Byte-level and word-level tracking have a higher percentage of AVU-detected misses. This is because the signatures in the AVU are almost saturated most of the time, which in turn increases the number of falsely reported races (due to the higher false-positive rate). The cache performance with half-page race tracking is similar to that of block-level tracking; however, the drop in scheme-induced coherence misses is less pronounced than at block level. This can be attributed to race aliasing, because the AVU uses the same signature used by the RDU. The byte-level and word-level configurations suffer from a large number of scheme-induced coherence misses: the frequent insertion of synchronization points due to signature saturation makes the epoch count increase faster, and hence more accesses are declared misses because the cached blocks were fetched more than an epoch ago.

The low percentage of coherence misses induced by the scheme supports the decision to keep a single signature in the AVU. Keeping multiple signatures in the AVU is possible, but it comes with power and area overhead, and the reduction in coherence misses obtained by preserving more signatures does not justify the additional power and area. Moreover, the GPU's ability to hide the cache miss penalty by executing other pending instructions from other threads makes the potential performance gain marginal.

Table 4.2: Access Energy and Leakage Power for RDU, AVU, and Register Renaming Table.
  Signature size:             1Kb     2Kb     4Kb     8Kb     16Kb
  AVU Access Energy (pJ):     0.94    1.88    3.76    7.51    15.01
  AVU Leakage Power (mW):     0.11    0.21    0.43    0.85    1.68
  RDU Read Energy (pJ):       15.12   30.12   60.13   120.13  240.15
  RDU Write Energy (pJ):      15.81   31.51   62.89   125.66  251.20
  RDU Leakage Power (mW):     1.81    3.48    6.83    13.53   26.92
  Register Renaming Table:    Read Energy 1.39 pJ, Write Energy 1.21 pJ, Leakage Power 3.63 mW

4.3.7 Energy Consumption

To estimate the energy overhead of our design, we used GPUWattch [46] to estimate the energy consumed by the GPU and CACTI v6.5 to estimate the energy of the RDU, AVU, and register renaming table, assuming a 40nm technology. The CACTI results are listed in Table 4.2.

Figure 4.12: Energy Consumption Breakdown for Dynamic Race Detection Scheme.

Figure 4.12 shows the energy consumed by each component relative to the energy of the baseline configuration.
The figure shows that the energy consumed by the RDU with byte-level and word-level race tracking is very significant and can be as much as 58% of the baseline configuration's energy. This is because a single write operation can trigger up to 1920 entry insertions in the RDU (15 SMs times 128 bytes written), and a single race-check operation can involve checking up to 128 addresses in the RDU. These numbers decrease dramatically with block-level and half-page-level race tracking, where the RDU energy constitutes less than 1% of the baseline energy. The energy consumed by the register renaming table and the AVU is at most 0.02% and 0.09% of the baseline energy, respectively. The figure also shows that the dynamic race detection scheme can achieve SC with an energy overhead as low as 3% of the baseline, as in the block-level and half-page-level race detection configurations.

4.3.8 G-TSC for Sequential Consistency and the Dynamic Race Detection Scheme

We also compared the implementation of sequential consistency using G-TSC with the proposed dynamic race detection scheme. The relative performance is shown in Figure 4.13. The simulation results show that both schemes achieve relatively similar performance, with a marginal variation of about 2%. The results also suggest that the dynamic race detection scheme reduces the interconnection bandwidth and L2 cache traffic by nearly 5%. The GPU energy consumption with the two schemes is almost equal (within 1% of each other on average). Hence, both designs can implement sequential consistency at very similar cost in terms of performance and power consumption. The main distinction between the two designs is in the additional hardware and its complexity. The G-TSC protocol requires two additional fields per cacheline in the L1 and L2 caches to hold the wts and rts (16 bits each) and a 16-bit field per warp to keep track of the warp ts. The hardware required for the dynamic race detection scheme is much simpler: it requires a 5-bit epoch field per cacheline in the private L1 cache, a 5-bit register per SM to record the current epoch, and two signatures per SM (one in the RDU and one in the AVU). It does not require any modifications to the L2 cache.

Figure 4.13: Performance Comparison Between G-TSC and Dynamic Race Detection Scheme for Sequential Consistency.
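A back-of-the-envelope comparison of the storage added by the two designs can be derived from the cache parameters of Table 4.1. The sketch below is only such an estimate of the bookkeeping state; it says nothing about control complexity, which is the main difference noted above, and the two signature sizes it tries are arbitrary picks from the evaluated range.

    // Rough storage-overhead estimate for G-TSC vs. the dynamic race detection scheme,
    // using the Table 4.1 parameters (16 SMs, 16KB L1, 768KB L2, 128B blocks, 48 warps/SM).
    #include <cstdio>

    int main() {
        const int numSMs = 16, warpsPerSM = 48, blockBytes = 128;
        const int l1Lines = 16 * 1024 / blockBytes;    // 128 lines per SM
        const int l2Lines = 768 * 1024 / blockBytes;   // 6144 lines in total

        // G-TSC: 16-bit wts + 16-bit rts per L1 and L2 line, plus a 16-bit ts per warp.
        long gtscBits = (long)numSMs * l1Lines * 32 + (long)l2Lines * 32
                      + (long)numSMs * warpsPerSM * 16;

        // Dynamic race detection: 5-bit epoch per L1 line, a 5-bit epoch register per SM,
        // and two signatures (RDU + AVU) per SM; nothing is added to the L2 cache.
        const int sigSizes[] = {2048, 8192};           // two of the evaluated signature sizes (bits)
        for (int sigBits : sigSizes) {
            long drdBits = (long)numSMs * l1Lines * 5 + numSMs * 5
                         + (long)numSMs * 2 * sigBits;
            std::printf("sig = %5d bits : G-TSC %.1f KB vs. race detection %.1f KB\n",
                        sigBits, gtscBits / 8192.0, drdBits / 8192.0);
        }
        return 0;
    }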
4.4 Related Work

Previous works [27, 26] on memory consistency models for GPUs conclude that strong consistency models such as sequential consistency achieve performance similar to that of weaker consistency models such as relaxed consistency and TSO; they consider a strict version of the sequential consistency model in their evaluation. In our work, we showed that sequential consistency can be achieved with negligible performance overhead.

Cache coherence and memory models for GPUs have been discussed in multiple prior works. Temporal coherence (TC) [67], proposed by Singh et al., is one of the first coherence protocols for GPUs. TC is a time-based coherence protocol that relies on synchronization counters and self-invalidation to reduce coherence traffic. Each cache block cached in a private cache is assigned a lifetime period (called a lease) and is self-invalidated once the lease expires. TC implements a release consistency model [67]. The introduction of the data-race-free (DRF) [3] and heterogeneous-race-free (HRF) [29] models urged computer architects to adapt these models to GPUs. DeNovo [66] is a hardware-software hybrid protocol originally proposed for CPUs [16]. DeNovo implements two memory models: DRF (DeNovo-D) and HRF (DeNovo-H). Unlike our scheme, DeNovo requires software assistance to declare data and synchronization variables, and it flushes the entire cache, except written data, upon an acquire (a read operation that causes a race). Lazy release consistency for GPUs (hLRC) [5] is a memory model for GPUs that tries to overcome the shortcomings of HRF. hLRC uses a DeNovo-like mechanism to track ownership of synchronization variables, lazily performing coherence actions only when a synchronization variable changes location. QuickRelease (QR) [25] is a memory system framework that uses FIFO queues to enforce memory operation ordering and cache coherence. QR improves performance by using efficient synchronization FIFOs to track outstanding writes, and reduces the cost of write support by partitioning the read- and write-cache resources. QR is based on a relaxed memory consistency model.

Hardware-assisted detection of read-after-write races has been investigated for CPUs. Racer [63] is an efficient self-invalidation framework for implementing the TSO memory model. It uses a runtime approach, similar to the one proposed in this work, to detect read-after-write races and to enforce self-invalidation on the private cache of the racing reader in order to propagate updates. Although Racer implements the TSO memory model, it maintains the relaxed ordering of SC-for-DRF protocols. SigRace [52] is a hardware-assisted data race detection scheme that relies on hardware address signatures and timestamps. Each processor accumulates the accessed addresses in a signature, and at certain points these signatures are sent to a hardware module that intersects them to detect a race. Once a race is detected, all processors involved in the race are forced to roll back to the previous checkpoint and re-execute. The cost-effective order-recording and data race detection mechanism (CORD) [58] is a hardware-assisted race detection mechanism based on timestamp ordering and history buffering; the timestamps of previous conflicting accesses are kept for each memory location cached in a private cache. ReEnact [59] is another hardware-assisted race detection framework that extends the communication monitoring mechanisms of thread-level speculation (TLS) to detect data races, and extends TLS's rollback capabilities to roll back and re-execute code with races. HARD [79] is a hardware implementation of the lockset algorithm that exploits the race detection capability of this algorithm with minimal overhead; it stores lock sets in hardware Bloom filters and uses simple bitwise operations to detect races.

Self-invalidation in private caches has also been explored in the context of cache coherence. Dynamic self-invalidation (DSI) [44] reduces cache coherence overhead and invalidation messages by speculatively identifying which blocks to invalidate when they are brought into the cache, while deferring the actual invalidation to a future time. DSI still requires explicit messages to the directory to acknowledge self-invalidation; it can reduce traffic by using tear-off blocks that are self-invalidated at synchronization instructions. A similar idea is proposed in [64], which extends the tear-off blocks to all cache blocks in order to entirely eliminate coherence directories. Last-Touch Predictors (LTP) [41] trigger speculative self-invalidation of memory blocks in distributed shared memory.
4.5 Conclusion

In this chapter, we proposed a hardware-assisted dynamic race detection scheme for GPUs that implements the DRF model without software assistance. The proposed scheme includes a mechanism to detect races at runtime and to treat the racing read operation as a read-acquire operation in the DRF model. It also includes a simple, cost-efficient mechanism to validate accesses to cached data so that read operations return the most up-to-date data. The proposed scheme is evaluated with a cycle-accurate GPU simulator, showing that it can achieve the DRF memory model with a performance overhead as low as 1%. The proposed scheme is also evaluated with different race tracking and detection granularities.

Reference List

[1] M. Abdel-Majeed and M. Annavaram. Warped Register File: A Power Efficient Register File for GPGPUs. In 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pages 412-423, Feb 2013.
[2] Sarita V. Adve and Kourosh Gharachorloo. Shared Memory Consistency Models: A Tutorial. Computer, 29(12):66-76, 1996.
[3] Sarita V. Adve and Mark D. Hill. Weak Ordering - A New Definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture, ISCA '90, pages 2-14, New York, NY, USA, 1990. ACM.
[4] Jade Alglave, Mark Batty, Alastair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen, and John Wickerson. GPU Concurrency: Weak Behaviours and Programming Assumptions. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '15, pages 577-591, New York, NY, USA, 2015. ACM.
[5] J. Alsop, M. S. Orr, B. M. Beckmann, and D. A. Wood. Lazy Release Consistency for GPUs. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1-14, Oct 2016.
[6] Nasser Anssari. Using Hybrid Shared and Distributed Caching for Mixed-Coherency GPU Workloads. 2013.
[7] J.-L. Baer and W.-H. Wang. On the Inclusion Properties for Multi-level Cache Hierarchies. In Proceedings of the 15th Annual International Symposium on Computer Architecture, ISCA '88, pages 73-80, Los Alamitos, CA, USA, 1988. IEEE Computer Society Press.
[8] Ali Bakhoda, George L. Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In Performance Analysis of Systems and Software, 2009. ISPASS 2009. IEEE International Symposium on, pages 163-174. IEEE, 2009.
[9] R. Behrends, L. K. Dillon, S. D. Fleming, and R. E. K. Stirewalt. AMD Graphics Cores Next Architecture, Generation 3. Technical report, Advanced Micro Devices Inc., August 2016.
[10] Burton H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. Commun. ACM, 13(7):422-426, July 1970.
[11] Martin Burtscher and Keshav Pingali. An Efficient CUDA Implementation of the Tree-Based Barnes Hut N-Body Algorithm. GPU Computing Gems Emerald Edition, page 75, 2011.
[12] Daniel Cederman and Philippas Tsigas. On Dynamic Load Balancing on Graphics Processors. In Proceedings of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pages 57-64. Eurographics Association, 2008.
[13] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A Benchmark Suite for Heterogeneous Computing. In Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on, pages 44-54. IEEE, 2009.
[14] Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, and Wen-Mei Hwu.
Adaptive cache management for energy-ecient GPU computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 343{355. IEEE Computer Society, 2014. [15] Xuhao Chen, Shengzhao Wu, Li-Wen Chang, Wei-Sheng Huang, Carl Pearson, Zhiying Wang, and Wen-Mei W Hwu. Adaptive cache bypass and insertion for many-core accelerators. In Proceedings of International Workshop on Manycore Embedded Systems, page 1. ACM, 2014. [16] B. Choi, R. Komuravelli, H. Sung, R. Smolinski, N. Honarmand, S. V. Adve, V. S. Adve, N. P. Carter, and C. T. Chou. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism. In 2011 International Conference on Parallel Architectures and Compilation Techniques, pages 155{166, Oct 2011. [17] AMD Corporation. AMD GRAPHICS CORES NEXT (GCN) ARCHITECTURE White paper. Technical report, AMD Corporation, 2012. [18] NVIDIA Corporation. NVIDIA CUDA C Programming Guide v8.0. 2016. [19] Bronis R. de Supinski, Craig Williams, and Paul F. Reynolds, Jr. Performance Evalua- tion of the Late Delta Cache Coherence Protocol. Technical report, University of Virginia, Charlottesville, VA, USA, 1996. [20] Marco Elver and Vijay Nagarajan. TSO-CC: Consistency Directed Cache Coherence For TSO. In 2014 IEEE 20th International Symposium on High Performance Computer Archi- tecture (HPCA), pages 165{176. IEEE, 2014. [21] John Feehrer, Paul Rotker, Milton Shih, Paul Gingras, Peter Yakutis, Stephen Phillips, and John Heath. Coherency Hub Design For Multisocket SUN Servers With Coolthreads Technology. IEEE Micro, 29(4):36{47, 2009. [22] Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron. Energy-ecient mechanisms for managing thread context in throughput processors. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 235{246, New York, NY, USA, 2011. ACM. [23] Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron. Energy-ecient Mechanisms for Managing Thread Context in Throughput Processors. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 235{246, New York, NY, USA, 2011. ACM. [24] Pawan Harish and PJ Narayanan. Accelerating Large Graph Algorithms on The GPU Us- ing CUDA. In International Conference on High-Performance Computing, pages 197{208. Springer, 2007. [25] B. A. Hechtman, S. Che, D. R. Hower, Y. Tian, B. M. Beckmann, M. D. Hill, S. K. Reinhardt, and D. A. Wood. QuickRelease: A Throughput-oriented Approach to Release Consistency on GPUs. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 189{200, Feb 2014. 114 [26] B. A. Hechtman and D. J. Sorin. Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 118{119, April 2013. [27] Blake A. Hechtman and Daniel J. Sorin. Exploring Memory Consistency for Massively- threaded Throughput-oriented Processors. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 201{212, New York, NY, USA, 2013. ACM. [28] Tayler H Hetherington, Timothy G Rogers, Lisa Hsu, Mike O'Connor, and Tor M Aamodt. Characterizing And Evaluating A Key-Value Store Application on Heterogeneous CPU-GPU Systems. 
In Performance Analysis of Systems and Software (ISPASS), 2012 IEEE Interna- tional Symposium on, pages 88{98. IEEE, 2012. [29] Derek R Hower, Blake A Hechtman, Bradford M Beckmann, Benedict R Gaster, Mark D Hill, Steven K Reinhardt, and David A Wood. Heterogeneous-Race-Free Memory Models. ACM SIGPLAN Notices, 49(4):427{440, 2014. [30] Advanced Micro Devices Inc. Press Release: AMD Delivers Enthusiast Performance Leader- ship with the Introduction of the ATI Radeon HD 3870 X2, 2008. [31] Hyeran Jeon, Gokul Subramanian Ravi, Nam Sung Kim, and Murali Annavaram. GPU Register File Virtualization. In Proceedings of the 48th International Symposium on Mi- croarchitecture, MICRO-48, pages 420{432, New York, NY, USA, 2015. ACM. [32] Wenhao Jia, Kelly Shaw, Margaret Martonosi, et al. MRPB: Memory Request Prioritization For Massively Parallel Processors. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 272{283. IEEE, 2014. [33] Naifeng Jing, Yao Shen, Yao Lu, Shrikanth Ganapathy, Zhigang Mao, Minyi Guo, Ramon Canal, and Xiaoyao Liang. An Energy-ecient and Scalable eDRAM-based Register File Architecture for GPGPU. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pages 344{355, New York, NY, USA, 2013. ACM. [34] Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K Mishra, Mah- mut T Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R Das. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. ACM SIGARCH Computer Architecture News, 41(1):395{406, 2013. [35] Onur Kayran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. Neither more nor less: optimizing thread-level parallelism for gpgpus. In Proceedings of the 22nd international conference on Parallel architectures and compilation techniques, pages 157{166. IEEE Press, 2013. [36] John H Kelm, Daniel R Johnson, William Tuohy, Steven S Lumetta, and Sanjay J Patel. Cohesion: A Hybrid Memory Model For Accelerators. In ACM SIGARCH Computer Archi- tecture News, volume 38, pages 429{440. ACM, 2010. [37] Mohammad Mahdi Keshtegar, Hajar Falahati, and Shaahin Hessabi. Cluster-based Approach for Improving Graphics Processing Unit Performance by Inter Streaming Multiprocessors Locality. IET Computers & Digital Techniques, 9(5):275{282, 2015. [38] Mahmoud Khairy, Mohamed Zahran, and Amr G Wassal. Ecient Utilization of GPGPU Cache Hierarchy. In Proceedings of the 8th Workshop on General Purpose Processing Using GPUs, pages 36{47. ACM, 2015. 115 [39] K. Kim, S. Lee, M. K. Yoon, G. Koo, W. W. Ro, and M. Annavaram. Warped-Preexecution: A GPU Pre-Execution Approach for Improving Latency Hiding. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 163{175, March 2016. [40] Gunjae Koo, Hyeran Jeon, and Murali Annavaram. Revealing Critical Loads and Hidden Data Locality in GPGPU Applications. In Workload Characterization (IISWC), 2015 IEEE International Symposium on, pages 120{129. IEEE, 2015. [41] An-Chow Lai and Babak Falsa. Selective, Accurate, And Timely Self-Invalidation Using Last-Touch Prediction. In Computer Architecture, 2000. Proceedings of the 27th International Symposium on, pages 139{148. IEEE, 2000. [42] Leslie Lamport. Time, Clocks, and The Ordering of Events in a Distributed System. Com- munications of the ACM, 21(7):558{565, 1978. [43] Leslie Lamport. How to Make a Multiprocessor Computer That Correctly Executes Multi- process Programs. 
IEEE transactions on computers, 100(9):690{691, 1979. [44] Alvin R Lebeck and David A Wood. Dynamic Self-Invalidation: Reducing Coherence Over- head in Shared-Memory Multiprocessors. In ACM SIGARCH Computer Architecture News, volume 23, pages 48{59. ACM, 1995. [45] Minseok Lee, Seokwoo Song, Joosik Moon, Jung-Ho Kim, Woong Seo, Yeongon Cho, and Soojung Ryu. Improving GPGPU Resource Utilization Through Alternative Thread Block Scheduling. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th Inter- national Symposium on, pages 260{271. IEEE, 2014. [46] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M Aamodt, and Vijay Janapa Reddi. GPUWattch: Enabling Energy Optimizations in GPG- PUs. ACM SIGARCH Computer Architecture News, 41(3):487{498, 2013. [47] Dong Li et al. Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors. PhD thesis, University of Texas at Austin, 2014. [48] Milo MK Martin, Daniel J Sorin, Anatassia Ailamaki, Alaa R Alameldeen, Ross M Dickson, Carl J Mauer, Kevin E Moore, Manoj Plakal, Mark D Hill, and David A Wood. Timestamp Snooping: An Approach for Extending SMPs. In ACM SIGARCH Computer Architecture News, volume 28, pages 25{36. ACM, 2000. [49] Milo MK Martin, Daniel J Sorin, Bradford M Beckmann, Michael R Marty, Min Xu, Alaa R Alameldeen, Kevin E Moore, Mark D Hill, and David A Wood. Multifacet's General Execution-Driven Multiprocessor Simulator (GEMS) Toolset. ACM SIGARCH Computer Architecture News, 33(4):92{99, 2005. [50] Jaikrishnan Menon, Marc De Kruijf, and Karthikeyan Sankaralingam. iGPU: Exception Support and Speculative Execution on GPUs. In Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, pages 72{83, Washington, DC, USA, 2012. IEEE Computer Society. [51] Sang Lyul Min, J-L Baer, and MMM Mn. A timestamp-Based Cache Coherence Scheme. 1989. 116 [52] Abdullah Muzahid, Dario Su arez, Shanxiang Qi, and Josep Torrellas. SigRace: Signature- based Data Race Detection. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pages 337{348, New York, NY, USA, 2009. ACM. [53] SK Nandy and Ranjani Narayan. An Incessantly Coherent Cache Scheme for Shared Memory Multithreaded Systems. In International Workshop on Parallel Processing. Citeseer, 1994. [54] Robert Netzer Netzer and Barton P. Miller. Detecting Data Races in Parallel Program Execu- tions. In In Advances in Languages and Compilers for Parallel Computing, 1990 Workshop, pages 109{129. MIT Press, 1989. [55] Fermi NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture. NVIDIA, Santa Clara, Calif, USA, 2009. [56] Kepler NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture. NVIDIA, Santa Clara, Calif, USA, 2012. [57] Khronos group. OpenCL. https://www.khronos.org/opencl/. [58] M. Prvulovic. CORD: Cost-Eective (and nearly overhead-free) Order-Recording and Data Race Detection. In The Twelfth International Symposium on High-Performance Computer Architecture, 2006., pages 232{243, Feb 2006. [59] M. Prvulovic and J. Torrellas. ReEnact: Using Thread-Level Speculation Mechanisms to Debug Data Races in Multithreaded Codes. In 30th Annual International Symposium on Computer Architecture, 2003. Proceedings., pages 110{121, June 2003. [60] Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. A locality-aware memory hi- erarchy for energy-ecient GPU architectures. 
In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pages 86{98. ACM, 2013. [61] Timothy G Rogers, Mike O'Connor, and Tor M Aamodt. Cache-conscious wavefront schedul- ing. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Mi- croarchitecture, pages 72{83. IEEE Computer Society, 2012. [62] Timothy G Rogers, Mike O'Connor, and Tor M Aamodt. Divergence-Aware Warp Schedul- ing. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchi- tecture, pages 99{110. ACM, 2013. [63] A. Ros and S. Kaxiras. Racer: TSO consistency via race detection. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1{13, Oct 2016. [64] Alberto Ros and Stefanos Kaxiras. Complexity-Eective Multicore Coherence. In Proceed- ings of the 21st international conference on Parallel architectures and compilation techniques, pages 241{252. ACM, 2012. [65] Keun Sup Shim, Myong Hyon Cho, Mieszko Lis, Omer Khan, and Srinivas Devadas. Library Cache Coherence. Technical report, MIT, 2011. [66] Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. Ecient GPU Synchronization Without Scopes: Saying No to Complex Consistency Models. In Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, pages 647{659, New York, NY, USA, 2015. ACM. 117 [67] Inderpreet Singh, Arrvindh Shriraman, Wilson WL Fung, Mike O'Connor, and Tor M Aamodt. Cache Coherence For GPU Architectures. In High Performance Computer Archi- tecture (HPCA2013), 2013 IEEE 19th International Symposium on, pages 578{590. IEEE, 2013. [68] Tyler Sorensen, Ganesh Gopalakrishnan, and Vinod Grover. Towards Shared Memory Con- sistency Models for GPUs. In Proceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS '13, pages 489{490, New York, NY, USA, 2013. ACM. [69] Daniel J Sorin, Mark D Hill, and David A Wood. A Primer on Memory Consistency And Cache Coherence. Synthesis Lectures on Computer Architecture, 6(3):1{212, 2011. [70] John A Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-Mei W Hwu. Parboil: A revised benchmark suite for scientic and commercial throughput computing. Center for Reliable and High-Performance Computing, 2012. [71] David Tarjan and Kevin Skadron. The Sharing Tracker: Using Ideas from Sache Coherence Hardware to Reduce O-Chip Memory Trac with Non-Coherent Caches. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Network- ing, Storage and Analysis, pages 1{10. IEEE Computer Society, 2010. [72] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous Multithreading: Max- imizing On-chip Parallelism. In Proceedings of the 22Nd Annual International Symposium on Computer Architecture, ISCA '95, pages 392{403, New York, NY, USA, 1995. ACM. [73] Vibhav Vineet and PJ Narayanan. CUDA cuts: Fast Graph Cuts on The GPU. In Com- puter Vision and Pattern Recognition Workshops, 2008. CVPRW'08. IEEE Computer Society Conference on, pages 1{8. IEEE, 2008. [74] Christoph von Praun, Harold W. Cain, Jong-Deok Choi, and Kyung Dong Ryu. Conditional Memory Ordering. In Proceedings of the 33rd Annual International Symposium on Computer Architecture, ISCA '06, pages 41{52, Washington, DC, USA, 2006. IEEE Computer Society. [75] Yuan Yao, Guanhua Wang, Zhiguo Ge, Tulika Mitra, Wenzhi Chen, and Naxin Zhang. Ef- cient Timestamp-Based Cache Coherence Protocol for Many-Core Architectures. 
In Proceedings of the 2016 International Conference on Supercomputing, ICS '16, pages 19:1-19:13, New York, NY, USA, 2016. ACM.
[76] X. Yu, H. Liu, E. Zou, and S. Devadas. Tardis 2.0: Optimized Time Traveling Coherence for Relaxed Consistency Models. In 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT), pages 261-274, Sept 2016.
[77] Xiangyao Yu and Srinivas Devadas. Tardis: Time Traveling Coherence Algorithm for Distributed Shared Memory. In 2015 International Conference on Parallel Architecture and Compilation (PACT), pages 227-240. IEEE, 2015.
[78] Stefano Zanella, Alessandra Nardi, Andrea Neviani, Michele Quarantelli, Sharad Saxena, and Carlo Guardiani. Analysis of the Impact of Process Variations on Clock Skew. IEEE Transactions on Semiconductor Manufacturing, 13(4):401-407, 2000.
[79] P. Zhou, R. Teodorescu, and Y. Zhou. HARD: Hardware-Assisted Lockset-based Race Detection. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 121-132, Feb 2007.

Appendix A

Detailed Results of the Dynamic Race Detection Scheme Evaluation

In this appendix, we attach the detailed evaluation numbers for each benchmark used in the evaluation of the dynamic race detection scheme for sequential consistency presented in Chapter 4. The first set of tables (Tables A.1, A.2, and A.3) shows the execution time of the baseline configuration (in cycles) and the relative execution time of all the GPU configurations with different signature sizes. Tables A.4 through A.9 present the percentage of synchronization points created due to signature saturation (forced) and of synchronization points created by the RDU due to race detection on a store operation (RDU). Tables A.10 through A.15 show the percentage of scheme-induced misses relative to the overall misses in the private L1 cache. These misses are classified into expired misses (expired), which occur when the accessed cacheline was fetched more than two epochs ago and there is no information about the validity of the cached data, and AVU misses, which occur because of a failed AVU validation. The last set of tables (Tables A.16 through A.27) shows the energy consumption per component relative to the energy consumed by the baseline GPU. The energy estimates for the RDU, AVU, and register renaming table include the dynamic access energy (per-access energy) and the static leakage power.
119 Table A.1: Individual Benchmarks Normalized Execution Time (in Cycles) (1) Benchmark Baseline Ideal No Spec 1K sig 2K sig 4K sig 8K sig 16K sig 1K sig 2K sig 4K sig 8K sig 16K sig Backprop 356342 1.194 1.180 1.188 1.188 1.188 1.191 1.188 1.178 1.181 1.181 Barneshut 467573 1.146 1.118 1.110 1.135 1.136 1.242 1.241 1.246 1.251 1.245 BFS-ISPASS 752854 1.159 1.155 1.155 1.153 1.155 1.223 1.227 1.221 1.219 1.221 BFS-Rodinia 9350622 1.097 1.098 1.095 1.095 1.096 1.183 1.181 1.183 1.180 1.181 Btree 2020506 1.261 1.261 1.261 1.261 1.261 1.523 1.523 1.523 1.523 1.523 CP 282073 1.000 1.000 1.000 1.000 1.000 1.050 1.050 1.050 1.050 1.050 CudaCuts 1256035 1.206 1.207 1.197 1.197 1.197 1.333 1.309 1.345 1.123 1.123 CutCP 14136576 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 Gaussian 50766292 1.128 1.128 1.127 1.127 1.127 1.132 1.132 1.132 1.132 1.132 Heartwall 22439908 1.003 1.003 0.997 0.998 1.004 1.011 1.005 1.009 1.006 1.005 Hotspot 213309 1.005 0.998 0.998 0.998 0.998 1.006 1.006 1.006 1.006 1.006 KMeans 24835682 1.001 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 1.002 LBM 20046029 0.993 0.993 0.993 0.993 0.993 0.983 0.983 0.983 0.983 0.983 Leukocyte 58814402 1.000 1.000 1.000 1.000 1.000 1.002 1.002 1.002 1.002 1.002 LIB 5114426 1.013 1.012 1.013 1.012 1.012 1.381 1.382 1.383 1.383 1.383 LUD 2514711 1.014 1.013 1.012 1.011 1.011 1.047 1.045 1.044 1.045 1.045 MRI-Q 20510309 1.007 1.007 1.007 1.007 1.007 1.015 1.015 1.015 1.015 1.015 MUM 2258780 1.173 1.177 1.178 1.178 1.178 1.198 1.192 1.192 1.192 1.192 NN 2219648 1.018 1.018 1.018 1.018 1.018 1.056 1.055 1.056 1.056 1.056 NW 5311386 1.000 1.000 1.000 1.000 1.000 1.134 1.134 1.134 1.134 1.134 OctreePart 992841 1.341 1.330 1.341 1.330 1.330 1.565 1.540 1.552 1.552 1.544 Pathnder 863169 1.003 0.993 0.997 0.997 0.997 1.008 1.002 1.007 1.007 1.007 SAD 54968763 1.002 1.002 1.002 1.002 1.002 1.074 1.074 1.074 1.074 1.074 SGEMM 6963916 1.000 1.001 1.000 1.000 1.000 1.010 1.010 1.010 1.010 1.010 SRAD 6422507 1.040 1.027 1.018 1.009 1.010 1.040 1.024 1.016 1.011 1.008 Stencil 7982116 0.994 0.990 0.998 0.997 0.997 2.262 2.264 2.264 2.264 2.264 Streamcluster 174624808 1.054 1.046 1.052 1.049 1.052 1.314 1.308 1.314 1.306 1.310 VPR 826265 1.313 1.319 1.327 1.321 1.321 1.387 1.397 1.397 1.391 1.396 Wave 1062129 1.318 1.274 1.252 1.250 1.252 1.644 1.628 1.625 1.607 1.626 WP 5437920 1.009 1.013 1.004 1.030 1.032 1.088 1.095 1.068 1.078 1.081 Average 1.083 1.079 1.078 1.079 1.079 1.204 1.201 1.201 1.193 1.193 120 Table A.2: Individual Benchmarks Normalized Execution Time (in Cycles) (2) Benchmark Baseline Byte-Level Word-Level 1K sig 2K sig 4K sig 8K sig 16K sig 1K sig 2K sig 4K sig 8K sig 16K sig Backprop 356342 1.185 1.174 1.174 1.170 1.167 1.176 1.172 1.168 1.171 1.164 Barneshut 467573 1.226 1.185 1.174 1.164 1.174 1.170 1.203 1.212 1.174 1.165 BFS-ISPASS 752854 1.160 1.162 1.159 1.166 1.149 1.164 1.155 1.146 1.146 1.147 BFS-Rodinia 9350622 1.100 1.099 1.098 1.098 1.091 1.098 1.094 1.093 1.089 1.085 Btree 2020506 1.389 1.324 1.252 1.204 1.190 1.229 1.193 1.188 1.184 1.190 CP 282073 1.049 1.049 1.145 0.949 0.950 1.145 1.145 0.997 0.950 1.097 CudaCuts 1256035 1.240 1.259 1.236 1.234 1.208 1.240 1.230 1.227 1.470 1.015 CutCP 14136576 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 Gaussian 50766292 1.110 1.107 1.105 1.101 1.096 1.106 1.102 1.097 1.089 1.081 Heartwall 22439908 1.163 1.086 1.051 1.030 1.020 1.052 1.032 1.015 1.009 1.000 Hotspot 213309 1.011 0.999 1.000 1.001 1.002 1.003 1.007 1.009 1.002 1.006 KMeans 24835682 1.003 
1.003 1.001 1.001 1.002 1.002 1.000 1.001 1.002 1.003 LBM 20046029 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993 Leukocyte 58814402 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 LIB 5114426 1.012 1.012 1.013 1.013 1.013 1.014 1.012 1.014 1.012 1.012 LUD 2514711 1.017 1.016 1.016 1.017 1.016 1.019 1.018 1.016 1.015 1.013 MRI-Q 20510309 1.007 1.007 1.007 1.007 1.007 1.007 1.007 1.007 1.007 1.007 MUM 2258780 1.182 1.174 1.181 1.185 1.187 1.185 1.183 1.182 1.178 1.178 NN 2219648 1.135 1.118 1.098 1.080 1.054 1.092 1.069 1.049 1.030 1.018 NW 5311386 0.998 0.998 0.999 0.999 1.000 0.999 0.999 0.998 1.000 0.999 OctreePart 992841 1.365 1.335 1.319 1.310 1.301 1.333 1.330 1.298 1.285 1.273 Pathnder 863169 0.999 0.991 0.992 0.998 0.994 0.996 0.995 1.005 0.994 1.005 SAD 54968763 1.002 1.002 1.002 1.002 1.002 1.002 1.003 1.003 1.003 1.002 SGEMM 6963916 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 SRAD 6422507 1.048 1.047 1.050 1.040 1.049 1.048 1.046 1.045 1.052 1.043 Stencil 7982116 0.988 0.993 0.995 0.988 0.995 0.994 0.994 0.989 0.993 0.994 Streamcluster 174624808 1.076 1.074 1.071 1.073 0.837 1.076 1.068 1.067 1.058 1.053 VPR 826265 1.625 1.573 1.500 1.395 1.282 1.499 1.350 1.251 1.201 1.187 Wave 1062129 1.887 1.877 1.789 1.715 1.623 1.799 1.706 1.607 1.472 1.345 WP 5437920 0.996 0.995 0.975 0.982 0.972 1.005 1.013 0.990 0.998 0.989 Average 1.132 1.122 1.113 1.097 1.079 1.115 1.104 1.089 1.086 1.069 121 Table A.3: Individual Benchmarks Normalized Execution Time (in Cycles) (3) Benchmark Baseline Block-Level Half-Page-Level 1K sig 2K sig 4K sig 8K sig 16K sig 1K sig 2K sig 4K sig 8K sig 16K sig Backprop 356342 1.166 1.145 1.116 1.094 1.096 1.130 1.099 1.093 1.094 1.094 Barneshut 467573 1.121 1.087 1.108 1.110 1.118 1.154 1.134 1.139 1.139 1.139 BFS-ISPASS 752854 1.145 1.154 1.133 1.142 1.143 1.152 1.136 1.140 1.137 1.137 BFS-Rodinia 9350622 1.087 1.086 1.085 1.085 1.084 1.085 1.084 1.085 1.085 1.084 Btree 2020506 1.278 1.278 1.278 1.278 1.278 1.281 1.281 1.281 1.281 1.281 CP 282073 1.047 1.145 0.950 0.904 0.904 1.146 0.997 0.997 0.997 0.997 CudaCuts 1256035 1.188 1.000 1.013 1.186 1.186 1.194 1.188 1.186 1.186 1.186 CutCP 14136576 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 Gaussian 50766292 1.085 1.079 1.078 1.079 1.079 1.083 1.079 1.079 1.079 1.079 Heartwall 22439908 1.002 1.002 1.002 1.003 1.002 1.002 1.003 1.002 1.001 1.003 Hotspot 213309 1.002 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 0.997 KMeans 24835682 1.000 1.001 1.001 1.001 1.001 1.004 1.001 1.001 1.001 1.001 LBM 20046029 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993 0.993 Leukocyte 58814402 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 LIB 5114426 1.011 1.011 1.011 1.011 1.011 1.011 1.012 1.011 1.011 1.011 LUD 2514711 1.012 1.012 1.011 1.011 1.010 1.011 1.012 1.011 1.011 1.011 MRI-Q 20510309 1.007 1.007 1.007 1.007 1.007 1.007 1.007 1.007 1.007 1.007 MUM 2258780 1.174 1.176 1.179 1.172 1.172 1.176 1.168 1.172 1.172 1.172 NN 2219648 1.009 1.009 1.007 1.007 1.007 1.008 1.010 1.010 1.010 1.010 NW 5311386 0.999 1.000 1.000 1.000 1.000 0.999 1.000 1.000 1.000 1.000 OctreePart 992841 1.271 1.269 1.262 1.286 1.286 1.284 1.253 1.253 1.253 1.253 Pathnder 863169 1.001 1.004 0.992 0.992 0.992 1.000 0.992 0.992 0.992 0.992 SAD 54968763 1.002 1.002 1.002 1.003 1.003 1.002 1.002 1.003 1.003 1.003 SGEMM 6963916 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 SRAD 6422507 1.032 1.021 1.014 1.018 1.003 1.027 1.012 1.012 1.018 1.017 Stencil 7982116 0.990 
0.988 0.988 0.991 0.991 0.993 0.992 0.991 0.991 0.991 Streamcluster 174624808 1.041 1.035 1.038 1.038 1.039 0.818 1.036 1.035 1.039 1.035 VPR 826265 1.189 1.181 1.162 1.182 1.183 1.185 1.195 1.171 1.182 1.188 Wave 1062129 1.239 1.206 1.190 1.182 1.185 1.217 1.209 1.217 1.223 1.223 WP 5437920 1.015 1.001 0.987 0.982 0.989 0.976 0.991 1.006 1.006 1.005 Average 1.070 1.063 1.053 1.058 1.059 1.064 1.063 1.063 1.064 1.064 122 Table A.4: Synchronization Points Classication Per Benchmark (Ideal Conguration) Benchmark Ideal 1K sig 2K sig 4K sig 8K sig 16K sig Forced RDU Forced RDU Forced RDU Forced RDU Forced RDU Backprop 90% 10% 72% 28% 33% 67% 0% 100% 0% 100% Barneshut 33% 67% 5% 95% 0% 100% 0% 100% 0% 100% BFS-ISPASS 83% 17% 55% 45% 10% 90% 0% 100% 0% 100% BFS-Rodinia 86% 14% 67% 33% 27% 73% 0% 100% 0% 100% Btree 0% 100% 0% 100% 0% 100% 0% 100% 0% 100% CP 97% 3% 97% 3% 50% 50% { { { { CudaCuts 73% 27% 30% 70% 0% 100% 0% 100% 0% 100% CutCP 80% 20% 60% 40% 29% 71% 0% 100% { { Gaussian 41% 59% 3% 97% 0% 100% 0% 100% 0% 100% Heartwall 93% 7% 87% 13% 73% 27% 33% 67% 32% 68% Hotspot 83% 17% 0% 100% { { { { { { KMeans 47% 53% 36% 64% 13% 87% 0% 100% { { LBM 95% 5% 89% 11% 80% 20% 65% 35% 44% 56% Leukocyte 99% 1% 100% 0% 99% 1% 100% 0% 100% 0% LIB 98% 2% 88% 12% 72% 28% 53% 47% 0% 100% LUD 57% 43% 25% 75% 4% 96% 0% 100% 0% 100% MRI-Q 94% 6% 96% 4% 100% 0% 100% 0% { { MUM 74% 26% 42% 58% 0% 100% { { { { NN 42% 58% 50% 50% { { { { { { NW 100% 0% 99% 1% 99% 1% 98% 2% 82% 18% OctreePart 4% 96% 0% 100% 0% 100% 0% 100% 0% 100% Pathnder 69% 31% 44% 56% 0% 100% 0% 100% { { SAD 94% 6% 92% 8% 95% 5% 99% 1% 83% 17% SGEMM 91% 9% 23% 77% 8% 92% 0% 100% { { SRAD 84% 16% 77% 23% 68% 32% 65% 35% 67% 33% Stencil 80% 20% 71% 29% 0% 100% { { { { Streamcluster 83% 17% 69% 31% 44% 56% 22% 78% 8% 92% VPR 57% 43% 32% 68% 0% 100% 0% 100% 0% 100% Wave 48% 52% 24% 76% 10% 90% 5% 95% 0% 100% WP 86% 14% 64% 36% 20% 80% 1% 99% 0% 100% Average 72% 28% 53% 47% 33% 67% 26% 74% 21% 79% 123 Table A.5: Synchronization Points Classication Per Benchmark (No Speculation Conguration) Benchmark No Spec 1K sig 2K sig 4K sig 8K sig 16K sig Forced RDU Forced RDU Forced RDU Forced RDU Forced RDU Backprop 88% 12% 68% 32% 27% 73% 0% 100% 0% 100% Barneshut 30% 70% 4% 96% 0% 100% 0% 100% 0% 100% BFS-ISPASS 84% 16% 58% 42% 7% 93% 0% 100% 0% 100% BFS-Rodinia 85% 15% 67% 33% 35% 65% 2% 98% 0% 100% Btree 0% 100% 0% 100% 0% 100% 0% 100% 0% 100% CP 96% 4% 93% 7% 43% 57% { { { { CudaCuts 72% 28% 26% 74% 0% 100% 0% 100% 0% 100% CutCP 79% 21% 60% 40% 30% 70% 0% 100% 0% 100% Gaussian 59% 41% 6% 94% 0% 100% 0% 100% 0% 100% Heartwall 93% 7% 88% 12% 70% 30% 27% 73% 29% 71% Hotspot 83% 17% 29% 71% { { { { { { KMeans 47% 53% 37% 63% 19% 81% 0% 100% { { LBM 93% 7% 87% 13% 79% 21% 67% 33% 52% 48% Leukocyte 99% 1% 99% 1% 100% 0% 100% 0% 100% 0% LIB 71% 29% 68% 32% 48% 52% 20% 80% 0% 100% LUD 58% 42% 28% 72% 3% 97% 0% 100% 0% 100% MRI-Q 90% 10% 95% 5% 99% 1% 100% 0% { { MUM 79% 21% 35% 65% 17% 83% { { { { NN 33% 67% 33% 67% { { { { { { NW 100% 0% 99% 1% 90% 10% 30% 70% 0% 100% OctreePart 3% 97% 0% 100% 0% 100% 0% 100% 0% 100% Pathnder 71% 29% 34% 66% 0% 100% 0% 100% { { SAD 94% 6% 93% 7% 98% 2% 100% 0% 100% 0% SGEMM 88% 12% 40% 60% 35% 65% 50% 50% { { SRAD 83% 17% 73% 27% 64% 36% 67% 33% 79% 21% Stencil 46% 54% 0% 100% { { { { { { Streamcluster 76% 24% 59% 41% 39% 61% 23% 77% 14% 86% VPR 51% 49% 32% 68% 0% 100% 0% 100% 0% 100% Wave 50% 50% 28% 72% 17% 83% 8% 92% 0% 100% WP 81% 19% 50% 50% 9% 91% 0% 100% 0% 100% Average 69% 31% 50% 50% 34% 66% 24% 76% 18% 82% 124 Table 
A.6: Synchronization Points Classication Per Benchmark (Byte-Level Conguration) Benchmark Byte-Level 1K sig 2K sig 4K sig 8K sig 16K sig Forced RDU Forced RDU Forced RDU Forced RDU Forced RDU Backprop 99% 1% 92% 8% 86% 14% 76% 24% 65% 35% Barneshut 97% 3% 91% 9% 82% 18% 69% 31% 45% 55% BFS-ISPASS 95% 5% 88% 12% 76% 24% 58% 42% 35% 65% BFS-Rodinia 94% 6% 86% 14% 72% 28% 53% 47% 30% 70% Btree 65% 35% 33% 67% 1% 99% 0% 100% 0% 100% CP 99% 1% 95% 5% 94% 6% 86% 14% 79% 21% CudaCuts 96% 4% 88% 12% 78% 22% 68% 32% 53% 47% CutCP 99% 1% 97% 3% 94% 6% 92% 8% 94% 6% Gaussian 98% 2% 95% 5% 87% 13% 73% 27% 51% 49% Heartwall 99% 1% 90% 10% 79% 21% 65% 35% 48% 52% Hotspot 97% 3% 90% 10% 77% 23% 63% 37% 49% 51% KMeans 100% 0% 95% 5% 79% 21% 56% 44% 31% 69% LBM 100% 0% 98% 2% 93% 7% 86% 14% 74% 26% Leukocyte 100% 0% 99% 1% 99% 1% 100% 0% 99% 1% LIB 100% 0% 99% 1% 93% 7% 84% 16% 72% 28% LUD 99% 1% 93% 7% 83% 17% 70% 30% 52% 48% MRI-Q 100% 0% 94% 6% 75% 25% 57% 43% 44% 56% MUM 99% 1% 93% 7% 84% 16% 74% 26% 62% 38% NN 76% 24% 37% 63% 18% 82% 6% 94% 1% 99% NW 100% 0% 100% 0% 100% 0% 100% 0% 100% 0% OctreePart 93% 7% 86% 14% 82% 18% 72% 28% 56% 44% Pathnder 98% 2% 89% 11% 85% 15% 72% 28% 56% 44% SAD 100% 0% 98% 2% 94% 6% 87% 13% 76% 24% SGEMM 100% 0% 98% 2% 93% 7% 83% 17% 73% 27% SRAD 99% 1% 94% 6% 85% 15% 68% 32% 50% 50% Stencil 100% 0% 94% 6% 75% 25% 54% 46% 35% 65% Streamcluster 99% 1% 95% 5% 84% 16% 71% 29% 48% 52% VPR 73% 27% 49% 51% 29% 71% 16% 84% 9% 91% Wave 95% 5% 81% 19% 56% 44% 30% 70% 9% 91% WP 98% 2% 93% 7% 87% 13% 81% 19% 71% 29% Average 96% 4% 88% 12% 77% 23% 66% 34% 52% 48% 125 Table A.7: Synchronization Points Classication Per Benchmark (Word-Level Conguration) Benchmark Word-Level 1K sig 2K sig 4K sig 8K sig 16K sig Forced RDU Forced RDU Forced RDU Forced RDU Forced RDU Backprop 94% 6% 89% 11% 80% 20% 69% 31% 53% 47% Barneshut 93% 7% 84% 16% 69% 31% 50% 50% 20% 80% BFS-ISPASS 90% 10% 80% 20% 63% 37% 39% 61% 17% 83% BFS-Rodinia 88% 12% 78% 22% 60% 40% 35% 65% 13% 87% Btree 4% 96% 0% 100% 0% 100% 0% 100% 0% 100% CP 97% 3% 95% 5% 89% 11% 88% 12% 88% 12% CudaCuts 90% 10% 83% 17% 71% 29% 51% 49% 28% 72% CutCP 97% 3% 94% 6% 98% 2% 83% 17% 59% 41% Gaussian 95% 5% 89% 11% 74% 26% 48% 52% 18% 82% Heartwall 90% 10% 82% 18% 71% 29% 57% 43% 36% 64% Hotspot 89% 11% 77% 23% 65% 35% 53% 47% 44% 56% KMeans 91% 9% 78% 22% 58% 42% 33% 67% 16% 84% LBM 97% 3% 94% 6% 87% 13% 76% 24% 57% 43% Leukocyte 99% 1% 100% 0% 99% 1% 98% 2% 95% 5% LIB 97% 3% 94% 6% 88% 12% 78% 22% 78% 22% LUD 94% 6% 87% 13% 74% 26% 55% 45% 36% 64% MRI-Q 90% 10% 76% 24% 63% 37% 54% 46% 54% 46% MUM 93% 7% 88% 12% 79% 21% 44% 56% 44% 56% NN 31% 69% 12% 88% 5% 95% 1% 99% 0% 100% NW 100% 0% 100% 0% 100% 0% 100% 0% 99% 1% OctreePart 89% 11% 81% 19% 66% 34% 38% 62% 13% 87% Pathnder 93% 7% 84% 16% 73% 27% 59% 41% 38% 62% SAD 98% 2% 95% 5% 90% 10% 80% 20% 62% 38% SGEMM 97% 3% 93% 7% 87% 13% 80% 20% 76% 24% SRAD 94% 6% 86% 14% 73% 27% 58% 42% 42% 58% Stencil 90% 10% 77% 23% 63% 37% 56% 44% 60% 40% Streamcluster 93% 7% 84% 16% 70% 30% 52% 48% 31% 69% VPR 57% 43% 40% 60% 25% 75% 14% 86% 4% 96% Wave 75% 25% 53% 47% 26% 74% 8% 92% 1% 99% WP 95% 5% 91% 9% 84% 16% 73% 27% 58% 42% Average 87% 13% 79% 21% 68% 32% 54% 46% 41% 59% 126 Table A.8: Synchronization Points Classication Per Benchmark (Block-Level Conguration) Benchmark Block-Level 1K sig 2K sig 4K sig 8K sig 16K sig Forced RDU Forced RDU Forced RDU Forced RDU Forced RDU Backprop 90% 10% 75% 25% 39% 61% 0% 100% 0% 100% Barneshut 34% 66% 6% 94% 0% 100% 0% 100% 0% 100% BFS-ISPASS 83% 17% 56% 44% 9% 91% 
0% 100% 0% 100% BFS-Rodinia 86% 14% 68% 32% 28% 72% 1% 99% 0% 100% Btree 0% 100% 0% 100% 0% 100% 0% 100% 0% 100% CP 99% 1% 90% 10% 75% 25% { { { { CudaCuts 73% 27% 35% 65% 1% 99% 0% 100% 0% 100% CutCP 80% 20% 59% 41% 34% 66% 0% 100% 0% 100% Gaussian 41% 59% 3% 97% 0% 100% 0% 100% 0% 100% Heartwall 93% 7% 87% 13% 72% 28% 26% 74% 41% 59% Hotspot 85% 15% 0% 100% { { { { { { KMeans 47% 53% 33% 67% 7% 93% 0% 100% { { LBM 95% 5% 89% 11% 80% 20% 65% 35% 44% 56% Leukocyte 99% 1% 100% 0% 99% 1% 100% 0% 100% 0% LIB 99% 1% 85% 15% 71% 29% 68% 32% 30% 70% LUD 58% 42% 26% 74% 4% 96% 0% 100% 0% 100% MRI-Q 94% 6% 96% 4% 100% 0% 100% 0% { { MUM 74% 26% 45% 55% 0% 100% { { { { NN 71% 29% 50% 50% { { { { { { NW 100% 0% 99% 1% 99% 1% 98% 2% 82% 18% OctreePart 5% 95% 0% 100% 0% 100% 0% 100% 0% 100% Pathnder 74% 26% 39% 61% 0% 100% 0% 100% { { SAD 95% 5% 93% 7% 95% 5% 99% 1% 84% 16% SGEMM 91% 9% 20% 80% 8% 92% 0% 100% { { SRAD 85% 15% 77% 23% 70% 30% 66% 34% 68% 32% Stencil 81% 19% 69% 31% 0% 100% { { { { Streamcluster 83% 17% 70% 30% 44% 56% 21% 79% 7% 93% VPR 60% 40% 32% 68% 0% 100% 0% 100% 0% 100% Wave 48% 52% 24% 76% 12% 88% 6% 94% 0% 100% WP 86% 14% 64% 36% 20% 80% 0% 100% 0% 100% Average 74% 26% 53% 47% 34% 66% 26% 74% 22% 78% 127 Table A.9: Synchronization Points Classication Per Benchmark (Half-Page-Level Conguration) Benchmark Half-Page-Level 1K sig 2K sig 4K sig 8K sig 16K sig Forced RDU Forced RDU Forced RDU Forced RDU Forced RDU Backprop 41% 59% 0% 100% 0% 100% 0% 100% 0% 100% Barneshut 4% 96% 0% 100% 0% 100% 0% 100% 0% 100% BFS-ISPASS 68% 32% 17% 83% 0% 100% 0% 100% 0% 100% BFS-Rodinia 83% 17% 55% 45% 9% 91% 0% 100% 0% 100% Btree 0% 100% 0% 100% 0% 100% 0% 100% 0% 100% CP 13% 88% 0% 100% 0% 100% 0% 100% 0% 100% CudaCuts 15% 85% 0% 100% 0% 100% 0% 100% 0% 100% CutCP 34% 66% 0% 100% 0% 100% { { { { Gaussian 33% 67% 1% 99% 0% 100% 0% 100% 0% 100% Heartwall 93% 7% 87% 13% 71% 29% 22% 78% 4% 96% Hotspot 0% 100% { { { { { { { { KMeans 27% 73% 0% 100% 0% 100% 0% 100% { { LBM 90% 10% 78% 22% 57% 43% 29% 71% 0% 100% Leukocyte 100% 0% 100% 0% 100% 0% { { { { LIB 51% 49% 6% 94% 1% 99% 0% 100% 0% 100% LUD 6% 94% 0% 100% 0% 100% 0% 100% 0% 100% MRI-Q 94% 6% 98% 2% 100% 0% { { { { MUM 62% 38% 5% 95% 0% 100% { { { { NN 4% 96% 0% 100% 0% 100% 0% 100% 0% 100% NW 100% 0% 99% 1% 99% 1% 94% 6% 75% 25% OctreePart 0% 100% 0% 100% 0% 100% 0% 100% 0% 100% Pathnder 0% 100% 0% 100% { { { { { { SAD 91% 9% 95% 5% 94% 6% 0% 100% { { SGEMM 4% 96% 0% 100% { { { { { { SRAD 80% 20% 67% 33% 57% 43% 0% 100% 0% 100% Stencil 11% 89% 0% 100% { { { { { { Streamcluster 54% 46% 23% 77% 6% 94% 0% 100% 0% 100% VPR 52% 48% 24% 76% 0% 100% 0% 100% 0% 100% Wave 7% 93% 3% 97% 0% 100% 0% 100% 0% 100% WP 76% 24% 36% 64% 5% 95% 0% 100% 0% 100% Average 43% 57% 27% 73% 23% 77% 7% 93% 4% 96% 128 Table A.10: Coherence Misses Classication Per Benchmark (Ideal Conguration) Benchmark Ideal 1K sig 2K sig 4K sig 8K sig 16K sig Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Backprop 54.94% 0.04% 54.34% 0.02% 54.43% 0.00% 54.28% 0.00% 54.28% 0.00% Barneshut 21.34% 0.25% 19.69% 0.24% 19.04% 0.25% 20.06% 0.24% 19.85% 0.24% BFS-ISPASS 11.57% 0.00% 11.44% 0.00% 11.25% 0.00% 11.24% 0.00% 11.28% 0.00% BFS-Rodinia 14.22% 0.00% 14.22% 0.00% 14.19% 0.00% 14.13% 0.00% 14.15% 0.00% Btree 51.05% 2.60% 51.05% 2.60% 51.05% 2.60% 51.05% 2.60% 51.05% 2.60% CP 72.98% 0.01% 72.83% 0.00% 72.68% 0.01% 72.64% 0.00% 72.64% 0.00% CudaCuts 36.68% 0.04% 36.40% 0.01% 36.59% 0.00% 36.13% 0.00% 36.13% 0.00% CutCP 2.33% 0.05% 2.21% 0.02% 
2.17% 0.01% 2.17% 0.00% 2.17% 0.00% Gaussian 46.62% 0.01% 46.62% 0.01% 46.62% 0.01% 46.61% 0.01% 46.61% 0.01% Heartwall 47.14% 0.26% 44.66% 0.12% 46.23% 0.17% 46.39% 0.04% 42.51% 0.00% Hotspot 4.69% 0.00% 4.64% 0.00% 4.64% 0.00% 4.64% 0.00% 4.64% 0.00% KMeans 0.81% 0.00% 0.91% 0.00% 0.91% 0.00% 0.91% 0.00% 0.91% 0.00% LBM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Leukocyte 8.24% 2.69% 4.05% 2.86% 1.87% 1.97% 1.01% 1.05% 1.01% 0.05% LIB 32.70% 0.00% 32.70% 0.00% 32.70% 0.00% 32.70% 0.00% 32.70% 0.00% LUD 30.72% 0.13% 30.16% 0.15% 30.04% 0.14% 30.08% 0.15% 30.08% 0.15% MRI-Q 15.93% 0.00% 15.92% 0.00% 15.89% 0.00% 15.88% 0.00% 15.88% 0.00% MUM 55.12% 0.00% 54.59% 0.00% 54.63% 0.00% 54.63% 0.00% 54.63% 0.00% NN 23.10% 0.05% 23.13% 0.01% 23.48% 0.00% 23.48% 0.00% 23.48% 0.00% NW 22.27% 0.00% 22.23% 0.00% 22.18% 0.00% 22.12% 0.00% 22.11% 0.00% OctreePart 58.64% 0.76% 58.39% 0.77% 58.63% 0.77% 58.38% 0.77% 58.38% 0.77% Pathnder 0.35% 0.00% 0.34% 0.00% 0.32% 0.00% 0.32% 0.00% 0.32% 0.00% SAD 11.37% 0.00% 11.37% 0.00% 11.37% 0.00% 11.37% 0.00% 11.37% 0.00% SGEMM 1.25% 0.00% 1.23% 0.00% 1.23% 0.00% 1.23% 0.00% 1.23% 0.00% SRAD 4.24% 0.04% 2.55% 0.04% 1.41% 0.04% 1.13% 0.02% 1.11% 0.00% Stencil 51.89% 0.01% 51.97% 0.00% 51.94% 0.00% 51.92% 0.00% 51.92% 0.00% Streamcluster 9.28% 0.01% 8.82% 0.01% 8.88% 0.00% 8.86% 0.00% 8.83% 0.00% VPR 37.27% 0.20% 36.77% 0.05% 37.40% 0.00% 36.81% 0.00% 36.81% 0.00% Wave 24.38% 0.24% 22.08% 0.11% 21.73% 0.06% 20.90% 0.04% 21.02% 0.04% WP 19.30% 0.01% 19.17% 0.01% 19.11% 0.01% 19.11% 0.01% 19.11% 0.01% Average 25.68% 0.25% 25.15% 0.24% 25.09% 0.20% 25.01% 0.16% 24.87% 0.13% 129 Table A.11: Coherence Misses Classication Per Benchmark (No Speculation Conguration) Benchmark No Spec 1K sig 2K sig 4K sig 8K sig 16K sig Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Backprop 54.68% 0.05% 54.08% 0.02% 53.98% 0.01% 53.90% 0.00% 53.90% 0.00% Barneshut 21.61% 0.22% 20.32% 0.22% 20.36% 0.22% 20.71% 0.20% 20.61% 0.20% BFS-ISPASS 13.76% 0.00% 13.63% 0.00% 13.51% 0.00% 13.52% 0.00% 13.48% 0.00% BFS-Rodinia 17.33% 0.00% 17.19% 0.00% 17.14% 0.00% 17.13% 0.00% 17.15% 0.00% Btree 50.00% 3.02% 50.00% 3.02% 50.00% 3.02% 50.00% 3.02% 50.00% 3.02% CP 72.86% 0.00% 72.65% 0.01% 72.62% 0.00% 72.52% 0.00% 72.52% 0.00% CudaCuts 35.91% 0.04% 35.67% 0.01% 35.83% 0.00% 33.88% 0.00% 33.88% 0.00% CutCP 2.55% 0.05% 2.38% 0.03% 2.34% 0.01% 2.36% 0.00% 2.37% 0.00% Gaussian 45.02% 0.03% 44.95% 0.03% 44.93% 0.03% 44.95% 0.03% 44.95% 0.03% Heartwall 47.27% 0.25% 44.96% 0.27% 45.81% 0.03% 41.86% 0.04% 41.48% 0.01% Hotspot 4.86% 0.00% 4.86% 0.00% 4.86% 0.00% 4.86% 0.00% 4.86% 0.00% KMeans 0.41% 0.00% 0.41% 0.00% 0.41% 0.00% 0.41% 0.00% 0.41% 0.00% LBM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Leukocyte 8.49% 2.64% 4.34% 2.77% 1.82% 1.99% 1.01% 1.03% 1.01% 0.05% LIB 32.61% 0.00% 32.61% 0.00% 32.61% 0.00% 32.60% 0.00% 32.60% 0.00% LUD 30.90% 0.14% 30.16% 0.15% 29.91% 0.13% 29.92% 0.13% 29.81% 0.13% MRI-Q 22.00% 0.00% 21.96% 0.00% 21.90% 0.00% 21.81% 0.00% 21.81% 0.00% MUM 47.98% 0.00% 47.23% 0.00% 47.34% 0.00% 47.34% 0.00% 47.34% 0.00% NN 19.76% 0.03% 18.77% 0.01% 19.44% 0.00% 19.44% 0.00% 19.44% 0.00% NW 22.18% 0.00% 22.14% 0.00% 22.12% 0.00% 22.11% 0.00% 22.11% 0.00% OctreePart 57.55% 0.89% 57.15% 0.84% 57.46% 0.86% 57.46% 0.86% 57.31% 0.85% Pathnder 0.29% 0.00% 0.25% 0.00% 0.25% 0.00% 0.25% 0.00% 0.25% 0.00% SAD 5.17% 0.00% 5.17% 0.00% 5.17% 0.00% 5.17% 0.00% 5.17% 0.00% SGEMM 1.23% 0.00% 1.22% 0.00% 1.22% 0.00% 1.22% 0.00% 
1.22% 0.00% SRAD 3.50% 0.04% 2.02% 0.04% 1.22% 0.03% 1.01% 0.01% 1.01% 0.00% Stencil 47.12% 0.01% 47.13% 0.00% 47.11% 0.00% 47.11% 0.00% 47.11% 0.00% Streamcluster 5.85% 0.01% 5.83% 0.00% 5.69% 0.00% 5.66% 0.00% 5.68% 0.00% VPR 35.50% 0.16% 34.77% 0.04% 34.04% 0.00% 34.68% 0.00% 33.98% 0.00% Wave 21.53% 0.20% 20.79% 0.06% 20.64% 0.02% 20.35% 0.01% 20.43% 0.00% WP 19.52% 0.01% 19.42% 0.01% 19.38% 0.01% 19.39% 0.01% 19.40% 0.01% Average 24.91% 0.26% 24.40% 0.25% 24.30% 0.21% 24.09% 0.18% 24.04% 0.14% 130 Table A.12: Coherence Misses Classication Per Benchmark (Byte-Level Conguration) Benchmark Byte-Level 1K sig 2K sig 4K sig 8K sig 16K sig Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Backprop 55.30% 0.05% 53.66% 0.12% 53.28% 0.13% 52.82% 0.12% 52.88% 0.12% Barneshut 28.73% 0.26% 28.24% 0.19% 25.58% 0.15% 25.93% 0.28% 24.52% 0.21% BFS-ISPASS 11.96% 0.01% 11.89% 0.01% 11.74% 0.02% 11.50% 0.02% 11.32% 0.02% BFS-Rodinia 14.65% 0.01% 14.53% 0.03% 14.39% 0.05% 14.14% 0.05% 13.96% 0.03% Btree 58.86% 3.24% 49.09% 4.59% 44.14% 1.74% 40.42% 1.03% 39.80% 0.86% CP 74.20% 0.11% 73.60% 0.24% 73.17% 0.35% 72.37% 0.62% 71.65% 0.28% CudaCuts 40.53% 0.26% 40.35% 0.34% 39.85% 0.42% 39.48% 0.56% 38.37% 0.64% CutCP 4.61% 1.85% 4.51% 1.91% 4.18% 2.04% 3.86% 2.17% 3.58% 1.83% Gaussian 45.58% 0.01% 45.36% 0.01% 45.12% 0.01% 44.86% 0.02% 44.55% 0.02% Heartwall 49.89% 26.49% 60.45% 9.84% 54.79% 7.50% 53.43% 4.58% 49.89% 3.17% Hotspot 6.14% 0.05% 6.21% 0.09% 5.54% 0.09% 5.85% 0.14% 5.82% 0.13% KMeans 0.46% 0.00% 0.46% 0.00% 0.70% 0.00% 0.70% 0.00% 0.57% 0.00% LBM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Leukocyte 9.88% 20.73% 9.86% 20.75% 9.15% 21.45% 9.21% 21.36% 8.05% 22.39% LIB 32.74% 0.00% 32.74% 0.00% 32.74% 0.01% 32.73% 0.01% 32.73% 0.01% LUD 33.32% 0.09% 33.15% 0.09% 33.02% 0.09% 32.72% 0.12% 32.53% 0.17% MRI-Q 15.95% 0.02% 15.95% 0.02% 15.95% 0.02% 15.94% 0.02% 15.94% 0.02% MUM 54.75% 0.01% 55.08% 0.01% 54.73% 0.01% 54.86% 0.01% 54.98% 0.03% NN 57.10% 13.72% 62.66% 5.68% 60.36% 4.40% 55.07% 3.42% 45.60% 2.25% NW 22.30% 0.16% 22.30% 0.16% 22.29% 0.16% 22.29% 0.15% 22.28% 0.15% OctreePart 62.93% 2.53% 60.14% 2.42% 59.10% 2.61% 56.79% 2.72% 54.60% 2.54% Pathnder 0.75% 0.15% 0.82% 0.16% 0.62% 0.19% 0.51% 0.11% 0.43% 0.12% SAD 11.37% 0.00% 11.35% 0.00% 11.30% 0.00% 11.29% 0.00% 11.29% 0.00% SGEMM 1.29% 0.01% 1.29% 0.01% 1.29% 0.01% 1.28% 0.01% 1.28% 0.02% SRAD 9.44% 0.02% 9.03% 0.04% 8.59% 0.07% 8.52% 0.10% 7.82% 0.15% Stencil 48.40% 0.05% 48.26% 0.07% 48.03% 0.11% 47.57% 0.15% 47.09% 0.18% Streamcluster 11.35% 0.11% 11.12% 0.13% 10.73% 0.14% 10.54% 0.17% 9.36% 0.17% VPR 60.84% 0.64% 57.94% 0.99% 53.11% 1.58% 46.10% 1.89% 37.16% 1.55% Wave 48.69% 0.91% 47.45% 1.38% 44.29% 1.02% 41.71% 0.85% 38.67% 0.47% WP 19.93% 0.11% 19.95% 0.12% 19.92% 0.13% 19.86% 0.14% 19.80% 0.14% Average 29.73% 2.39% 29.58% 1.65% 28.59% 1.48% 27.75% 1.36% 26.55% 1.26% 131 Table A.13: Coherence Misses Classication Per Benchmark (Word-Level Conguration) Benchmark Word-Level 1K sig 2K sig 4K sig 8K sig 16K sig Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Backprop 54.41% 0.09% 53.95% 0.12% 53.84% 0.13% 53.60% 0.18% 52.89% 0.33% Barneshut 28.09% 0.17% 26.98% 0.17% 25.03% 0.23% 18.80% 0.22% 17.02% 0.20% BFS-ISPASS 11.86% 0.01% 11.55% 0.01% 11.36% 0.01% 11.06% 0.01% 10.78% 0.00% BFS-Rodinia 14.40% 0.04% 14.20% 0.03% 13.92% 0.02% 13.64% 0.01% 13.47% 0.00% Btree 42.32% 1.39% 39.55% 0.94% 39.75% 0.88% 38.87% 0.86% 39.67% 0.86% CP 
73.19% 0.34% 73.03% 0.15% 72.32% 0.29% 71.86% 0.26% 71.10% 0.17% CudaCuts 39.71% 0.42% 39.60% 0.52% 38.22% 0.60% 36.81% 0.53% 33.65% 0.26% CutCP 4.12% 2.00% 3.73% 2.01% 3.57% 1.33% 3.12% 1.09% 2.80% 0.76% Gaussian 45.34% 0.01% 45.10% 0.02% 44.77% 0.02% 44.31% 0.02% 43.95% 0.02% Heartwall 56.76% 5.56% 51.13% 3.88% 48.22% 2.53% 44.19% 1.25% 41.44% 1.19% Hotspot 6.06% 0.12% 5.71% 0.11% 5.46% 0.12% 4.75% 0.04% 4.93% 0.01% KMeans 0.70% 0.00% 0.71% 0.00% 0.75% 0.00% 0.65% 0.00% 0.81% 0.00% LBM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Leukocyte 9.46% 21.14% 9.83% 20.78% 7.97% 22.45% 7.91% 22.43% 6.37% 21.59% LIB 32.74% 0.01% 32.73% 0.01% 32.73% 0.01% 32.68% 0.01% 32.68% 0.01% LUD 33.06% 0.09% 33.08% 0.11% 32.69% 0.15% 31.99% 0.24% 31.22% 0.29% MRI-Q 15.95% 0.02% 15.95% 0.02% 15.94% 0.02% 15.94% 0.03% 15.94% 0.03% MUM 55.00% 0.01% 54.98% 0.02% 55.09% 0.02% 54.50% 0.01% 54.50% 0.01% NN 57.92% 3.84% 52.11% 2.89% 42.42% 1.82% 32.57% 1.02% 23.25% 0.41% NW 22.29% 0.15% 22.29% 0.15% 22.28% 0.15% 22.26% 0.14% 22.25% 0.12% OctreePart 59.89% 2.11% 57.52% 2.16% 55.27% 2.20% 54.38% 1.62% 53.61% 1.12% Pathnder 0.69% 0.18% 0.61% 0.13% 0.38% 0.12% 0.24% 0.06% 0.15% 0.02% SAD 11.34% 0.00% 11.32% 0.00% 11.30% 0.00% 11.27% 0.00% 11.33% 0.00% SGEMM 1.29% 0.01% 1.28% 0.02% 1.28% 0.02% 1.27% 0.02% 1.26% 0.01% SRAD 9.19% 0.03% 9.01% 0.07% 8.63% 0.12% 7.18% 0.23% 5.17% 0.35% Stencil 48.09% 0.14% 47.66% 0.18% 47.08% 0.23% 45.55% 0.29% 42.18% 0.27% Streamcluster 11.00% 0.14% 10.56% 0.17% 10.17% 0.19% 9.56% 0.17% 8.95% 0.11% VPR 52.46% 1.52% 43.92% 1.92% 34.19% 1.36% 27.27% 0.58% 24.81% 0.18% Wave 45.10% 0.85% 42.70% 0.76% 37.14% 0.52% 32.02% 0.35% 24.46% 0.26% WP 19.90% 0.13% 19.88% 0.13% 19.80% 0.12% 19.61% 0.11% 19.31% 0.10% Average 28.74% 1.35% 27.69% 1.25% 26.39% 1.19% 24.93% 1.06% 23.66% 0.96% 132 Table A.14: Coherence Misses Classication Per Benchmark (Block-Level Conguration) Benchmark Block-Level 1K sig 2K sig 4K sig 8K sig 16K sig Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Backprop 53.35% 0.04% 52.07% 0.03% 50.67% 0.01% 49.59% 0.00% 49.77% 0.00% Barneshut 18.34% 0.16% 16.94% 0.15% 18.00% 0.13% 17.80% 0.13% 17.77% 0.13% BFS-ISPASS 11.04% 0.00% 10.75% 0.00% 10.68% 0.00% 10.55% 0.00% 10.60% 0.00% BFS-Rodinia 13.55% 0.00% 13.38% 0.00% 13.33% 0.00% 13.30% 0.00% 13.38% 0.00% Btree 41.45% 0.99% 41.45% 0.99% 41.45% 0.99% 41.45% 0.99% 41.45% 0.99% CP 70.37% 0.03% 69.76% 0.01% 69.29% 0.01% 69.13% 0.00% 69.13% 0.00% CudaCuts 35.10% 0.04% 33.13% 0.00% 32.82% 0.00% 34.51% 0.00% 34.51% 0.00% CutCP 2.26% 0.04% 2.08% 0.03% 2.06% 0.01% 2.06% 0.00% 2.06% 0.00% Gaussian 43.63% 0.01% 43.25% 0.01% 43.23% 0.01% 43.24% 0.01% 43.25% 0.01% Heartwall 39.76% 0.26% 37.59% 0.10% 36.33% 0.04% 36.87% 0.01% 34.84% 0.01% Hotspot 4.62% 0.00% 4.69% 0.00% 4.69% 0.00% 4.69% 0.00% 4.69% 0.00% KMeans 0.50% 0.00% 0.51% 0.00% 0.51% 0.00% 0.51% 0.00% 0.51% 0.00% LBM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Leukocyte 8.19% 2.75% 4.05% 2.86% 1.87% 1.97% 1.01% 1.05% 1.01% 0.05% LIB 32.22% 0.00% 32.22% 0.00% 32.22% 0.00% 32.22% 0.00% 32.22% 0.00% LUD 30.79% 0.14% 29.82% 0.15% 29.82% 0.13% 29.64% 0.15% 29.61% 0.15% MRI-Q 15.93% 0.00% 15.92% 0.00% 15.89% 0.00% 15.88% 0.00% 15.88% 0.00% MUM 54.92% 0.00% 54.57% 0.00% 54.26% 0.00% 54.56% 0.00% 54.56% 0.00% NN 12.16% 0.04% 12.14% 0.00% 11.18% 0.00% 11.18% 0.00% 11.18% 0.00% NW 22.27% 0.00% 22.23% 0.00% 22.18% 0.00% 22.12% 0.00% 22.11% 0.00% OctreePart 53.39% 0.63% 53.11% 0.64% 52.80% 0.63% 53.68% 0.64% 53.68% 0.64% Pathnder 
0.12% 0.00% 0.09% 0.00% 0.09% 0.00% 0.09% 0.00% 0.09% 0.00% SAD 11.35% 0.00% 11.38% 0.00% 11.34% 0.00% 11.26% 0.00% 11.26% 0.00% SGEMM 1.25% 0.00% 1.23% 0.00% 1.23% 0.00% 1.23% 0.00% 1.23% 0.00% SRAD 4.16% 0.04% 2.21% 0.04% 1.12% 0.03% 0.83% 0.01% 0.81% 0.00% Stencil 38.96% 0.01% 38.76% 0.00% 38.54% 0.00% 38.64% 0.00% 38.64% 0.00% Streamcluster 8.27% 0.01% 7.75% 0.01% 7.58% 0.00% 7.61% 0.00% 7.64% 0.00% VPR 25.94% 0.21% 23.61% 0.04% 23.65% 0.00% 23.32% 0.00% 23.47% 0.00% Wave 21.33% 0.22% 17.23% 0.10% 16.22% 0.04% 15.92% 0.03% 15.84% 0.03% WP 19.08% 0.01% 18.85% 0.01% 18.66% 0.01% 18.76% 0.01% 18.76% 0.01% Average 23.14% 0.19% 22.36% 0.17% 22.06% 0.13% 22.06% 0.10% 22.00% 0.07% 133 Table A.15: Coherence Misses Classication Per Benchmark (Half-Page-Level Conguration) Benchmark Half-Page-Level 1K sig 2K sig 4K sig 8K sig 16K sig Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Expired AVU misses Backprop 51.06% 0.02% 49.92% 0.01% 49.54% 0.01% 49.50% 0.01% 49.50% 0.01% Barneshut 20.34% 0.12% 19.36% 0.13% 19.34% 0.12% 19.34% 0.12% 19.34% 0.12% BFS-ISPASS 11.05% 0.01% 10.87% 0.01% 10.74% 0.01% 10.67% 0.01% 10.67% 0.01% BFS-Rodinia 13.52% 0.00% 13.39% 0.00% 13.33% 0.00% 13.34% 0.00% 13.36% 0.00% Btree 49.14% 0.92% 49.14% 0.92% 49.14% 0.92% 49.14% 0.92% 49.14% 0.92% CP 67.47% 0.38% 66.64% 0.46% 66.64% 0.46% 66.64% 0.46% 66.64% 0.46% CudaCuts 35.37% 0.14% 35.12% 0.12% 34.90% 0.12% 34.90% 0.12% 34.90% 0.12% CutCP 2.06% 0.01% 2.06% 0.00% 2.06% 0.00% 2.06% 0.00% 2.06% 0.00% Gaussian 44.26% 0.01% 43.93% 0.01% 43.95% 0.01% 43.94% 0.01% 43.94% 0.01% Heartwall 37.00% 0.17% 35.56% 0.04% 36.59% 0.01% 36.36% 0.00% 35.61% 0.00% Hotspot 4.69% 0.00% 4.69% 0.00% 4.69% 0.00% 4.69% 0.00% 4.69% 0.00% KMeans 0.52% 0.00% 0.51% 0.00% 0.51% 0.00% 0.51% 0.00% 0.51% 0.00% LBM 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% 0.00% Leukocyte 1.95% 2.16% 1.01% 1.13% 1.01% 0.05% 1.01% 0.00% 1.01% 0.00% LIB 32.60% 0.00% 32.49% 0.01% 32.50% 0.01% 32.47% 0.01% 32.53% 0.01% LUD 29.98% 0.19% 29.87% 0.19% 29.94% 0.19% 29.94% 0.19% 29.94% 0.19% MRI-Q 15.92% 0.00% 15.90% 0.00% 15.88% 0.00% 15.88% 0.00% 15.88% 0.00% MUM 54.43% 0.00% 54.53% 0.00% 54.56% 0.00% 54.56% 0.00% 54.56% 0.00% NN 11.88% 0.01% 11.06% 0.01% 11.06% 0.01% 11.06% 0.01% 11.06% 0.01% NW 22.26% 0.00% 22.22% 0.00% 22.15% 0.00% 22.12% 0.00% 22.11% 0.00% OctreePart 54.74% 0.81% 53.94% 0.76% 53.94% 0.76% 53.94% 0.76% 53.94% 0.76% Pathnder 0.08% 0.00% 0.09% 0.00% 0.09% 0.00% 0.09% 0.00% 0.09% 0.00% SAD 11.27% 0.00% 11.38% 0.00% 11.27% 0.00% 11.26% 0.00% 11.26% 0.00% SGEMM 1.23% 0.00% 1.23% 0.00% 1.23% 0.00% 1.23% 0.00% 1.23% 0.00% SRAD 2.75% 0.03% 1.14% 0.03% 0.81% 0.01% 0.79% 0.00% 0.78% 0.00% Stencil 38.67% 0.00% 38.61% 0.00% 38.64% 0.00% 38.64% 0.00% 38.64% 0.00% Streamcluster 7.25% 0.00% 7.58% 0.00% 7.63% 0.00% 7.67% 0.00% 7.62% 0.00% VPR 25.71% 0.22% 24.03% 0.05% 23.63% 0.01% 23.43% 0.01% 23.28% 0.01% Wave 18.47% 0.17% 18.15% 0.13% 17.65% 0.13% 17.95% 0.13% 17.95% 0.13% WP 19.14% 0.02% 18.75% 0.02% 18.76% 0.01% 18.80% 0.02% 18.81% 0.02% Average 22.83% 0.18% 22.44% 0.13% 22.41% 0.09% 22.40% 0.09% 22.37% 0.09% 134 Table A.16: Detailed Energy Consumption for Individual Benchmarks (1) Benchmark Baseline Byte Level 1K sig 2K sig GPU RDU AVU RR GPU RDU AVU RR Backprop 63.485 103.545% 7.238% 0.000% 0.013% 103.340% 14.398% 0.000% 0.009% Barneshut 27.546 111.715% 7.907% 0.000% 0.032% 108.335% 15.632% 0.001% 0.021% BFS-ISPASS 44.352 109.068% 7.817% 0.000% 0.010% 109.153% 15.622% 0.000% 0.000% BFS-Rodinia 1011.252 103.040% 3.724% 0.000% 0.005% 
103.077% 7.417% 0.000% 0.000% Btree 156.392 115.986% 2.923% 0.009% 0.022% 112.925% 5.546% 0.036% 0.013% CP 36.962 102.224% 0.168% 0.000% 0.037% 102.130% 0.333% 0.000% 0.033% CudaCuts 99.797 113.180% 6.036% 0.000% 0.008% 114.219% 12.015% 0.001% 0.000% CutCP 1892.855 100.022% 0.053% 0.000% 0.006% 100.025% 0.107% 0.000% 0.003% Gaussian 3119.641 110.045% 4.519% 0.000% 0.009% 109.995% 9.003% 0.000% 0.000% Heartwall 2531.550 108.739% 3.866% 0.037% 0.043% 105.114% 7.691% 0.120% 0.038% Hotspot 28.048 100.570% 1.730% 0.000% 0.043% 100.434% 3.446% 0.000% 0.039% KMeans 3970.755 100.401% 0.327% 0.000% 0.005% 100.401% 0.651% 0.000% 0.001% LBM 2321.134 99.389% 9.913% 0.000% 0.007% 99.389% 19.749% 0.000% 0.003% Leukocyte 4574.771 100.020% 0.060% 0.000% 0.026% 100.016% 0.120% 0.000% 0.020% LIB 440.693 113.396% 9.296% 0.000% 0.020% 113.397% 18.520% 0.000% 0.014% LUD 143.217 101.323% 2.158% 0.000% 0.009% 101.257% 4.297% 0.000% 0.000% MRI-Q 2267.109 100.338% 0.048% 0.000% 0.025% 100.338% 0.096% 0.000% 0.020% MUM 243.938 110.390% 2.954% 0.000% 0.021% 109.802% 5.884% 0.000% 0.016% NN 214.362 104.537% 2.237% 0.011% 0.006% 103.954% 4.401% 0.024% 0.000% NW 287.355 100.182% 2.384% 0.000% 0.010% 100.182% 4.749% 0.000% 0.000% OctreePart 89.658 98.583% 4.370% 0.001% 0.031% 96.770% 8.652% 0.003% 0.022% Pathnder 139.000 100.056% 1.051% 0.000% 0.011% 99.713% 2.090% 0.000% 0.008% SAD 6060.514 100.506% 3.301% 0.000% 0.005% 100.527% 6.576% 0.000% 0.000% SGEMM 749.784 100.011% 0.732% 0.000% 0.070% 100.011% 1.459% 0.000% 0.065% SRAD 907.454 101.438% 5.468% 0.000% 0.008% 101.405% 10.890% 0.000% 0.004% Stencil 796.055 102.710% 7.300% 0.000% 0.005% 102.846% 14.531% 0.001% 0.000% Streamcluster 18591.297 102.669% 2.756% 0.000% 0.005% 102.585% 5.478% 0.000% 0.000% VPR 59.683 130.962% 1.819% 0.002% 0.039% 128.424% 3.503% 0.008% 0.027% Wave 58.331 140.068% 6.342% 0.002% 0.035% 139.559% 12.570% 0.010% 0.017% WP 311.672 99.782% 4.480% 0.000% 0.023% 99.531% 8.925% 0.000% 0.014% 135 Table A.17: Detailed Energy Consumption for Individual Benchmarks (2) Benchmark Baseline Byte Level 4K sig 8K sig GPU RDU AVU RR GPU RDU AVU RR Backprop 63.485 103.340% 28.728% 0.001% 0.009% 103.308% 57.392% 0.002% 0.009% Barneshut 27.546 108.306% 30.918% 0.002% 0.021% 108.092% 62.017% 0.013% 0.021% BFS-ISPASS 44.352 108.921% 31.143% 0.000% 0.000% 109.589% 62.202% 0.002% 0.000% BFS-Rodinia 1011.252 102.997% 14.799% 0.000% 0.000% 103.086% 29.554% 0.001% 0.000% Btree 156.392 109.663% 10.566% 0.096% 0.012% 107.499% 20.559% 0.199% 0.012% CP 36.962 105.803% 0.665% 0.000% 0.033% 98.178% 1.325% 0.000% 0.033% CudaCuts 99.797 113.228% 23.866% 0.002% 0.000% 112.408% 47.506% 0.006% 0.000% CutCP 1892.855 100.026% 0.212% 0.000% 0.003% 100.013% 0.424% 0.000% 0.003% Gaussian 3119.641 109.935% 17.968% 0.000% 0.000% 109.818% 35.895% 0.001% 0.000% Heartwall 2531.550 102.677% 15.878% 0.255% 0.037% 101.239% 32.786% 0.525% 0.037% Hotspot 28.048 100.385% 6.876% 0.000% 0.039% 100.446% 13.736% 0.001% 0.039% KMeans 3970.755 99.747% 1.300% 0.000% 0.001% 99.749% 2.598% 0.000% 0.001% LBM 2321.134 99.389% 39.422% 0.000% 0.003% 99.389% 78.767% 0.000% 0.003% Leukocyte 4574.771 100.015% 0.240% 0.000% 0.020% 100.016% 0.479% 0.001% 0.020% LIB 440.693 113.381% 36.968% 0.000% 0.014% 113.382% 73.863% 0.000% 0.014% LUD 143.217 101.255% 8.571% 0.000% 0.000% 101.325% 17.119% 0.001% 0.000% MRI-Q 2267.109 100.338% 0.191% 0.000% 0.020% 100.338% 0.382% 0.000% 0.020% MUM 243.938 110.285% 11.746% 0.000% 0.016% 110.455% 23.468% 0.000% 0.016% NN 214.362 103.377% 8.744% 0.057% 0.000% 102.716% 17.258% 
0.131% 0.000% NW 287.355 100.141% 9.479% 0.000% 0.000% 100.119% 18.940% 0.001% 0.000% OctreePart 89.658 97.211% 17.293% 0.010% 0.023% 97.234% 34.567% 0.018% 0.023% Pathnder 139.000 99.740% 4.151% 0.000% 0.008% 99.905% 8.295% 0.001% 0.008% SAD 6060.514 100.536% 13.126% 0.000% 0.000% 100.539% 26.226% 0.000% 0.000% SGEMM 749.784 100.011% 2.912% 0.000% 0.065% 100.011% 5.818% 0.000% 0.065% SRAD 907.454 101.445% 21.728% 0.000% 0.004% 101.228% 43.401% 0.001% 0.004% Stencil 796.055 102.851% 28.979% 0.002% 0.000% 102.647% 57.815% 0.007% 0.000% Streamcluster 18591.297 102.282% 10.799% 0.001% 0.000% 102.613% 22.300% 0.003% 0.000% VPR 59.683 125.010% 6.625% 0.031% 0.027% 120.442% 12.187% 0.102% 0.026% Wave 58.331 135.127% 24.616% 0.036% 0.017% 132.197% 48.652% 0.121% 0.018% WP 311.672 97.740% 17.816% 0.000% 0.014% 98.597% 35.597% 0.000% 0.014% 136 Table A.18: Detailed Energy Consumption for Individual Benchmarks (3) Benchmark Baseline Byte Level 16K sig GPU RDU AVU RR Backprop 63.485 103.244% 114.727% 0.004% 0.009% Barneshut 27.546 108.884% 123.319% 0.020% 0.021% BFS-ISPASS 44.352 108.146% 124.442% 0.006% 0.000% BFS-Rodinia 1011.252 102.777% 59.031% 0.004% 0.000% Btree 156.392 106.908% 40.844% 0.334% 0.012% CP 36.962 98.078% 2.645% 0.000% 0.033% CudaCuts 99.797 110.495% 94.259% 0.018% 0.000% CutCP 1892.855 100.021% 0.846% 0.001% 0.003% Gaussian 3119.641 109.651% 71.744% 0.002% 0.000% Heartwall 2531.550 100.345% 66.970% 1.044% 0.037% Hotspot 28.048 100.359% 27.432% 0.002% 0.039% KMeans 3970.755 100.099% 5.193% 0.000% 0.001% LBM 2321.134 99.389% 157.458% 0.000% 0.003% Leukocyte 4574.771 100.019% 0.957% 0.002% 0.020% LIB 440.693 113.408% 147.652% 0.000% 0.014% LUD 143.217 101.221% 34.203% 0.002% 0.000% MRI-Q 2267.109 100.338% 0.763% 0.000% 0.020% MUM 243.938 110.469% 46.908% 0.000% 0.016% NN 214.362 101.910% 34.785% 0.294% 0.000% NW 287.355 100.198% 37.862% 0.001% 0.000% OctreePart 89.658 97.074% 69.088% 0.043% 0.023% Pathnder 139.000 99.838% 16.593% 0.002% 0.008% SAD 6060.514 100.525% 52.427% 0.000% 0.000% SGEMM 749.784 100.011% 11.630% 0.000% 0.065% SRAD 907.454 101.338% 86.709% 0.003% 0.004% Stencil 796.055 102.750% 115.408% 0.025% 0.000% Streamcluster 18591.297 80.544% 32.442% 0.006% 0.000% VPR 59.683 115.135% 22.295% 0.254% 0.026% Wave 58.331 128.878% 96.314% 0.372% 0.018% WP 311.672 97.719% 71.157% 0.001% 0.014% 137 Table A.19: Detailed Energy Consumption for Individual Benchmarks (4) Benchmark Baseline Word Level 1K sig 2K sig GPU RDU AVU RR GPU RDU AVU RR Backprop 63.485 103.355% 2.432% 0.000% 0.009% 103.365% 4.840% 0.001% 0.009% Barneshut 27.546 108.320% 2.793% 0.001% 0.021% 109.726% 5.618% 0.002% 0.021% BFS-ISPASS 44.352 109.122% 3.098% 0.000% 0.000% 108.759% 6.172% 0.001% 0.000% BFS-Rodinia 1011.252 103.030% 1.638% 0.000% 0.000% 102.786% 3.258% 0.000% 0.000% Btree 156.392 108.618% 2.246% 0.026% 0.012% 107.008% 4.385% 0.047% 0.012% CP 36.962 105.595% 0.049% 0.000% 0.033% 105.661% 0.098% 0.000% 0.033% CudaCuts 99.797 113.366% 2.556% 0.001% 0.000% 113.126% 5.124% 0.002% 0.000% CutCP 1892.855 100.021% 0.029% 0.000% 0.003% 100.020% 0.059% 0.000% 0.003% Gaussian 3119.641 109.962% 1.514% 0.000% 0.000% 109.857% 3.015% 0.000% 0.000% Heartwall 2531.550 102.327% 2.627% 0.067% 0.037% 101.045% 5.450% 0.133% 0.037% Hotspot 28.048 100.467% 0.692% 0.000% 0.039% 100.512% 1.376% 0.000% 0.039% KMeans 3970.755 99.772% 0.145% 0.000% 0.001% 99.715% 0.290% 0.000% 0.001% LBM 2321.134 99.389% 2.929% 0.000% 0.003% 99.389% 5.834% 0.000% 0.003% Leukocyte 4574.771 100.019% 0.040% 0.000% 0.020% 100.020% 0.079% 0.000% 
0.020% LIB 440.693 113.469% 2.856% 0.000% 0.014% 113.390% 5.690% 0.000% 0.014% LUD 143.217 101.448% 0.802% 0.000% 0.000% 101.401% 1.596% 0.000% 0.000% MRI-Q 2267.109 100.338% 0.017% 0.000% 0.020% 100.338% 0.034% 0.000% 0.020% MUM 243.938 110.609% 0.987% 0.000% 0.016% 110.169% 1.967% 0.000% 0.016% NN 214.362 103.248% 2.149% 0.016% 0.000% 102.419% 4.249% 0.036% 0.000% NW 287.355 100.119% 0.748% 0.000% 0.000% 100.235% 1.489% 0.000% 0.000% OctreePart 89.658 97.909% 1.649% 0.002% 0.023% 97.966% 3.273% 0.005% 0.023% Pathnder 139.000 100.002% 0.710% 0.000% 0.008% 99.868% 1.417% 0.000% 0.008% SAD 6060.514 100.528% 0.911% 0.000% 0.000% 100.547% 1.815% 0.000% 0.000% SGEMM 749.784 100.011% 0.637% 0.000% 0.065% 100.011% 1.268% 0.000% 0.065% SRAD 907.454 101.438% 1.783% 0.000% 0.004% 101.386% 3.549% 0.000% 0.004% Stencil 796.055 102.801% 3.253% 0.001% 0.000% 102.799% 6.462% 0.003% 0.000% Streamcluster 18591.297 102.650% 1.999% 0.000% 0.000% 102.120% 3.952% 0.001% 0.000% VPR 59.683 125.559% 1.175% 0.010% 0.027% 118.144% 2.045% 0.031% 0.026% Wave 58.331 136.664% 4.487% 0.012% 0.018% 132.708% 8.771% 0.039% 0.018% WP 311.672 100.356% 1.471% 0.000% 0.014% 100.792% 2.930% 0.000% 0.014% 138 Table A.20: Detailed Energy Consumption for Individual Benchmarks (5) Benchmark Baseline Word Level 4K sig 8K sig GPU RDU AVU RR GPU RDU AVU RR Backprop 63.485 103.273% 9.655% 0.001% 0.009% 103.372% 19.278% 0.003% 0.009% Barneshut 27.546 111.701% 11.250% 0.007% 0.022% 108.669% 21.839% 0.046% 0.022% BFS-ISPASS 44.352 107.989% 12.261% 0.002% 0.000% 108.260% 24.446% 0.004% 0.000% BFS-Rodinia 1011.252 102.774% 6.489% 0.001% 0.000% 102.735% 12.935% 0.003% 0.000% Btree 156.392 106.879% 8.751% 0.083% 0.012% 106.640% 17.411% 0.161% 0.012% CP 36.962 100.204% 0.195% 0.000% 0.033% 98.286% 0.389% 0.000% 0.033% CudaCuts 99.797 113.453% 10.098% 0.006% 0.000% 134.827% 24.187% 0.023% 0.000% CutCP 1892.855 100.018% 0.116% 0.000% 0.003% 100.013% 0.231% 0.000% 0.003% Gaussian 3119.641 109.687% 6.014% 0.001% 0.000% 109.468% 12.010% 0.002% 0.000% Heartwall 2531.550 100.111% 11.219% 0.256% 0.037% 99.503% 23.044% 0.473% 0.037% Hotspot 28.048 100.492% 2.738% 0.001% 0.039% 100.413% 5.471% 0.001% 0.039% KMeans 3970.755 99.765% 0.578% 0.000% 0.001% 100.009% 1.155% 0.000% 0.001% LBM 2321.134 99.389% 11.646% 0.000% 0.003% 99.389% 23.269% 0.000% 0.003% Leukocyte 4574.771 100.016% 0.157% 0.000% 0.020% 100.018% 0.314% 0.001% 0.020% LIB 440.693 113.433% 11.358% 0.000% 0.014% 113.348% 22.692% 0.000% 0.014% LUD 143.217 101.192% 3.181% 0.001% 0.000% 101.204% 6.336% 0.002% 0.000% MRI-Q 2267.109 100.338% 0.069% 0.000% 0.020% 100.338% 0.137% 0.000% 0.020% MUM 243.938 110.072% 3.925% 0.000% 0.016% 110.120% 7.832% 0.000% 0.016% NN 214.362 101.755% 8.548% 0.074% 0.000% 101.127% 17.384% 0.154% 0.000% NW 287.355 100.219% 2.973% 0.000% 0.000% 100.210% 5.940% 0.001% 0.000% OctreePart 89.658 97.374% 6.501% 0.013% 0.023% 97.680% 12.990% 0.030% 0.023% Pathnder 139.000 100.226% 2.819% 0.001% 0.008% 99.789% 5.630% 0.001% 0.008% SAD 6060.514 100.547% 3.623% 0.000% 0.000% 100.550% 7.238% 0.000% 0.000% SGEMM 749.784 100.011% 2.531% 0.000% 0.065% 100.011% 5.057% 0.000% 0.065% SRAD 907.454 101.338% 7.074% 0.001% 0.004% 101.357% 14.070% 0.005% 0.004% Stencil 796.055 102.632% 12.834% 0.008% 0.000% 102.539% 25.325% 0.023% 0.000% Streamcluster 18591.297 102.311% 7.905% 0.002% 0.000% 101.899% 15.688% 0.006% 0.000% VPR 59.683 113.436% 3.597% 0.065% 0.026% 110.849% 6.838% 0.107% 0.026% Wave 58.331 128.647% 17.072% 0.114% 0.018% 122.452% 33.256% 0.294% 0.018% WP 311.672 99.777% 5.847% 
0.000% 0.014% 99.809% 11.678% 0.001% 0.014% 139 Table A.21: Detailed Energy Consumption for Individual Benchmarks (6) Benchmark Baseline Word Level 16K sig GPU RDU AVU RR Backprop 63.485 103.206% 38.447% 0.016% 0.009% Barneshut 27.546 110.348% 43.483% 0.107% 0.022% BFS-ISPASS 44.352 108.343% 48.850% 0.010% 0.000% BFS-Rodinia 1011.252 102.569% 25.806% 0.005% 0.000% Btree 156.392 106.882% 34.930% 0.325% 0.012% CP 36.962 103.650% 0.777% 0.000% 0.033% CudaCuts 99.797 95.175% 31.862% 0.033% 0.000% CutCP 1892.855 100.017% 0.459% 0.001% 0.003% Gaussian 3119.641 109.249% 24.006% 0.003% 0.000% Heartwall 2531.550 99.485% 46.685% 0.861% 0.037% Hotspot 28.048 100.371% 10.902% 0.002% 0.039% KMeans 3970.755 99.839% 2.307% 0.000% 0.001% LBM 2321.134 99.389% 46.516% 0.000% 0.003% Leukocyte 4574.771 100.017% 0.628% 0.003% 0.020% LIB 440.693 113.348% 45.361% 0.000% 0.014% LUD 143.217 101.090% 12.593% 0.010% 0.000% MRI-Q 2267.109 100.338% 0.274% 0.000% 0.020% MUM 243.938 110.120% 15.655% 0.000% 0.016% NN 214.362 100.606% 35.282% 0.267% 0.000% NW 287.355 100.152% 11.875% 0.001% 0.000% OctreePart 89.658 98.100% 25.835% 0.057% 0.023% Pathnder 139.000 100.187% 11.236% 0.001% 0.008% SAD 6060.514 100.519% 14.469% 0.000% 0.000% SGEMM 749.784 100.011% 10.108% 0.000% 0.065% SRAD 907.454 101.071% 27.909% 0.022% 0.004% Stencil 796.055 102.227% 49.364% 0.066% 0.000% Streamcluster 18591.297 101.794% 31.257% 0.015% 0.000% VPR 59.683 110.246% 13.586% 0.144% 0.026% Wave 58.331 117.079% 65.913% 0.666% 0.019% WP 311.672 99.160% 23.320% 0.003% 0.014% 140 Table A.22: Detailed Energy Consumption for Individual Benchmarks (7) Benchmark Baseline Block Level 1K sig 2K sig GPU RDU AVU RR GPU RDU AVU RR Backprop 63.485 103.245% 0.090% 0.000% 0.009% 102.960% 0.179% 0.000% 0.009% Barneshut 27.546 105.310% 0.144% 0.000% 0.021% 103.502% 0.289% 0.000% 0.021% BFS-ISPASS 44.352 108.166% 0.225% 0.000% 0.000% 108.766% 0.448% 0.000% 0.000% BFS-Rodinia 1011.252 102.684% 0.096% 0.000% 0.000% 102.676% 0.192% 0.000% 0.000% Btree 156.392 110.290% 0.053% 0.000% 0.013% 110.290% 0.106% 0.001% 0.013% CP 36.962 101.959% 0.003% 0.000% 0.033% 105.655% 0.005% 0.000% 0.033% CudaCuts 99.797 110.921% 0.073% 0.000% 0.000% 94.315% 0.121% 0.000% 0.000% CutCP 1892.855 100.008% 0.001% 0.000% 0.003% 100.007% 0.002% 0.000% 0.003% Gaussian 3119.641 109.316% 0.101% 0.000% 0.000% 109.180% 0.201% 0.000% 0.000% Heartwall 2531.550 99.721% 0.074% 0.001% 0.037% 100.107% 0.147% 0.001% 0.037% Hotspot 28.048 100.283% 0.026% 0.000% 0.039% 100.235% 0.052% 0.000% 0.039% KMeans 3970.755 99.912% 0.005% 0.000% 0.001% 99.896% 0.009% 0.000% 0.001% LBM 2321.134 99.389% 0.100% 0.000% 0.003% 99.389% 0.199% 0.000% 0.003% Leukocyte 4574.771 100.008% 0.001% 0.000% 0.020% 100.007% 0.002% 0.000% 0.020% LIB 440.693 113.258% 0.073% 0.000% 0.014% 113.178% 0.145% 0.000% 0.014% LUD 143.217 100.990% 0.033% 0.000% 0.000% 100.948% 0.067% 0.000% 0.000% MRI-Q 2267.109 100.338% 0.000% 0.000% 0.020% 100.338% 0.001% 0.000% 0.020% MUM 243.938 109.744% 0.045% 0.000% 0.016% 110.024% 0.090% 0.000% 0.016% NN 214.362 100.381% 0.056% 0.000% 0.000% 100.353% 0.112% 0.000% 0.000% NW 287.355 100.159% 0.033% 0.000% 0.000% 100.265% 0.066% 0.000% 0.000% OctreePart 89.658 97.769% 0.068% 0.000% 0.023% 97.417% 0.135% 0.000% 0.023% Pathnder 139.000 100.119% 0.012% 0.000% 0.008% 100.107% 0.024% 0.000% 0.008% SAD 6060.514 100.521% 0.037% 0.000% 0.000% 100.498% 0.074% 0.000% 0.000% SGEMM 749.784 100.010% 0.006% 0.000% 0.065% 100.010% 0.012% 0.000% 0.065% SRAD 907.454 100.764% 0.086% 0.000% 0.004% 100.461% 0.171% 0.000% 0.004% 
Stencil 796.055 101.856% 0.058% 0.000% 0.000% 101.769% 0.114% 0.000% 0.000% Streamcluster 18591.297 101.234% 0.029% 0.000% 0.000% 101.051% 0.057% 0.000% 0.000% VPR 59.683 110.189% 0.037% 0.000% 0.026% 109.807% 0.074% 0.000% 0.026% Wave 58.331 112.755% 0.101% 0.001% 0.019% 111.320% 0.203% 0.001% 0.019% WP 311.672 101.262% 0.056% 0.000% 0.014% 100.266% 0.112% 0.000% 0.014% 141 Table A.23: Detailed Energy Consumption for Individual Benchmarks (8) Benchmark Baseline Block Level 4K sig 8K sig GPU RDU AVU RR GPU RDU AVU RR Backprop 63.485 102.581% 0.357% 0.000% 0.009% 102.224% 0.712% 0.000% 0.009% Barneshut 27.546 104.506% 0.566% 0.000% 0.021% 104.574% 1.125% 0.001% 0.021% BFS-ISPASS 44.352 107.127% 0.890% 0.000% 0.000% 107.974% 1.777% 0.000% 0.000% BFS-Rodinia 1011.252 102.616% 0.383% 0.000% 0.000% 102.556% 0.765% 0.000% 0.000% Btree 156.392 110.290% 0.212% 0.002% 0.013% 110.290% 0.424% 0.003% 0.013% CP 36.962 98.079% 0.010% 0.000% 0.033% 96.333% 0.021% 0.000% 0.033% CudaCuts 99.797 94.561% 0.243% 0.000% 0.000% 110.821% 0.582% 0.000% 0.000% CutCP 1892.855 100.009% 0.003% 0.000% 0.003% 100.015% 0.007% 0.000% 0.003% Gaussian 3119.641 109.132% 0.401% 0.000% 0.000% 109.176% 0.801% 0.000% 0.000% Heartwall 2531.550 99.668% 0.294% 0.002% 0.037% 99.233% 0.588% 0.004% 0.037% Hotspot 28.048 100.235% 0.103% 0.000% 0.039% 100.235% 0.206% 0.000% 0.039% KMeans 3970.755 99.896% 0.018% 0.000% 0.001% 99.896% 0.036% 0.000% 0.001% LBM 2321.134 99.389% 0.398% 0.000% 0.003% 99.389% 0.794% 0.000% 0.003% Leukocyte 4574.771 100.003% 0.005% 0.000% 0.020% 100.004% 0.009% 0.000% 0.020% LIB 440.693 113.178% 0.289% 0.000% 0.014% 113.178% 0.577% 0.000% 0.014% LUD 143.217 100.884% 0.133% 0.000% 0.000% 100.923% 0.265% 0.000% 0.000% MRI-Q 2267.109 100.338% 0.002% 0.000% 0.020% 100.338% 0.003% 0.000% 0.020% MUM 243.938 110.346% 0.179% 0.000% 0.016% 109.979% 0.357% 0.000% 0.016% NN 214.362 100.266% 0.224% 0.000% 0.000% 100.266% 0.447% 0.000% 0.000% NW 287.355 100.193% 0.132% 0.000% 0.000% 100.193% 0.264% 0.000% 0.000% OctreePart 89.658 97.072% 0.270% 0.000% 0.023% 98.135% 0.542% 0.001% 0.023% Pathnder 139.000 99.720% 0.049% 0.000% 0.008% 99.720% 0.097% 0.000% 0.008% SAD 6060.514 100.509% 0.148% 0.000% 0.000% 100.550% 0.297% 0.000% 0.000% SGEMM 749.784 100.010% 0.024% 0.000% 0.065% 100.010% 0.049% 0.000% 0.065% SRAD 907.454 100.272% 0.341% 0.000% 0.004% 100.312% 0.681% 0.000% 0.004% Stencil 796.055 101.730% 0.228% 0.000% 0.000% 101.872% 0.456% 0.000% 0.000% Streamcluster 18591.297 101.124% 0.115% 0.000% 0.000% 101.253% 0.232% 0.000% 0.000% VPR 59.683 108.482% 0.148% 0.000% 0.026% 109.813% 0.296% 0.001% 0.026% Wave 58.331 110.719% 0.408% 0.001% 0.019% 110.295% 0.814% 0.001% 0.019% WP 311.672 99.059% 0.224% 0.000% 0.014% 98.885% 0.447% 0.000% 0.014% 142 Table A.24: Detailed Energy Consumption for Individual Benchmarks (9) Benchmark Baseline Block Level 16K sig GPU RDU AVU RR Backprop 63.485 102.243% 1.423% 0.000% 0.009% Barneshut 27.546 104.916% 2.268% 0.002% 0.021% BFS-ISPASS 44.352 107.918% 3.554% 0.000% 0.000% BFS-Rodinia 1011.252 102.591% 1.528% 0.000% 0.000% Btree 156.392 110.290% 0.848% 0.006% 0.013% CP 36.962 96.333% 0.041% 0.000% 0.033% CudaCuts 99.797 110.821% 1.163% 0.000% 0.000% CutCP 1892.855 100.007% 0.013% 0.000% 0.003% Gaussian 3119.641 109.149% 1.601% 0.000% 0.000% Heartwall 2531.550 99.687% 1.174% 0.009% 0.037% Hotspot 28.048 100.235% 0.412% 0.000% 0.039% KMeans 3970.755 99.896% 0.072% 0.000% 0.001% LBM 2321.134 99.389% 1.588% 0.000% 0.003% Leukocyte 4574.771 100.002% 0.018% 0.000% 0.020% LIB 440.693 113.178% 
1.153% 0.000% 0.014% LUD 143.217 100.874% 0.531% 0.000% 0.000% MRI-Q 2267.109 100.338% 0.006% 0.000% 0.020% MUM 243.938 109.979% 0.714% 0.000% 0.016% NN 214.362 100.266% 0.893% 0.001% 0.000% NW 287.355 100.193% 0.528% 0.000% 0.000% OctreePart 89.658 98.135% 1.084% 0.001% 0.023% Pathnder 139.000 99.720% 0.194% 0.000% 0.008% SAD 6060.514 100.550% 0.593% 0.000% 0.000% SGEMM 749.784 100.010% 0.098% 0.000% 0.065% SRAD 907.454 100.065% 1.361% 0.000% 0.004% Stencil 796.055 101.872% 0.912% 0.001% 0.000% Streamcluster 18591.297 101.278% 0.465% 0.000% 0.000% VPR 59.683 109.980% 0.593% 0.001% 0.026% Wave 58.331 110.210% 1.630% 0.003% 0.019% WP 311.672 99.390% 0.894% 0.000% 0.014% 143 Table A.25: Detailed Energy Consumption for Individual Benchmarks (10) Benchmark Baseline Half-Page Level 1K sig 2K sig GPU RDU AVU RR GPU RDU AVU RR Backprop 63.485 102.804% 0.090% 0.000% 0.009% 102.282% 0.178% 0.000% 0.009% Barneshut 27.546 106.325% 0.142% 0.000% 0.021% 104.951% 0.286% 0.000% 0.021% BFS-ISPASS 44.352 108.520% 0.225% 0.000% 0.000% 107.240% 0.447% 0.000% 0.000% BFS-Rodinia 1011.252 102.571% 0.096% 0.000% 0.000% 102.519% 0.192% 0.000% 0.000% Btree 156.392 111.226% 0.053% 0.000% 0.013% 111.226% 0.106% 0.001% 0.013% CP 36.962 105.484% 0.003% 0.000% 0.033% 100.063% 0.005% 0.000% 0.033% CudaCuts 99.797 110.916% 0.073% 0.000% 0.000% 110.609% 0.145% 0.000% 0.000% CutCP 1892.855 100.009% 0.001% 0.000% 0.003% 100.015% 0.002% 0.000% 0.003% Gaussian 3119.641 109.319% 0.101% 0.000% 0.000% 109.195% 0.201% 0.000% 0.000% Heartwall 2531.550 99.165% 0.074% 0.001% 0.037% 98.608% 0.147% 0.001% 0.037% Hotspot 28.048 100.235% 0.026% 0.000% 0.039% 100.235% 0.052% 0.000% 0.039% KMeans 3970.755 100.051% 0.005% 0.000% 0.001% 99.896% 0.009% 0.000% 0.001% LBM 2321.134 99.389% 0.100% 0.000% 0.003% 99.389% 0.199% 0.000% 0.003% Leukocyte 4574.771 100.005% 0.001% 0.000% 0.020% 100.004% 0.002% 0.000% 0.020% LIB 440.693 113.356% 0.073% 0.000% 0.014% 113.290% 0.145% 0.000% 0.014% LUD 143.217 100.850% 0.033% 0.000% 0.000% 100.998% 0.067% 0.000% 0.000% MRI-Q 2267.109 100.338% 0.000% 0.000% 0.020% 100.338% 0.001% 0.000% 0.020% MUM 243.938 110.206% 0.045% 0.000% 0.016% 109.746% 0.090% 0.000% 0.016% NN 214.362 100.258% 0.056% 0.000% 0.000% 100.399% 0.112% 0.000% 0.000% NW 287.355 100.184% 0.033% 0.000% 0.000% 100.194% 0.066% 0.000% 0.000% OctreePart 89.658 98.479% 0.069% 0.000% 0.023% 96.935% 0.135% 0.000% 0.023% Pathnder 139.000 99.962% 0.012% 0.000% 0.008% 99.720% 0.024% 0.000% 0.008% SAD 6060.514 100.532% 0.037% 0.000% 0.000% 100.503% 0.074% 0.000% 0.000% SGEMM 749.784 100.010% 0.006% 0.000% 0.065% 100.010% 0.012% 0.000% 0.065% SRAD 907.454 100.570% 0.086% 0.000% 0.004% 100.227% 0.171% 0.000% 0.004% Stencil 796.055 101.911% 0.057% 0.000% 0.000% 101.885% 0.114% 0.000% 0.000% Streamcluster 18591.297 79.917% 0.022% 0.000% 0.000% 101.109% 0.058% 0.000% 0.000% VPR 59.683 109.981% 0.037% 0.000% 0.026% 110.617% 0.074% 0.000% 0.026% Wave 58.331 111.777% 0.102% 0.000% 0.019% 111.508% 0.203% 0.001% 0.019% WP 311.672 98.550% 0.056% 0.000% 0.014% 99.623% 0.112% 0.000% 0.014% 144 Table A.26: Detailed Energy Consumption for Individual Benchmarks (11) Benchmark Baseline Half-Page Level 4K sig 8K sig GPU RDU AVU RR GPU RDU AVU RR Backprop 63.485 102.210% 0.356% 0.000% 0.009% 102.226% 0.711% 0.000% 0.009% Barneshut 27.546 105.252% 0.559% 0.000% 0.021% 105.252% 1.116% 0.001% 0.021% BFS-ISPASS 44.352 107.659% 0.890% 0.000% 0.000% 107.488% 1.780% 0.000% 0.000% BFS-Rodinia 1011.252 102.634% 0.383% 0.000% 0.000% 102.639% 0.765% 0.000% 0.000% Btree 156.392 
111.226% 0.212% 0.001% 0.013% 111.226% 0.423% 0.003% 0.013% CP 36.962 100.063% 0.010% 0.000% 0.033% 100.063% 0.021% 0.000% 0.033% CudaCuts 99.797 110.442% 0.289% 0.000% 0.000% 110.442% 0.577% 0.000% 0.000% CutCP 1892.855 100.007% 0.003% 0.000% 0.003% 100.007% 0.007% 0.000% 0.003% Gaussian 3119.641 109.193% 0.401% 0.000% 0.000% 109.188% 0.801% 0.000% 0.000% Heartwall 2531.550 100.174% 0.294% 0.002% 0.037% 99.089% 0.588% 0.004% 0.037% Hotspot 28.048 100.235% 0.103% 0.000% 0.039% 100.235% 0.206% 0.000% 0.039% KMeans 3970.755 99.896% 0.018% 0.000% 0.001% 99.896% 0.036% 0.000% 0.001% LBM 2321.134 99.389% 0.398% 0.000% 0.003% 99.389% 0.794% 0.000% 0.003% Leukocyte 4574.771 100.002% 0.005% 0.000% 0.020% 100.000% 0.009% 0.000% 0.020% LIB 440.693 113.281% 0.289% 0.000% 0.014% 113.203% 0.577% 0.000% 0.014% LUD 143.217 100.998% 0.133% 0.000% 0.000% 100.998% 0.265% 0.000% 0.000% MRI-Q 2267.109 100.338% 0.002% 0.000% 0.020% 100.338% 0.003% 0.000% 0.020% MUM 243.938 109.979% 0.179% 0.000% 0.016% 109.979% 0.357% 0.000% 0.016% NN 214.362 100.399% 0.224% 0.000% 0.000% 100.399% 0.447% 0.000% 0.000% NW 287.355 100.193% 0.132% 0.000% 0.000% 100.193% 0.264% 0.000% 0.000% OctreePart 89.658 96.935% 0.270% 0.000% 0.023% 96.935% 0.539% 0.001% 0.023% Pathnder 139.000 99.720% 0.049% 0.000% 0.008% 99.720% 0.097% 0.000% 0.008% SAD 6060.514 100.544% 0.148% 0.000% 0.000% 100.550% 0.297% 0.000% 0.000% SGEMM 749.784 100.010% 0.024% 0.000% 0.065% 100.010% 0.049% 0.000% 0.065% SRAD 907.454 100.214% 0.341% 0.000% 0.004% 100.318% 0.681% 0.000% 0.004% Stencil 796.055 101.872% 0.228% 0.000% 0.000% 101.872% 0.456% 0.000% 0.000% Streamcluster 18591.297 101.129% 0.116% 0.000% 0.000% 101.532% 0.234% 0.000% 0.000% VPR 59.683 109.095% 0.148% 0.000% 0.026% 109.916% 0.297% 0.001% 0.026% Wave 58.331 111.757% 0.407% 0.001% 0.019% 112.255% 0.817% 0.003% 0.019% WP 311.672 100.342% 0.224% 0.000% 0.014% 100.423% 0.447% 0.000% 0.014% 145 Table A.27: Detailed Energy Consumption for Individual Benchmarks (12) Benchmark Baseline Half-Page Level 16K sig GPU RDU AVU RR Backprop 63.485 102.226% 1.422% 0.000% 0.009% Barneshut 27.546 105.252% 2.232% 0.001% 0.021% BFS-ISPASS 44.352 107.488% 3.559% 0.000% 0.000% BFS-Rodinia 1011.252 102.501% 1.528% 0.000% 0.000% Btree 156.392 111.226% 0.847% 0.005% 0.013% CP 36.962 100.063% 0.041% 0.000% 0.033% CudaCuts 99.797 110.442% 1.153% 0.001% 0.000% CutCP 1892.855 100.007% 0.013% 0.000% 0.003% Gaussian 3119.641 109.188% 1.601% 0.000% 0.000% Heartwall 2531.550 99.163% 1.175% 0.009% 0.037% Hotspot 28.048 100.235% 0.412% 0.000% 0.039% KMeans 3970.755 99.896% 0.072% 0.000% 0.001% LBM 2321.134 99.389% 1.588% 0.000% 0.003% Leukocyte 4574.771 100.000% 0.018% 0.000% 0.020% LIB 440.693 113.295% 1.154% 0.000% 0.014% LUD 143.217 100.998% 0.531% 0.000% 0.000% MRI-Q 2267.109 100.338% 0.006% 0.000% 0.020% MUM 243.938 109.979% 0.714% 0.000% 0.016% NN 214.362 100.399% 0.893% 0.001% 0.000% NW 287.355 100.193% 0.528% 0.000% 0.000% OctreePart 89.658 96.935% 1.077% 0.001% 0.023% Pathnder 139.000 99.720% 0.194% 0.000% 0.008% SAD 6060.514 100.550% 0.593% 0.000% 0.000% SGEMM 749.784 100.010% 0.098% 0.000% 0.065% SRAD 907.454 100.294% 1.361% 0.000% 0.004% Stencil 796.055 101.872% 0.912% 0.001% 0.000% Streamcluster 18591.297 101.169% 0.462% 0.000% 0.000% VPR 59.683 110.280% 0.594% 0.002% 0.026% Wave 58.331 112.255% 1.634% 0.005% 0.019% WP 311.672 100.385% 0.894% 0.000% 0.014% 146
Abstract
Graphics Processing Units (GPUs) are designed primarily to execute multimedia and game-rendering applications. These applications are characterized by streaming data with little to no data sharing between threads. Because of their high power efficiency, massive parallel computational capability, and high off-chip memory bandwidth, GPUs are now making inroads into executing general-purpose applications that have significant, but somewhat irregular, parallelism. Improvements in programming interfaces such as CUDA and OpenCL accelerate the adoption of GPUs for general-purpose applications. However, these new application usages do not align well with the underlying GPU architecture. In particular, some of the irregular applications do share data between threads, and they also exhibit inter-thread communication patterns that are not well supported in current GPU hardware. Unlike traditional graphics applications, which mostly deal with streaming data, the new class of applications also shows some level of temporal and spatial locality between threads executing in the same kernel or thread block. But GPUs have limited cache capacity and do not support efficient inter-thread communication through memory. As such, the programmer or compiler must resort to ad-hoc solutions to tackle these challenges. This thesis presents a set of unifying GPU memory system improvements that enable efficient data sharing between threads, as well as comprehensive coherence and consistency models that enable efficient inter-thread communication.

The first part of this thesis shows that there is significant data sharing across threads in a GPU while executing general-purpose applications. However, due to poor thread scheduling, data sharing leads to the replication of data in multiple private caches across many streaming multiprocessor cores (SMs) in a GPU, which in turn reduces the effective cache size. To tackle this challenge, this thesis presents an efficient data sharing mechanism that reduces redundant data copies in the memory system. It includes a sharing-aware thread block (also called Cooperative Thread Array, or CTA) scheduler that attempts to assign CTAs that share data to the same SM, reducing redundant storage of data in private L1 caches across SMs. The design is further enhanced with a sharing-aware cache allocation and replacement policy. The sharing-aware cache management approach dynamically classifies private and shared data: private blocks are given higher priority to stay longer in the L1 cache, and shared blocks are given higher priority to stay longer in the L2 cache. The evaluation experiments show that the proposed design reduces off-chip traffic by 19%, which translates to an average DRAM power reduction of 10% and a performance improvement of 7%.
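As a rough illustration of the sharing-aware scheduling idea described above, the following C++ sketch steers CTAs that share the same data tile to the same SM and falls back to load balancing when that SM is full. It is a minimal model for illustration only; the names and parameters (Cta, sharing_key, NUM_SMS, MAX_CTAS_PER_SM) are assumptions and not the scheduler evaluated in the dissertation.

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Assumed machine parameters (not taken from the dissertation's configuration).
constexpr int NUM_SMS = 16;          // streaming multiprocessors
constexpr int MAX_CTAS_PER_SM = 8;   // per-SM CTA occupancy limit

struct Cta {
    int id;
    uint64_t sharing_key;  // e.g., a hash of the data tile the CTA mostly touches
};

// Greedy sharing-aware assignment: CTAs with the same sharing key are steered to
// the same SM so shared blocks are cached once in that SM's L1 instead of being
// replicated across SMs; otherwise fall back to least-loaded assignment.
std::vector<int> assign_ctas(const std::vector<Cta>& ctas) {
    std::unordered_map<uint64_t, int> key_to_sm;  // sharing key -> preferred SM
    std::vector<int> load(NUM_SMS, 0);            // CTAs currently resident per SM
    std::vector<int> assignment(ctas.size(), -1);

    auto least_loaded = [&]() {
        int best = 0;
        for (int sm = 1; sm < NUM_SMS; ++sm)
            if (load[sm] < load[best]) best = sm;
        return best;
    };

    for (std::size_t i = 0; i < ctas.size(); ++i) {
        int sm;
        auto it = key_to_sm.find(ctas[i].sharing_key);
        if (it != key_to_sm.end() && load[it->second] < MAX_CTAS_PER_SM) {
            sm = it->second;                     // co-locate with its sharers
        } else {
            sm = least_loaded();                 // preserve load balance
            key_to_sm[ctas[i].sharing_key] = sm;
        }
        ++load[sm];
        assignment[i] = sm;
    }
    return assignment;
}

int main() {
    std::vector<Cta> ctas;
    for (int i = 0; i < 64; ++i)
        ctas.push_back({i, static_cast<uint64_t>(i / 4)});  // every 4 CTAs share a tile
    std::vector<int> where = assign_ctas(ctas);
    for (int i = 0; i < 8; ++i)
        std::cout << "CTA " << i << " -> SM " << where[i] << "\n";
    return 0;
}

Co-locating sharers trades a little load balance for L1 reuse; in the evaluated design this kind of scheduler is paired with the sharing-aware L1/L2 allocation and replacement policy described above.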
The second part of the thesis focuses on supporting intuitive memory coherence and consistency models that programmers are familiar with from the CPU domain. The thesis presents GPU-centric Time Stamp Coherence (G-TSC), a novel cache coherence protocol for GPUs that is based on timestamp ordering. G-TSC conducts its coherence transactions in logical time rather than physical time and uses timestamp-based self-invalidation of cached data, which dramatically reduces coherence traffic. The thesis demonstrates the challenges in adopting timestamp coherence for GPUs, which support massive thread parallelism and have unique microarchitectural features, and then presents a number of solutions that tackle these GPU-centric challenges. Evaluation of G-TSC shows that it outperforms time-based coherence by 38% with release consistency.

The third part of the thesis explores efficient approaches to enforce sequential consistency in GPUs. The main intuition behind this work is that a significant fraction of the coherence traffic can be curtailed by simply delaying the propagation of updated data values across SMs until the end of an epoch, where an epoch is broadly defined as the time between two data race occurrences. A data race occurs when two threads concurrently access data and at least one access is a write. The thesis presents a simple Bloom-filter-based signature generation mechanism that tracks each SM's write-set in a signature and uses the signatures to dynamically detect races. Data updates are propagated when a race is detected from the signatures, which in turn provides sequentially consistent execution. The evaluation of the proposed scheme shows that it can achieve sequential consistency with a performance overhead as low as 5% and an energy overhead as low as 2.7%.
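The signature-based race detection sketched in the third part can be illustrated with a small Bloom filter over cache-block addresses. The C++ model below is a minimal sketch under assumed parameters (a 1K-bit signature, two hash functions, 64-byte blocks); the hash functions and the may_race interface are illustrative and not the hardware design evaluated in the dissertation.

#include <array>
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <iostream>

// Assumed signature size; the appendix tables sweep 1K- to 16K-bit signatures.
constexpr std::size_t SIG_BITS = 1024;

struct WriteSignature {
    std::bitset<SIG_BITS> bits;

    // Two cheap hashes over the cache-block address (64-byte blocks assumed).
    static std::array<std::size_t, 2> hashes(uint64_t addr) {
        uint64_t block = addr >> 6;
        uint64_t h1 = block * 0x9E3779B97F4A7C15ULL;
        uint64_t h2 = (block ^ (block >> 33)) * 0xC2B2AE3D27D4EB4FULL;
        return {static_cast<std::size_t>(h1 % SIG_BITS),
                static_cast<std::size_t>(h2 % SIG_BITS)};
    }

    void record_write(uint64_t addr) {
        for (std::size_t h : hashes(addr)) bits.set(h);
    }

    // Conservative membership test: may report a conflict that did not happen
    // (false positive), but never misses a block that was actually written.
    bool may_contain(uint64_t addr) const {
        for (std::size_t h : hashes(addr)) {
            if (!bits.test(h)) return false;
        }
        return true;
    }

    void clear() { bits.reset(); }  // reset when a new epoch begins
};

// A remote SM's access races with the local epoch if it touches an address that
// may be in the local write signature: a read-write or write-write conflict.
bool may_race(const WriteSignature& local_writes, uint64_t remote_addr) {
    return local_writes.may_contain(remote_addr);
}

int main() {
    WriteSignature sig;
    sig.record_write(0x1000);
    std::cout << "conflict on 0x1000: " << may_race(sig, 0x1000) << "\n";  // 1 (true race)
    std::cout << "conflict on 0x8000: " << may_race(sig, 0x8000) << "\n";  // 0 unless a false positive
    return 0;
}

Because a Bloom filter produces only false positives, detection is conservative: a false positive at most triggers an unnecessary propagation, while no true conflict goes undetected, which is what preserves sequentially consistent execution.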
Although GPUs are equipped with multi-level caches, general-purpose applications on GPUs experience significant memory access bottlenecks. The miss rates in the private L1 caches and the shared L2 cache are high despite the data locality among executed threads. The conventional data management in GPUs, inherited from CPUs, worsens cache performance rather than enhancing it, because the GPU execution model is fundamentally different from that of CPUs. A single streaming multiprocessor (SM) can execute thousands of concurrent hardware threads, whereas a CPU executes only a handful. This high level of thread-level parallelism (TLP) imposes additional pressure on the per-SM private cache. Moreover, the poor cache performance affects the bandwidth and traffic in the interconnection network that connects the private and shared caches. Hence, one of the goals of this dissertation is to develop a new cache management scheme that accounts for inter-CTA data sharing. The purpose of the proposed scheme is to maintain and improve GPU power efficiency without sacrificing performance.

Another issue with the current GPU memory system is inter-kernel thread communication. Since the primary use of GPUs has been graphics applications, which operate on streaming data with no communication between executing threads, GPUs rely on a simple software-based coherence protocol and a very weak memory model. The implemented coherence protocol ensures the propagation of updates only at kernel boundaries, by flushing the private caches and spilling updates to main memory. Hence, any producer-consumer relationship between executing threads must be split into two kernels. Although this strategy may seem acceptable, providing a coherence protocol and a more advanced memory model that allow inter-kernel thread communication can enhance the performance of the executed applications by roughly 30%.

In this dissertation, we address the need for inter-kernel thread communication. We propose a coherence protocol that allows threads to exchange updated data during kernel execution, without waiting for kernel boundaries. We also consider the implementation of memory models that are easier for programmers to reason about and to build their parallel applications on. The proposed schemes and designs exploit the GPU's ability to hide memory latency through TLP and fast context switching.
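To make the timestamp-ordering idea behind this coherence support more concrete, the following C++ sketch models, in the spirit of G-TSC, an L1 line that carries a lease expressed in logical time and self-invalidates on an expired read instead of receiving an invalidation message. The fixed lease length, the field names, and the two-level cache model are assumptions for illustration, not the protocol's actual implementation.

#include <cstdint>
#include <iostream>
#include <unordered_map>

// Assumed fixed lease length, in logical ticks.
constexpr uint64_t LEASE = 16;

struct CacheLine {
    uint32_t data = 0;
    uint64_t lease_end = 0;   // logical timestamp after which this copy is stale
};

// Stand-in for the shared L2 / memory side of the hierarchy.
struct L2Backend {
    std::unordered_map<uint64_t, uint32_t> mem;
    uint32_t read(uint64_t addr) { return mem[addr]; }
    void write(uint64_t addr, uint32_t value) { mem[addr] = value; }
};

// A private L1 whose lines self-invalidate once their lease expires, so no
// invalidation messages are needed when another SM writes the block.
class L1Cache {
public:
    explicit L1Cache(L2Backend& l2) : l2_(l2) {}

    uint32_t read(uint64_t addr, uint64_t logical_now) {
        auto it = lines_.find(addr);
        if (it == lines_.end() || logical_now > it->second.lease_end) {
            CacheLine fresh;                        // miss or expired copy: refetch
            fresh.data = l2_.read(addr);
            fresh.lease_end = logical_now + LEASE;  // renew the lease on refill
            lines_[addr] = fresh;
            return fresh.data;
        }
        return it->second.data;                     // lease still valid: hit
    }

private:
    L2Backend& l2_;
    std::unordered_map<uint64_t, CacheLine> lines_;
};

int main() {
    L2Backend l2;
    l2.write(0x40, 7);
    L1Cache l1(l2);
    std::cout << l1.read(0x40, 1)  << "\n";  // miss: fetches 7, lease runs to tick 17
    l2.write(0x40, 9);                       // update from another SM
    std::cout << l1.read(0x40, 5)  << "\n";  // within the lease: still returns 7
    std::cout << l1.read(0x40, 40) << "\n";  // lease expired: refetches and returns 9
    return 0;
}

In such a scheme, expired copies are simply refetched rather than explicitly invalidated, which is the source of the coherence-traffic reduction claimed above.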