Enabling Energy Efficient and Secure Execution of Concurrent Kernels on Graphics Processing Units

by

Qiumin Xu

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2017

Copyright 2017 Qiumin Xu

Dedication

To my mother, Hong Zhao, whose love, courage, wisdom and devotion have been the strength of my striving.

Acknowledgements

I would like to express my heartfelt gratitude to my advisor, Professor Murali Annavaram, for his continuous support of my Ph.D. study. He has been a tremendous advisor who led me into this exciting research area. He is always kind, patient, enthusiastic and open to giving advice on research, career and even life. He has always been a role model, who constantly encouraged me to pursue an academic career and to aim high. I especially appreciate that he allowed me the freedom to explore new ideas at my own pace throughout my Ph.D. study. I have greatly benefited from many illuminating discussions with him. Without his guidance and persistent help, this dissertation would not have been possible. I would also like to thank his wife Sok for making the yard parties the most fun and warm memories.

In addition to my advisor, I would particularly like to thank my thesis committee, Professor Timothy Pinkston and Professor Wyatt Lloyd, for being always supportive and providing insightful suggestions to improve my thesis. I would also like to thank Professor Massoud Pedram and Professor Xuehai Qian for serving on my qualifying exam committee and as references. I am grateful to my fellow labmate Hyeran Jeon for the stimulating discussions, for the sleepless nights when we scrambled to make deadlines, and for the fun we have had in the last six years. Special thanks go to my other collaborators: Keunsoo Kim and Professor Won Woo Ro from Yonsei University, as well as Hoda Naghibijouybari and Professor Nael Abu-Ghazaleh from the University of California, Riverside. This thesis would not have been possible without their help.

I would like to thank my internship mentors, Dr. Manu Awasthi, Dr. Ananth Nallamuthu, Dr. Mrinmoy Ghosh and Ishita Majumdar, for providing me the opportunity to learn cutting-edge technologies in industry. Working with you has been a fantastic learning experience, which has enriched my mind and been a great boon to my research. I would also like to thank the many co-workers who offered me a lot of help and career advice during my internships: Vijay, Anahita, Zvika, Harry, Jingpei, Janki, Krishna, Aravindh, David, George, Sheshadri, Imran and Chris.

At the University of Southern California, I received support from the faculty, staff and students. My graduate experience benefited greatly from the courses I took and the opportunity I had under Professor Monte Ung, Professor Mark Redekopp and Professor Mary Eshaghian to serve as a teaching assistant. I also thank Professor Bhaskar Krishnamachari, who has always been very kind and passionate and constantly shares insights on new technologies with the students. I would also like to thank our best EE staff, Diane Demetras, Tim Boston, Estela Lopez and Kathy Kassar, for tirelessly offering prompt help and advice during my last six years.
I thank my other fellow labmates in the SCIP group: Daniel Wong, Sangwon Lee, Gunjae Koo, Abdulaziz Tabbakh, Krishna Giri Nara, Zhifeng Lin, Yunho Oh, Sangpil Lee, Myung Kuk Yoon, Kiran Matam, Ruisheng Wang, Lizhong Chen, Lakshmi Kumar Dabbiru, Mehrtash Manoochehri, Mohammad Abel-Majeed, Waleed Dweik, Melina Demertzi and Bardia Zandian, who supported me, inspired me and incentivized me to strive for my goal. I also thank my roommates, Yi Gai, Shangxing Wang and Ying Lu, and other friends at USC for the joy we shared together and the emotional support needed to survive the Ph.D. journey we shared.

Last but not least, I owe my deep gratitude to my beloved family: my parents, grandparents, aunts, uncles and Sean, for their unconditional love and support throughout the writing of this thesis and my life.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 GPU is a Critical Compute Engine Today
  1.2 Introduction to GPU Computing
    1.2.1 GPU Software Model
    1.2.2 GPU Hardware Architecture
    1.2.3 SM Pipeline
    1.2.4 Branch Divergence
  1.3 Growing Amount of GPU Resources
  1.4 Challenge of Resource Underutilization in GPUs
  1.5 Thesis Statement
    1.5.1 Proposal #1: Pattern Aware Scheduling for GPU Power Gating
    1.5.2 Existing GPU Concurrent Kernel Support
    1.5.3 Proposal #2: Intra-SM slicing for GPU Multiprogramming
    1.5.4 Security Problems of GPU Concurrent Kernel Execution
    1.5.5 Proposal #3: Eliminating Covert Channel Attacks on GPUs
  1.6 Dissertation Organization
2 Resource Underutilization on GPUs: A Case Study on Graph Applications
  2.1 Chapter Overview
    2.1.1 Graph Processing on GPUs
  2.2 Methodology
    2.2.1 Graph Applications
    2.2.2 Non-graph Applications
    2.2.3 Experimental Environment
  2.3 Motivational Analysis
    2.3.1 Kernel Execution Pattern
    2.3.2 Performance Bottlenecks
    2.3.3 SRAM Resource Sensitivity
    2.3.4 SIMT Lane Utilization
    2.3.5 Execution Frequency of Instruction Types
    2.3.6 Coarse and Fine-grain Load Balancing
    2.3.7 Scheduler Sensitivity
  2.4 Improve Graph Processing Efficiency
    2.4.1 Reduce Performance Bottleneck
    2.4.2 Reduce Load Imbalance
  2.5 Related Works
  2.6 Chapter Summary
3 PATS: Pattern Aware Scheduling and Power Gating for GPUs
  3.1 Chapter Overview
  3.2 Explore the Divergence Patterns
  3.3 Challenges in Per Lane Power Gating in GPUs
  3.4 Pattern Aware Two-level Scheduler (PATS)
    3.4.1 Design Issues of Pattern Aware Two-level Scheduler (PATS)
    3.4.2 Gating Penalty Avoidance Using Deterministic Lookahead
    3.4.3 Pattern and Instruction Aware Scheduler (PATS++)
  3.5 Evaluation
    3.5.1 Methodology
    3.5.2 Static Energy Impact
    3.5.3 Performance Impact
    3.5.4 Sensitivity to Power Gating Parameters
    3.5.5 Implementation Complexity
  3.6 Related Work
  3.7 Chapter Summary
4 Warped-Slicer: Intra-SM Slicing for Efficient Concurrent Kernel Execution on GPUs
  4.1 Chapter Overview
    4.1.1 GPU Multiprogramming
  4.2 Methodology and Motivation
    4.2.1 Methodology
    4.2.2 Motivational Analysis
  4.3 Intra-SM Slicing
  4.4 Intra-SM Resource Partitioning Using Water-Filling
    4.4.1 Profiling Strategy
    4.4.2 Dealing with Phase Behavior
  4.5 Evaluation
    4.5.1 Performance
    4.5.2 Resource Utilization
    4.5.3 Cache Misses
    4.5.4 Stall Cycles
    4.5.5 Multiple Kernels Sharing SM
    4.5.6 Fairness Metrics
    4.5.7 Power and Energy Analysis
    4.5.8 Sensitivity Analysis
    4.5.9 Implementation Overhead
  4.6 Related Work
  4.7 Chapter Summary
5 GPUGuard: Eliminating Covert Channel Attacks on GPUs
  5.1 Chapter Overview
    5.1.1 Timing Attacks on GPUs
    5.1.2 Temporal Partitioning
    5.1.3 Spatial Partitioning
  5.2 GPUGuard Design
    5.2.1 Timing Attack Detection
      5.2.1.1 Decision Tree Classifier
      5.2.1.2 Feature Dataset Extraction
      5.2.1.3 Feature Selection
    5.2.2 Timing Attack Defense: Security Domain Hierarchy
  5.3 Tangram: Intra-SM Security Domains for Eliminating Timing Attacks
    5.3.1 Intra-SM Security Domains in Tangram
    5.3.2 Tangram Security Unit
  5.4 Evaluation
    5.4.1 Methodology
    5.4.2 Sensitivity Analysis
    5.4.3 Robustness of the Detection Scheme
    5.4.4 Hardware Overhead
  5.5 Related Work
  5.6 Chapter Summary
6 Conclusions
Reference List

List of Tables

1.1 Comparison of NVIDIA Tesla GPUs released in the past 10 years [24, 17]
2.1 Experimental environment
2.2 12 Graph applications collected from various sources (V: Vertices, E: Edges)
3.1 5 most common divergence patterns for 19 benchmarks.
3.2 5 most common divergence patterns for 19 benchmarks (Cont.).
4.1 Baseline configuration
4.2 Resource utilization fluctuates across 10 GPU applications (arithmetic mean of all cores across total cycles).
4.3 Resource partitioning when Warped-Slicer (Dyn) and Even multiprogramming algorithms are used.
5.1 All collected features to create dataset.
5.2 Class labels for different applications.
5.3 Baseline configuration
5.4 Confusion matrix (P: Predicted, A: Actual)

List of Figures

1.1 Comparison of the peak performance of executing FP32 instructions for NVIDIA GPUs (Source: [24, 17]) and Intel CPUs (* reported by the SGEMM benchmark in Geekbench 4 [8], with the number of cores and frequency marked in the label).
1.2 Illustration of the GPU software model
1.3 GPU SIMT architecture
1.4 Basics of GPU architecture
1.5 GPU warp scheduler and register file
1.6 GPU pipeline
1.7 GPU branch divergence
1.8 Resource underutilization in graph applications
1.9 Load distribution: # of assigned CTAs across SMs
1.10 Task management hierarchy of GPUs
2.1 Kernel function invocation count for graph and non-graph applications.
2.2 Average kernel function execution times.
2.3 Breakdown of reasons of pipeline stalls
2.4 Utilization of SRAM structures
2.5 Normalized IPC w.r.t. cache size
2.6 Cache miss rate
2.7 L1 cache misses as a fraction of total accesses to the GPU L1 cache and shared memory
2.8 SIMT lane utilization
2.9 Execution frequency of instruction types
2.10 Coarse grained load distribution: # assigned CTAs across SMs
2.11 Fine grained load distribution: (a) execution time variance across CTAs and (b) coefficient of execution time variation across warps within a CTA
2.12 Performance w.r.t. scheduler
3.1 Illustration of divergence pattern behavior
3.2 Divergence patterns for divergent benchmarks
3.3 Divergence control flow of pathfinder
3.4 Divergence control flow of bfs
3.5 Effect of warp scheduler on idle cycles
3.6 Actual execution trace of the INT units collected from the hotspot benchmark
3.7 Illustrations of warp charger state machine
3.8 Illustrations of the deterministic look ahead rule (LAR)
3.9 Static energy impact of proposed techniques.
3.10 Performance impact
3.11 Sensitivity of power gating parameters
4.1 Fraction of total cycles (of all cores) during which warps cannot be issued due to different reasons.
4.2 Illustration of proposed storage resource allocation strategies for improving resource fragmentation.
4.3 (a) Performance vs. increasing CTA occupancy in one SM, (b) identify the performance sweet spot.
4.4 Illustration of the proposed profiling strategy used in Warped-Slicer.
4.5 Sampling the program characteristics using a 5K-cycle sampling window.
4.6 Performance results of all 30 pairs of applications: (a) Compute + Cache. (b) Compute + Memory. (c) Compute + Compute. The results are normalized to the baseline Left-Over policy. GMEAN shows the overall geometric mean performance across the three workload categories.
4.7 Assorted resource utilization of the proposed Warped-Slicer normalized by even partitioning policy
4.8 Cache miss rates
4.9 Breakdown of total stall cycles
4.10 Performance result when combining three applications in an SM.
4.11 Comparison of fairness improvement (Normalized to Left-Over Policy) and ANTT reduction between multiprogramming policies.
4.12 Performance sensitivity to profiling parameters and warp schedulers.
5.1 Illustration of temporal partitioning technique on GPUs
5.2 Illustration of existing spatial partitioning techniques on GPUs
5.3 Overall design of the GPUGuard
5.4 GPUGuard will select a security domain level based on a specific attack type (our contributions are highlighted)
5.5 Intra-SM security protection for eliminating timing attacks (shaded units are our modifications)
5.6 Performance impact of Tangram and GPUGuard. The results are normalized to temporal partitioning.
5.7 Performance impact of GPUGuard on the Kepler architecture
5.8 Comparison of the detection miss rate of GPUGuard and enhanced GPUGuard when the attacker tries to evade detection by lowering the attacking bandwidth by 2x, 10x and 100x, respectively.

Abstract

The graphics processing unit (GPU) is the computing platform of choice for many massively parallel applications, including high performance scientific computing, machine learning and artificial intelligence. However, GPU energy efficiency has been significantly curtailed by severe resource underutilization.

This thesis first presents a solution to improve energy efficiency by power gating unused resources. Pattern aware two-level scheduling (PATS) is proposed to deal with divergent execution patterns and improve power gating efficiency. However, PATS alone is insufficient, since leaving the resources built into the GPU unused is still wasteful. Concurrent kernel execution sheds light on resolving the resource underutilization issue through co-execution of kernels with complementary resource usage demands. However, it also brings to the forefront the vulnerability to covert channel attacks. This thesis therefore evaluates two facets of concurrent kernel execution: first, dynamic intra-SM slicing to enable efficient sharing of resources within an SM across a scalable number of kernels; second, a machine learning based intra-SM defense scheme that can reliably close the covert channels.

Chapter 1
Introduction

1.1 GPU is a Critical Compute Engine Today

The graphics processing unit (GPU) is the execution platform of choice for many massively parallel applications. Traditionally, GPUs were targeted to execute only graphics-oriented applications. With the advent of new programming models, such as CUDA [14] (first released in 2007) and OpenCL [117] (first released in 2009), the barrier to programming GPUs has been significantly reduced. Since these programming models support complex control-flow operations, software developers are able to program even general purpose applications to execute on GPUs. The availability of massive parallel computing power has attracted many researchers and developers to parallelize physics simulations [147], bioinformatics [144], financial applications [66] and many other scientific computations to run on GPUs. Prior works have demonstrated over 100x speedup when executing applications on GPUs, compared to running them on CPUs [63, 88, 108, 57]. For instance, Fang et al. [57] reported a parallel Monte Carlo algorithm for modeling photon migration in 3D turbid media that achieved a 300x speedup on an NVIDIA 8800GT (G92) GPU compared to an Intel Xeon CPU.
While there is an ongoing debate on how much GPUs can actually speed up applications over CPUs [98, 16], the march towards adopting GPUs for general purpose computing has steadily gained traction.

[Figure 1.1: Comparison of the peak performance of executing FP32 instructions for NVIDIA GPUs, from the 8800GTX to the V100 (Source: [24, 17]), and Intel CPUs, from the E4300 to the E5-2699 v5 (* reported by the SGEMM benchmark in Geekbench 4 [8], with the number of cores and frequency marked in the label); plotted as FP32 Gflops versus approximate release date.]

Figure 1.1 compares the peak throughput of executing 32-bit floating point instructions (FP32) for GPUs and server-class CPUs. FP32 FLOPS has been an important metric for measuring compute capability. It is clear from the graph that despite increasing core and ALU counts with each new generation of CPU, the gap in FP32 Gflops between GPU and CPU has been widening. Given their superior FP32 performance, GPUs are becoming the main computing platform for accelerating even big data processing. The popularity of GPUs can also be inferred from millions of CUDA downloads, hundreds of universities worldwide teaching GPU-centric application acceleration, and, most importantly, hundreds of thousands of GPUs powering supercomputing centers all over the world [76]. Today, GPUs have become a critical computing engine, widely deployed in data centers [71], cloud platforms [9, 1] and TOP500 supercomputers [12].

1.2 Introduction to GPU Computing

The reason GPU computing is becoming so powerful for large scale data processing lies in its unique computing model and hardware architecture. We provide an overview of the GPU architecture below.

1.2.1 GPU Software Model

Different from CPUs, which are mainly optimized for single thread applications with limited parallelism, GPUs are optimized for accelerating large data parallel applications that execute the same sequence of operations repeatedly on a massive amount of data. Therefore, a throughput metric, such as the number of floating point operations per second (FLOPS), rather than single thread latency, is the first priority for GPUs. GPUs adopt the single instruction multiple thread (SIMT) execution model, where a single instruction is executed concurrently on several execution lanes (called SIMT lanes) using multiple data items. The primary difference between SIMT and the more well known SIMD (single instruction multiple data) execution model is that SIMT allows the parallel threads to diverge and take different execution paths during control flow operations.

[Figure 1.2: Illustration of the GPU software model. An application is decomposed into kernels, kernels into CTAs, and CTAs into warps of 32 threads that execute in lock step by sharing a PC; the GPU issues CTAs to SMs.]

Figure 1.2 shows the GPU software model. For simplicity, we will use NVIDIA terminology throughout this thesis. A GPU application consists of multiple kernels, where each kernel performs a specific function on a large dataset by exploiting the massive SIMT parallelism.
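As a concrete illustration of this hierarchy, the following minimal CUDA sketch (a hypothetical element-scaling kernel, not taken from the thesis workloads) shows how host code launches a kernel as a grid of thread blocks, each containing a fixed number of threads:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread scales one element of a large array.
// The number of thread blocks (CTAs) and threads per block are chosen at launch time.
__global__ void scale(float *data, float alpha, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (tid < n)
        data[tid] *= alpha;                           // same operation, different data (SIMT)
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    // Launch configuration: 256 threads per block (8 warps of 32 threads),
    // and enough blocks (CTAs) to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```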
Each kernel is composed of thousands of threads, which are usually arranged in a three-dimensional grid format. The 3D grid format of threads maps quite well to the multi-dimensional datasets that these threads operate on. The threads in a kernel are sub-divided into cooperative thread arrays (CTAs). Each CTA is essentially a slice of the 3D grid of threads that make up the kernel.

Each CTA is assigned to execute on a streaming multiprocessor (SM). An SM is a parallel execution engine consisting of multiple integer, floating point and special function units, and it is equipped with a small on-chip cache and memory (a more detailed SM description is provided in the next section). All the hardware resources required for executing a CTA, such as the number of architectural registers and the size of the shared memory, may be determined statically. Furthermore, the size of each CTA within a kernel is identical. To reduce scheduling complexity, once CTAs are assigned to an SM they cannot be preempted and must continue to execute until the CTA execution is complete. Since an SM has a limited amount of hardware resources, namely register file capacity and shared memory, and since CTA execution is non-preemptive, it is possible to determine the maximum number of CTAs that may fit within each SM at the time a kernel is launched. SMs may also limit the maximum CTA count due to other hardware and implementation constraints.

Each CTA itself is sub-divided into groups of threads (typically 32 threads), and each of these thread groups is called a warp. The threads within a warp execute the same program in lock step. They share a Program Counter (PC) but access different data operands. A warp is the minimum unit of computation scheduled on an SM. Each CTA consists of tens of warps, and since each SM holds multiple CTAs, dozens of warps co-reside within an SM.

[Figure 1.3: GPU SIMT architecture. A single instruction fetch/decode and simple control drive data-parallel, conditionally multithreaded execution across warps.]

As shown in Figure 1.3, all threads within a warp need to fetch and decode only a single instruction, thereby amortizing the cost of generating control information. Furthermore, threads within a warp do not have data dependences and do not need to synchronize. Hence, the simplified control coupled with massive parallel hardware makes GPUs an energy efficient choice for throughput-oriented computing. We discuss how warps handle branch instructions and the associated control flow divergence later in this chapter.

1.2.2 GPU Hardware Architecture

There are two types of GPUs: discrete and integrated. Integrated GPUs are on the same chip as the host CPU and share the same physical memory with the host. For example, the AMD Accelerated Processing Unit (APU) features GPUs and CPUs on the same chip [2]. Discrete GPUs, on the other hand, are on a separate chip and have their own dedicated memory. In this thesis we do not make any specific assumptions on GPU integration. However, the NVIDIA GPUs mentioned in Figure 1.1 are available as discrete GPU cards that may be placed into a PCIe slot of a desktop computer. As such, in this thesis we focus on discrete GPUs, which communicate with the CPU host through the PCIe bus.
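The per-SM limits that bound CTA residency (registers, shared memory, thread counts) can be queried at run time through the standard CUDA runtime API. The sketch below prints a few standard cudaDeviceProp fields; the achievable number of resident CTAs for a particular kernel can additionally be estimated with helpers such as cudaOccupancyMaxActiveBlocksPerMultiprocessor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("SMs:                   %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d threads\n", prop.warpSize);
    printf("Registers per SM:      %d (32-bit)\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM:  %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```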
[Figure 1.4: Basics of GPU architecture. Each streaming multiprocessor (SM) contains fetch/decode logic, an instruction cache, warp schedulers, a register file, INT/FP cores, load/store units, special function units, shared memory/L1D cache and constant and texture caches; the SMs connect through an interconnection network to a shared L2 cache and GDDR5 DRAM, and the GPU communicates with the CPU host over the PCIe bus.]

Figure 1.4 shows our baseline GPU architecture model, which is based on NVIDIA's Fermi architecture [15]. A GPU consists of a scalable number of streaming multiprocessors (SMs). Each SM comprises many execution units, including integer/floating point units for arithmetic operations, load/store units for memory operations and special function units (SFUs) for instructions such as sine, cosine, and square root. The GPU also provides various on-chip memories, including a large register file, shared memory, an L1 data cache, a constant cache and a texture cache. The LD/ST units are connected to the L1 data cache and shared memory. The SMs are connected to a large shared L2 cache through an interconnection network. Our baseline model uses a crossbar switch for interconnections. The shared L2 cache is then connected to the off-chip DRAM through a GDDR5 interface.

GPUs exploit the massive number of parallel threads to tackle stalls in warp execution. Instead of relying on complex latency avoiding techniques, GPUs simply switch between the dozens of warps that co-reside in an SM. The fact that GPUs allow multiple warps to be switched very quickly, on a cycle by cycle basis, is the key to the GPU's latency hiding capability. To enable fast switching, the architectural state of all warps is maintained in a very large register file within an SM. All the registers of all the warps are stored in this large on-chip register file. Therefore, GPUs minimize the context switching cost of switching from one warp to another.

The large register file is usually designed as a multi-banked register file that is shared across all the warps executing on the same SM. The register file size ranges from 128KB per SM in the NVIDIA GTX480 Fermi architecture [15] to 256KB per SM in the NVIDIA V100 Volta architecture [24]. In NVIDIA V100, the total register file across the chip is 20MB, which is more than three times the L2 cache size. Since the register demand of each warp is known a priori, and all the warps from the same kernel have the same register demand, the multi-banked register file is split cleanly across warps, as shown on the right in Figure 1.5.

[Figure 1.5: GPU warp scheduler and register file. The warp scheduler selects from a warp pool of up to 48-64 warps and issues to the INT/FP, LD/ST and SFU pipelines; the 128KB-256KB register file is organized into banks, with each warp's registers held in its own slice.]

Figure 1.5 illustrates how the warp scheduler works. Each SM contains one or more warp schedulers, and each scheduler is able to issue one warp instruction to any execution pipeline, namely INT/FP, LD/ST or SFU, in every scheduling cycle. Among all the warps that are ready to issue, a warp is selected by the warp scheduler according to a warp scheduling algorithm. A very basic warp scheduling algorithm is round-robin: warps are issued to execution sequentially based on warp ID. For better performance, many alternative warp scheduling algorithms have been proposed, including two-level scheduling [120], greedy-then-oldest scheduling and cache conscious scheduling [133].
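To make these policies concrete, the following simplified C++ sketch contrasts round-robin and greedy-then-oldest (GTO) warp selection. It is illustrative only; the WarpState fields and function names are this sketch's own, not the simulator's.

```cpp
#include <cstdint>
#include <vector>

// Simplified warp scheduler model. Each cycle the scheduler picks one
// ready warp according to its policy; -1 means no warp can issue.
struct WarpState {
    bool ready;        // has a decoded instruction with no hazards
    uint64_t age;      // cycle at which the warp's oldest instruction was fetched
};

// Round-robin: resume the search one past the warp issued last cycle.
int pick_round_robin(const std::vector<WarpState>& warps, int last_issued) {
    int n = (int)warps.size();
    for (int i = 1; i <= n; ++i) {
        int w = (last_issued + i) % n;
        if (warps[w].ready) return w;
    }
    return -1;  // issue slot becomes a bubble
}

// Greedy-then-oldest: keep issuing from the same warp while it stays ready;
// otherwise fall back to the oldest ready warp.
int pick_gto(const std::vector<WarpState>& warps, int last_issued) {
    if (last_issued >= 0 && warps[last_issued].ready) return last_issued;
    int best = -1;
    for (int w = 0; w < (int)warps.size(); ++w)
        if (warps[w].ready && (best < 0 || warps[w].age < warps[best].age))
            best = w;
    return best;
}
```

Two-level scheduling, for example, restricts this search to a small active subset of warps, but the per-cycle selection structure is the same.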
1.2.3 SM Pipeline

[Figure 1.6: GPU pipeline. Fetch and decode feed a per-warp I-buffer; the warp scheduler/issue stage consults the scoreboard, the operand collector arbitrates reads across the register banks, and instructions execute on the SP, LD/ST or SFU pipelines before writeback. Example I-buffer and scoreboard entries are shown for warps W0, W1 and W2.]

Figure 1.6 illustrates the details of an SM pipeline. Each SM uses an instruction buffer (I-Buffer) to store the decoded instructions for each warp. The I-buffer size is scaled according to the maximum number of warps supported within each SM. Each entry in the instruction buffer holds the decoded instruction, a valid bit and a ready bit. The valid bit is set when the instruction is decoded and filled into the buffer entry. The ready bit is set by the scoreboard logic when the operands for the instruction are ready. In our baseline architecture, there are two entries for each warp in the I-Buffer. The per-warp instruction buffer separates the sequential and parallel portions of the pipeline. Before the instruction buffer is the simple control logic which performs instruction fetch and decode. Recall that all threads in a warp execute the same instruction, and hence only a single instruction fetch and decode is done for all warp threads. In each cycle, the fetch unit checks the I-buffer entries associated with each warp in a round robin manner and fetches an instruction for the first warp which has at least one empty I-buffer entry. The decode unit then decodes the instruction, fills the I-Buffer and sets the valid bit.

In each cycle, the warp scheduler selects one warp from the warp pool to execute. In the order defined by the warp scheduling algorithm, the warp scheduler checks whether the selected warp is ready to issue. The I-Buffer is first checked to see if the warp has a valid instruction to issue; the warp is stalled if its I-Buffer entries are empty. A scoreboard is used to track RAW and WAW hazards for the operands of each warp, and the warp will not be scheduled if such hazards exist. If an input operand of the current warp instruction depends on a prior load that must be fetched from global memory, the warp is stalled on a long memory latency stall. The warp could also be waiting on synchronizations or memory barriers. Otherwise, if the warp instruction is ready to issue, the warp scheduler checks whether there is a free execution unit to issue the instruction to. Many instructions occupy an execution unit for multiple cycles; for example, the SFU can be occupied for multiple cycles to execute a square root instruction. In some cases the SFU is not pipelined, and hence no other warp can use the same SFU until a previously issued warp completes execution. This is an example of an execution unit structural hazard. If there is no hazard, the warp instruction is issued.

An issued warp instruction's operand register values are read from the multi-banked register file over multiple cycles, which is performed by a unit called the operand collector (OC). The operand collector unit handles any potential bank conflicts while reading the register files and temporarily buffers the operands in operand collector buffers. The operand collector buffers allow the SM to handle bank conflicts with ease. A warp instruction is ready for execution when the input operands are all read into the buffers and the necessary execution pipeline is available.

Note that if a single warp is stalled for various reasons, such as waiting for a load from memory, the warp scheduler can select the next warp to issue to keep the pipeline busy. However, if all the warps are stalled from issue, the issue slot is wasted and the pipeline will have a bubble during that scheduling slot.
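The per-cycle issue checks just described can be summarized in code. The C++ sketch below is illustrative only (the structure and field names are ours, not the simulator's); it mirrors the sequence of checks: I-buffer validity, scoreboard hazards, barrier and memory stalls, and execution-unit availability.

```cpp
#include <array>

enum class Unit { INT_FP, LD_ST, SFU };

struct IBufferEntry {
    bool valid = false;      // a decoded instruction is present
    bool ready = false;      // scoreboard: no RAW/WAW hazard on operands
    Unit unit  = Unit::INT_FP;
};

struct Warp {
    std::array<IBufferEntry, 2> ibuf;  // two I-buffer entries per warp (baseline)
    bool waiting_on_barrier = false;   // synchronization or memory barrier pending
    bool waiting_on_memory  = false;   // operand depends on an outstanding global load
};

struct ExecUnits {
    bool int_fp_free = true, ld_st_free = true, sfu_free = true;
    bool is_free(Unit u) const {
        switch (u) {
            case Unit::INT_FP: return int_fp_free;
            case Unit::LD_ST:  return ld_st_free;
            default:           return sfu_free;   // SFU may stay busy for many cycles
        }
    }
};

bool can_issue(const Warp& w, const ExecUnits& units) {
    const IBufferEntry& e = w.ibuf[0];
    if (!e.valid) return false;               // I-buffer empty for this warp
    if (!e.ready) return false;               // RAW/WAW hazard reported by the scoreboard
    if (w.waiting_on_barrier) return false;   // barrier stall
    if (w.waiting_on_memory)  return false;   // long memory latency stall
    return units.is_free(e.unit);             // structural hazard check
}
```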
1.2.4 Branch Divergence

[Figure 1.7: GPU branch divergence. A branch such as "if (tid % 2 == 0)" splits a warp into a taken path and a not-taken path, executed under complementary active masks (101010... and 010101...).]

GPU branch divergence incurs execution inefficiencies and can hurt performance. It has been shown in several recent works [61, 93, 115, 130, 149] that when general purpose applications are ported to run on GPUs, these applications fail to fully utilize all the available hardware resources due to branch divergence. The problem of branch divergence is illustrated in Figure 1.7. Whenever a branch instruction is executed within a warp, some threads may take the branch while other threads traverse the not-taken path. Due to the SIMT execution model, the GPU serializes the two branch paths. Thus, all threads on the taken path are idle while the not-taken path executes, and vice versa. To independently control the execution behavior of each thread within a warp, an active mask vector is used. Each thread's execution is controlled by whether the corresponding active mask bit is set or reset. If the bit is set, the corresponding thread executes and updates the architectural state. If the bit is unset, that thread's computation does not update any architectural state. The SIMT execution model requires all threads on the taken and not-taken paths to be executed sequentially.
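The even/odd split sketched in Figure 1.7 corresponds to kernel code of the following shape (a hypothetical example, shown only to make the serialization of the two paths concrete):

```cuda
// Hypothetical divergent kernel matching the Figure 1.7 example: neighboring
// threads in a warp take opposite sides of the branch, so the two paths are
// serialized, each executing under a complementary active mask.
__global__ void divergent(int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) {
        out[tid] = tid * 2;   // only even-numbered lanes active on this path
    } else {
        out[tid] = tid + 1;   // only odd-numbered lanes active on this path
    }
}
```

Every element of out is eventually written, but the warp executes the two paths one after the other, so the SIMT lanes are at most half utilized during the divergent region.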
1.3 Growing Amount of GPU Resources

From the GPU architecture description provided in the previous section, it is clear that GPUs rely on a large number of parallel hardware resources to enable concurrent execution of warps. Driven by technology scaling, the execution resources grow with each new generation of GPU. Table 1.1 lists the detailed specifications of NVIDIA's high performance GPUs over the last 10 years. From the C2050 (Fermi) to the V100 (Volta), the manufacturing process scaled down from 40nm to 12nm, and the transistor count increased by six times with a 50% increase in die size. According to the International Roadmap for Semiconductors (ITRS) published in 2015, the manufacturing process will continue to scale down to 5nm in 5 years and eventually to 2nm in 2030 [21]. As such, the total number of available transistors on a die may continue to increase in the future.

Table 1.1: Comparison of NVIDIA Tesla GPUs released in the past 10 years [24, 17]

                           C2050 (Fermi)   K40 (Kepler)      M40 (Maxwell)   P100 (Pascal)   V100 (Volta)
SMs                        14              15                24              56              80
FP32 Cores / SM            32              192               128             64              64
FP32 Cores / GPU           448             2880              3072            3584            5120
FP64 Cores / SM            NA              64                4               32              32
FP64 Cores / GPU           NA              960               96              1792            2560
Tensor Cores / SM          NA              NA                NA              NA              8
Tensor Cores / GPU         NA              NA                NA              NA              640
GPU Boost Clock            NA              810/875 MHz       1114 MHz        1480 MHz        1462 MHz
Register File Size / SM    128 KB          256 KB            256 KB          256 KB          256 KB
Register File Size / GPU   1792 KB         3840 KB           6144 KB         14336 KB        20480 KB
Shared Memory Size / SM    16KB/48KB       16KB/32KB/48KB    96 KB           64 KB           up to 96 KB
Texture Units              56              240               192             224             320
Memory Interface           384-bit GDDR5   384-bit GDDR5     384-bit GDDR5   4096-bit HBM2   4096-bit HBM2
Memory Size                Up to 6 GB      Up to 12 GB       Up to 24 GB     16 GB           16 GB
L2 Cache Size              768 KB          1536 KB           3072 KB         4096 KB         6144 KB
TDP                        245 Watts       235 Watts         250 Watts       300 Watts       300 Watts
Transistors                3.2 billion     7.1 billion       8 billion       15.3 billion    21.1 billion
GPU Die Size               526 mm2         551 mm2           601 mm2         610 mm2         815 mm2
Manufacturing Process      40 nm           28 nm             28 nm           16 nm FinFET+   12 nm FFN

With more and more transistors, both the number of SMs and the resources inside a single SM have been growing. The C2050 has only 14 SMs, while the V100 has 80 SMs. Inside an SM, the number of FP32 cores has doubled, and additional FP64 and tensor cores have been added. The register file size, shared memory size and texture units have also increased. Across the GPU, these additional per-SM resources add up to a huge increase at the chip level for each type of resource. Furthermore, the L2 cache size has increased from 768 KB to 6144 KB to serve the increased number of SMs. Given these many resources, one of the emerging problems for GPUs is to keep them fully utilized even in the presence of irregular application behavior.

1.4 Challenge of Resource Underutilization in GPUs

The energy efficiency of GPUs can be significantly curtailed by inefficient resource usage. Traditional graphics-oriented applications are successful in exploiting the available resources to improve throughput. As mentioned before, with the advent of new programming models, general purpose applications are also relying on GPUs to derive the benefits of energy-efficient throughput computing. However, resource demands across general purpose applications can vary significantly, leading to resource underutilization [84, 123, 25, 81, 26, 97, 173, 67, 28, 103, 164, 167]. For example, the execution lanes can be underutilized when branch divergence occurs. In memory bound applications, the memory bandwidth of GPUs acts as a performance bottleneck, and the streaming cores are not fully utilized. Furthermore, new application classes such as graph algorithms do not scale well and may encounter an even more severe resource underutilization problem, and hence a large amount of power is wasted.

[Figure 1.8: Resource underutilization in graph applications. (a) SIMT lane utilization, broken down into the M0_8, M9_16, M17_24, M25_31 and M32 categories. (b) Percentages of pipeline stalls, broken down into barrier synchronization, control hazard or I-buffer empty, long memory latency, short RAW hazard, execution structure hazard, and others.]

To quantify the resource underutilization problem, Figure 1.8a shows the breakdown of SIMT lane utilization while running graph applications on a Tesla C2050 Fermi GPU, collected from our experimental infrastructure. We present a detailed description of our simulation and emulation infrastructure in later chapters. The bar labeled M0_8 plots the fraction of instructions that are executed on at most 8 SIMT lanes. Similarly, M9_16 shows the fraction of instructions that are executed on 9 to 16 SIMT lanes, and so on. M32 denotes the portion of instructions that fully utilize the available 32 SIMT lanes. In all the graph applications except APSP, as shown in Figure 1.8a, the SIMT lane utilization varies considerably. For example, in SSSP, over 87% of instructions are executed by only 0 to 8 SIMT lanes, which means the remaining 24 SIMT lanes are idle during 87% of the execution time. Overall, the execution lanes are underutilized during almost half of the total execution cycles.

We also measured the pipeline stalls and analyzed the primary reason for each stall. Figure 1.8b shows the percentage of various pipeline stalls. On average, the pipelines are stalled 78% of the total execution time. Long memory latency is the biggest bottleneck, causing over 70% of all pipeline stalls in graph applications. The execution structure hazard and short RAW data dependency are the second and third most prominent bottlenecks.
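As a methodological aside, the lane-utilization categories plotted in Figure 1.8a can be computed directly from per-instruction active masks. The following is a minimal sketch assuming access to a trace of 32-bit active masks, one per executed warp instruction; it is our own illustrative code, not the profiling infrastructure itself.

```cpp
#include <array>
#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <vector>

// Bin each executed warp instruction by how many of its 32 SIMT lanes were
// active, reproducing the M0_8 / M9_16 / M17_24 / M25_31 / M32 categories.
std::array<double, 5> lane_utilization(const std::vector<uint32_t>& masks) {
    std::array<uint64_t, 5> count{};  // M0_8, M9_16, M17_24, M25_31, M32
    for (uint32_t m : masks) {
        int lanes = std::popcount(m);
        if      (lanes <= 8)  count[0]++;
        else if (lanes <= 16) count[1]++;
        else if (lanes <= 24) count[2]++;
        else if (lanes <= 31) count[3]++;
        else                  count[4]++;
    }
    std::array<double, 5> frac{};
    for (int i = 0; i < 5; ++i)
        frac[i] = masks.empty() ? 0.0 : double(count[i]) / double(masks.size());
    return frac;
}
```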
Graph applications by nature work with large datasets, and hence they experience more memory related stalls. Due to ineffective cache usage, the impact of long memory latency is higher in graph applications than in regular applications.

[Figure 1.9: Load distribution: # of assigned CTAs across SMs, shown as one radar chart per benchmark input (AGM_coAuthor, AGM_in2004, APSP_1k, APSP_4k, BFS_1MW, BFS_4096, BFS_65536, CCL_logo, CCL_trojan, GCL_belgium, GCL_coAuthor, GCL_email, GCO_hood, GCO_pwtk, GCO_flower, GCU_person, MIS_128, MIS_512, MST_fla, MST_rmat12, PR_wiki, PR_edge14, SP_17k, SP_42k, SSSP_fla, SSSP_rmat20).]

We also measured the number of CTAs assigned to each SM. Ideally, each SM should do the same amount of work for a balanced workload execution. Figure 1.9 shows the degree of balance in the CTA distribution across the SMs. Recall from Table 1.1 that the Tesla C2050 Fermi GPU has 14 SMs. Therefore, from the center of each circle, a line stretches in one of 14 directions to indicate that there is at least one CTA in the corresponding SM, and the length of the line indicates the number of CTAs assigned to that SM. If all 14 SMs are assigned the same number of CTAs (well balanced), the chart forms a perfect circle. Otherwise (unbalanced load distribution), the circle has a distorted shape. As shown, there is strong load imbalance in the CTAs assigned to different SMs, and hence a large fraction of core resources can be idle due to synchronizations.

The huge resource underutilization in the above motivational data indicates that GPU energy efficiency can be significantly degraded when executing traditionally CPU oriented applications. The underutilized resources consume a huge amount of passive static power without contributing to overall system throughput. Further, the overall system throughput is likely to be significantly, sometimes orders of magnitude, lower than the theoretical peak performance, and hence the energy efficiency can be severely curtailed. Therefore, improving resource utilization is an increasingly important and challenging task for GPUs to maintain their energy efficiency advantage over CPUs.

1.5 Thesis Statement

This thesis designs techniques to improve GPU energy efficiency by reducing resource underutilization. As a first step, we propose to improve GPU energy efficiency through fine grained power gating based on branch divergence patterns. However, fine grained power gating alone is not sufficient. Power gating saves static energy, but since the resources are designed and built into the hardware, it is a waste not to use them all. Therefore, gating is one way to tackle static energy, but not an ideal solution. To make the best use of the hardware, a solution that uses all the resources all the time is desired. Concurrent kernel execution, which we explore in depth in this thesis, is one such solution for improving resource utilization. For example, when a compute-intensive kernel and a memory-intensive kernel from different applications share a GPU, both the pipeline and the memory bandwidth are well utilized without compromising either kernel's performance. The overall throughput can be significantly enhanced if resources are allocated intelligently and the respective resource demands are taken into account to improve the concurrency of the GPU workloads. Therefore, this thesis also presents techniques for slicing SM resources across different competing kernels to dynamically improve GPU resource utilization.
Concurrent kernel execution, while effective in improving resource utilization, brings security concerns to the forefront. Concurrent kernels, possibly belonging to different processes, share GPU resources at a fine granularity. As a result, it becomes possible to exploit indirect information flow through the shared microarchitectural structures to conduct covert or side channel attacks. In particular, covert channel attacks create a communication channel through microarchitectural state and timing observations between two kernels to exchange sensitive data or to coordinate further attacks [160]. Therefore, this thesis also proposes new solutions to detect and defend against covert timing channels on GPUs.

In the following sections, we present an overview of each of the major contributions of this thesis.

1.5.1 Proposal #1: Pattern Aware Scheduling for GPU Power Gating

The power efficiency of GPUs can be significantly curtailed in the presence of divergence. Chapter 3 evaluates two important facets of this problem. First, the branch divergence behavior of various GPU workloads is studied. This thesis will show that only a few branch divergence patterns are dominant in most workloads; in fact, only five branch divergence patterns account for 60% of all the divergent instructions in our workloads. In the second part of the chapter, this branch divergence pattern bias is exploited to propose a new divergence pattern aware warp scheduler, called PATS. PATS prioritizes scheduling warps with the same divergence pattern so as to create long idleness windows for any given execution lane. An enhanced PATS++ scheduler is also proposed that further prioritizes warps using instruction type information. The architectural implementation details of PATS are then described, and the power and performance impact of PATS is evaluated. Using a comprehensive set of experiments, the thesis shows that the proposed design significantly improves the power gating efficiency of GPUs with minimal performance overhead.

1.5.2 Existing GPU Concurrent Kernel Support

[Figure 1.10: Task management hierarchy of GPUs. CPU-created and GPU-created grids enter per-stream software task queues (Stream 0 through Stream N); the Grid Management Unit (GMU) selects grids, and the Thread Block Scheduler assigns <grid, thread block> pairs to SMs.]

There is growing support for enabling concurrent kernel execution on GPUs. Modern GPUs have a hierarchy of task schedulers to maximize throughput in the presence of multiple processing requests. Figure 1.10 shows the organization of the task schedulers used in our baseline. GPU applications submit processing requests at the granularity of kernels, with launch parameters such as the number of thread blocks and the number of threads in each thread block.
All the thread blocks in an active kernel are managed by the Thread Block Scheduler (TBS). A primary function of TBS is to decide which thread blocks are assigned to an SM; in other words, TBS allocates ¡kernel,threadblock¿ to a given SM. Prior to the Fermi architecture, TBS can handle only one kernel at a time. In this model, there is no GMU and all kernel processing requests are scheduled purely by the device driver. In Fermi’s implementation, all streams are multiplexed in a single hard- ware queue. Therefore, a sequential order between streams is enforced since TBS sched- ules requests in the hardware queue in FIFO fashion. This results in false dependency among the streams, and therefore parallelism is limited depending on the launch order of streams. In the Kepler and Maxwell generation, up to 32 concurrent streams are mapped into multiple hardware work queues, which removes false dependency among concurrent streams; this technology is branded asHyper-Q [19]. On top of Hyper-Q, several task-level multiprocessing approaches have been pro- posed. With context funneling, a software scheduler can multiplex requests from differ- ent streams to maximize hardware scheduling efficiency [155, 173]. NVIDIA introduced Multi-Process Service (MPS) which is a task scheduling daemon that takes kernel execu- tion requests from multiple CPU applications and multiplexes into a single GPU context for facilitating GMU and TBS operations. Preemptive scheduling [145, 125] has also 17 been proposed for reducing turnaround time of requests. If a large kernel is occupying all the SMs for a long time, the waiting time of the other short kernels waiting inside the task queue may increase, therefore preemptively switching multiple kernels effectively reduces average waiting time of the kernels. Asynchronous compute is supported in recent Maxwell and Pascal architecture [20, 23], where a GPU can be statically partitioned into multiple subsets, one subset that runs graphics workload and the other subset runs compute workload. Recent GPUs also allow grid launch inside a GPU kernel code for reducing CPU-GPU context switching time if the amount of the work should be adjusted dynamically. HSA foundation co- led by AMD introduced a queue-base multiprogramming approach for heterogeneous architectures that include GPUs [60]. HSA-compatible GPUs that support TLBs adopt multiprogramming even in the SM level, but there is no publicly available documentation on how resource partitioning is done among applications. As such it is imperative to understand the best resource partitioning approach to maximize resource utilization while at the same time preventing conflicting resource demands. 1.5.3 Proposal #2: Intra-SM slicing for GPU Multiprogramming As can be seen, there is growing support for enabling multiprogramming in GPUs. In cur- rent computing environments there are a growing number of kernel processing requests and also a number of combinations of kernels waiting in the task queue. Therefore, an efficient multiprocessing scheme is increasingly important. Chapter 4 proposes dynamic intra-SM slicing [168], which is a technique that enables efficient sharing of resources within an SM across different kernels. A scalable intra-SM resource allocation algorithm across any number of kernels is presented. The goal of this algorithm is to allocate re- source to kernels so as to maximize resource usage while simultaneously minimizing the performance loss seen by any given kernel due to concurrent execution. 
This algorithm is similar to the water-filling algorithm [124] that is used in communication systems for equitable distribution of resources. The algorithm assumes oracle knowledge of each 18 application’s performance versus resource demands. But this oracle knowledge can be approximated by doing short on-line profiling runs to collect these statistics. The thesis presents a novel way to perform the profiling to efficiently identify how much resource is allocated to each of the competing kernels. Again through an extensive set of evaluations the work shows that the proposed dynamic partitioning technique significantly improves the overall performance of GPUs by 23%, fairness by 26% and energy consumption by 16% over a baseline policy that is currently used . 1.5.4 Security Problems of GPU Concurrent Kernel Execution A covert channel is a communication channel that employs timing or storage characteris- tics to transfer confidential information in a way that violates the system’s security policy while leaving no trail [13, 30]. One of key motivations to use covert channel rather than communicating directly using encrypted message is to avoid traffic analysis from trusted agencies. For example, a government agency may become alert about the motivation for using encrypted communication [13]. In this case, to evade government detection, it is better to use covert channel and never communicate directly across two parties. Co-residency is one of the key condition for enabling covert channels [159]. Recent research efforts have shown that it is not difficult to detect collocation of required VMs, even in big data centers [151] This collocation can be abused for manipulating the shared resources, leaving the major cloud service providers vulnerable to covert channel attacks. Despite many research efforts on defending covert channel attacks using shared system resources, such as CPU [47, 169], memory [69, 174] and network [13], there are few researches on GPU covert channel attacks. As many supercomputers, datacenters and cloud platforms are equipped with a large number of GPUs [12, 71, 1, 9], it is becoming feasible to manipulate the shared GPU resources to enable covert channels. Covert channel attacks are typically contention based, with one kernel occupying a resource or performing some operations to cause measurable delay to a concurrently 19 executing kernel. Thus, the behavior of one kernel (called atrojan) can be detected indi- rectly, typically through timing, by the other kernel (called a spy) enabling information to be encoded. A second form of indirect information flow attacks called side channel attacks. In these attacks, one kernel observes the timing variations due to contention with a victim to infer sensitive information about the victim’s computation. The critical dis- tinction is that the victim is not colluding to pass this information as in the case of covert channels. Therefore, side-channel attacks are typically harder to construct but easier to mitigate than the covert channel attacks. 1.5.5 Proposal #3: Eliminating Covert Channel Attacks on GPUs Chapter 5 will focus on providing a comprehensive solution called GPUGuard to mitigate covert channel attacks on GPUs, although our solution should also protect applications against side-channel attacks. GPUGuard will continuously monitor specific architectural features of running kernels and classify their behavior to identify suspicious contention. 
Once contention is identified, the second component of the solution separates contending kernels into separate security domains, uniquely possible in GPUs due to the inherent spatial parallelism available, to close the identified contention channels which may be used for covert communication. Security domains at different hierarchy levels are used to maximize sharing (and performance) when it is safe, but to close timing channels when there is a possibility that they exist. A few covert channel detection schemes have been proposed on CPUs. CC-hunter [47] detects covert channel attacks on a CPU memory bus, integer divider and shared L1 cache based on recurrent conflict patterns. With the large number of threads and resources on a GPU, it is not clear that such a solution which correlates contention behavior for each resource independently can scale. Replay confusion [169] relies on a record-and-replay framework. It detects timing attacks by replaying execution on a different cache configu- ration to detect contention only on caches. In contrast to these works, our solution moni- tors contention on all known resources. Moreover, these solution stop after contention is 20 detected. In contrast, our defense also provides mitigation for the covert channels: once contention is detected, the technique will take advantage of the available parallelism in the GPU to create isolated security domains to close timing channels while maximizing sharing when it is safe to do so. An attractive feature of our solution is that a false nega- tive in the detection can lead to a small performance penalty but will not lead to stopping the applications from continuing to run. There are also a number of defenses against timing attacks that seek to equalize the performance of a shared resource to make contention unobservable [174, 69, 172, 138]. The focus is often on protecting a single resource such as memory or the L2 cache; in a covert channel setting attackers will simply shift to use a different resource if one is available. Moreover, many solutions require turning off hyper threading or simultaneous multithreading (SMT) to minimize resource sharing within a core. However, disabling SMT is likely to be impractical on GPUs because of three reasons: (1) GPUs rely on dozens of warps (a collection of 32 threads) concurrently sharing an SM to hide the long memory latencies (2) GPUs have vast number of execution resources (hundreds of execution units, special function units), different types of caches (constant cache, texture cache). Much of the architectural context of a warp is in fact stored within an SM. Hence, preemption of one kernel to protect another kernel is a very expensive endeavor, particularly with false positives. (3) Allowing only one kernel to execute has already been shown to reduce resource utilization and there is a general trend towards supporting multi-kernel execution on GPUs already. We evaluate GPUGuard on a number of constructed covert-channel and side-channel attacks running on their own or in combination with normal GPU workloads. Our solu- tion is able to detect covert and side-channel attacks with 93% accuracy in our experi- ments. To protect against contention for caches, once an attack is suspected, GPUGuard opportunistically takes idle cache resources to separate contention, or in some cases by- passes caches for one kernel to create separation. 
Once the contending applications are separated, GPUGuard improves performance by 54% over a baseline defense of temporal partitioning, while reliably closing the timing channels. GPUGuard is slower than an insecure baseline by 30% (geometric mean), which is typical for protections against timing channels. However, if no contention is detected, there is no performance loss. We also evaluate the hardware complexity of the solution, showing that it consumes an additional 0.3% area and around 1% additional power.

1.6 Dissertation Organization

The rest of the thesis is organized as follows. Chapter 2 performs a detailed characterization of the resource underutilization issue in GPUs while executing one class of irregular applications, namely graph applications. Chapter 3 describes a branch divergence pattern aware scheduling and power gating scheme to enhance power gating opportunities for underutilized execution units in GPUs. Since power gating alone is not a sufficient solution to the resource underutilization issue, an architectural framework is further developed for concurrent kernel execution on GPUs. Chapter 4 presents a dynamic resource partitioning technique inside an SM to improve GPU concurrency. While intra-SM slicing makes concurrent kernel execution on GPUs effective, it also brought to the forefront the problem of information leakage across kernels. Hence, the next research thrust focused on thwarting side-channel and covert-channel attacks from co-executing kernels on GPUs. Chapter 5 describes GPUGuard, a machine learning based threat detection and defense technique to enable secure execution of concurrent kernels on GPUs. Chapter 6 concludes the thesis with retrospective thoughts.

Chapter 2 Resource Underutilization on GPUs: A Case Study on Graph Applications

2.1 Chapter Overview

In the previous chapter we presented some motivational data for how resources are underutilized when GPUs run irregular applications. In this chapter we dive deeper to demonstrate the severity of the resource underutilization problem when executing a particular class of irregular applications, namely graph applications. We show how graph processing leads to more pipeline stalls and high load imbalances across the SMs in a GPU. We analyze the reasons for these stalls and present several design enhancements to enable efficient execution of graph applications on GPUs.

2.1.1 Graph Processing on GPUs

Large graph processing is now a critical component of many data analytics. Graphs have traditionally been used to represent the relationship between different entities and have been the representation of choice in diverse domains, such as web page ranking, social networks, tracking drug interactions with cells, genetic interactions, and communicable disease spreading. As computing becomes widely available at very low cost, processing large scale graphs to study the vertex interactions is becoming an extremely critical task in computing [110, 96, 62, 112]. Many graph processing approaches [112, 4] use the Bulk Synchronous Parallel (BSP) model [150]. In this execution model, graphs are processed in synchronous iterations, called supersteps. At every superstep, each vertex executes a user function that can send or receive messages, modify its own value or modify the values of its edges. There is no defined order in which the vertices are handled within each superstep, but at the end of each superstep all vertex computations are guaranteed to be completed.
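To make the vertex-centric superstep model concrete, the sketch below shows, in CUDA, one way a superstep could be expressed on a GPU. This is an illustrative example rather than code from any of the frameworks discussed here; the kernel name, the CSR arrays (row_ptr, col_idx) and the label/changed buffers are all hypothetical, and the per-vertex function simply propagates the minimum label of a vertex's neighbors.

```cuda
// Minimal sketch (not from the thesis) of a vertex-centric superstep in CUDA.
// Assumes a CSR graph layout; all names (superstep_kernel, row_ptr, col_idx,
// label_in/out, changed) are hypothetical. Each thread plays the role of one
// vertex's compute() function: here it adopts the minimum label of its
// neighbors, as in a connected-components style computation.
__global__ void superstep_kernel(const int *row_ptr, const int *col_idx,
                                 const int *label_in, int *label_out,
                                 int *changed, int num_vertices) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= num_vertices) return;

    int best = label_in[v];
    // Loop over this vertex's edges; the trip count depends on its degree.
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
        int nbr = col_idx[e];
        best = min(best, label_in[nbr]);
    }
    label_out[v] = best;
    if (best != label_in[v]) *changed = 1;   // signal that another superstep is needed
}
```

Each launch of such a kernel corresponds to one superstep; the kernel boundary provides the guarantee that all vertex computations of the previous superstep have completed.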
Graph processing has been parallelized to run on large compute clusters using well-known cluster compute paradigms such as Hadoop [5] and MapReduce [52] that have been re-targeted to run graph applications. More recently, graph processing-specific computing frameworks, such as Pregel [112] and Giraph [4], have also been proposed. Pregel, for instance, relies on vertex-centric computing. An application developer defines a vertex:compute() function which specifies the computation that will be performed at each vertex. The computation can be as simple as finding the minimum value of all the adjacent vertex values, or it could be a more complex function that simulates how a protein may fold when interacting with amino acids [37].

Within each superstep, or within each MapReduce iteration, there is a significant amount of parallelism as the same computation is performed across all vertices in the graph. Today this parallelism is exploited primarily through cluster-based computing where multiple compute nodes concurrently process different subgraphs. Given the loose synchronization demand and repeated computations on vertices, the SIMT parallel execution model supported by GPUs provides new venues to significantly increase the power efficiency of graph processing. By mapping a vertex computation to a SIMT lane, or to a set of SIMT lanes called a warp or wavefront, large graphs can be efficiently processed. Several studies showed how to efficiently map graph applications to GPUs [139, 152, 72, 31, 68, 82]. However, there have not been many characterization studies conducted to understand how graph applications interact with GPU-specific microarchitectural components, such as SIMT lanes and warp schedulers. To optimize GPUs for graph processing, understanding the graph application's characteristics and the corresponding hardware behavior is important.

Che et al. [45] recently characterized graph applications running on an AMD Radeon HD 7950 GPU. They implemented eight graph applications in OpenCL and analyzed hardware behaviors such as cache hit ratio, execution time breakdown, and speedup over CPU execution. In this chapter, we collected 12 graph applications written in CUDA from various sources and executed them on an NVIDIA GPU as well as on a similarly configured cycle-accurate simulator. We then provide in-depth application characterization and analyze the interaction of the applications with the underlying microarchitectural blocks through a combination of hardware monitoring with performance counters and software simulation. In order to highlight graph applications' unique execution behavior, we also ran a set of non-graph applications on GPUs and compared the behavior of the two sets of applications. The following are the contributions of this work:

• We compiled 12 graph applications written in CUDA. To avoid bias toward a particular programming style, we acquired the applications from a broad range of sources. We measured various GPU architectural behaviors while running the graph applications on real hardware. As the real machine profiler provides only a limited set of hardware monitoring capabilities, we also used a cycle-accurate GPU simulator to understand the impact of warp schedulers, performance bottlenecks and load imbalance across SMs, CTAs and warps.

• To differentiate graph applications' unique characteristics, we also executed a set of non-graph applications on the same platforms. We then compare and contrast the various performance and resource utilization measures.
• We discuss several design aspects that need to be considered in the GPU hardware design for more efficient graph processing. 25 The remainder of this chapter is organized as follows. Section 2.2 describes our eval- uation methodology and the graph and non-graph applications used in the experiment. Then, we characterize and analyze the GPU hardware behaviors by using the evalua- tion results in Section 2.3. We discuss possible hardware optimizations in Section 2.4. Section 2.5 describes the related work. 2.2 Methodology 2.2.1 Graph Applications In this section, we briefly describe the benchmark suite of graph algorithms we evaluated in this study. Many of the benchmarks are collected from recent research papers which implemented state of art algorithms for a variety of graph processing demands. • Approximate Graph Matching (AGM) is used to find maximal independent edge set in a graph. It has applications in minimizing power consumption in wireless networks, solving traveling sales person problem, organ donation matching pro- grams, and graph coarsening in computer vision. The version of AGM used here is a fine-grained shared-memory parallel algorithm for greedy graph matching [32]. • All Pairs Shortest Path (APSP) is used to find the shortest path between each pair of vertices in a weighted graph. The standard algorithm for solving the APSP problem is the Floyd-Warshall algorithm [59]. Buluc ¸ et al., showed that APSP problem is computationally equivalent to computing the product of two matrices on a semiring [40]. The APSP algorithm we selected in this study uses this more efficient implementation. • Breadth First Search (BFS) is a well known graph traversal algorithm. The par- allel implementation of BFS is widely available. We use the version in the Rodinia benchmark [46]. 26 • Graph Clustering (GCL) is concerned with partitioning the vertices of a given graph into sets consisting of vertices related to each other. It is a ubiquitous sub- task in many applications, such as social networks, image processing and gene engineering. GCL used in this chapter refers to the implementation in [31] using a greedy agglomerative clustering heuristic algorithm. • Connected Component Labeling (CCL) involves identifying which nodes in a graph belong to the same connected cluster or component [72, 87, 121]. It is widely used in simulation problems and computer vision. CCL here refers to im- plementation of prior work using label equivalence method [72]. • Graph Coloring (GCO) partitions the vertices of a graph such that no two adja- cent matrices share the same color. There are several known applications of graph coloring such as assigning frequencies to wireless access points, register allocation at compile time, and aircraft scheduling. Graph coloring is NP-hard, therefore, a number of heuristics have been developed to assign colors to vertices, such as first fit, largest degree order and saturation degree order. GCO selected in this study uses a parallelized version of first fit algorithm presented in [68]. • Graph Cuts (GCU) partitions the vertices of a graph into two disjoint subsets that are joined by at least one edge. It can be employed to efficiently solve a wide variety of low level computer vision problems, such as image segmentation, stereo vision, image restoration. The maxflow / mincut algorithm to compute graph cuts is computationally expensive. 
The authors in [152, 153] proposed a parallel implementation of the push-relabel algorithm for graph cuts which achieves higher performance, which is used in this study. • Maximal Independent Set (MIS) finds a maximal collection of vertices in a graph such that no pair of vertices is adjacent. It is another basic building block for many graph algorithms. MIS here refers to the standard cusp implementation [6] using Luby’s algorithm [109]. 27 • Minimum Spanning Tree (MST) finds a tree that connects all the vertices to- gether with minimum weight. It has applications in computer networks, telecom- munication networks, transportation networks, water supply networks and smart electrical grid management. This benchmark computes a minimum spanning tree in a weighted undirected graph using Boruvka’s algorithm [41]. • Page Rank (PR) is an algorithm used by Google to rank websites. PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. The implementation uses Mars MapReduce framework on GPU [139, 73]. • Survey Propagation (SP) The satisfiability (SAT) problem is the problem of de- termining if there exists an interpretation that satisfies a given Boolean formula. Survey propagation [39] is heuristic SAT-solver based on Belief Propagation(BP), which is a generic algorithm in probability graph. It is implemented in Lones- tarGPU benchmark [41]. • Single Source Shortest Path (SSSP) computes the shortest path from a source node to all nodes in a directed graph with non-negative edge weights by using a modified Bellman-Ford algorithm [41, 36]. 2.2.2 Non-graph Applications In addition to the above described graph applications, we evaluated 9 benchmarks from non-graph application domains from the Rodinia benchmark [46] and NVIDIA SDK [10] suites. The benchmarks are selected to cover a wide range of application domains: LU decomposition (LUD), matrix multiplication (MUL) are dense liner algebra applications, which manipulate dense matrices. Discrete Cosine transform (DCT) and Heartwall (HW) are image processing applications. Hotspot (HS) is a physical simulation application which is used to do thermal simulation to plot the temperature map of processors. We 28 also included statistics and financial applications such as Histogram (HIST) and Bino- mial options (BIN). Finally, we included widely used parallel computing primitives scan (SCAN) and reduction (RDC) to represent a large range of applications from parallel application developers. 2.2.3 Experimental Environment GPU Model Tesla M2050 [104] Core 14 CUDA SMs@1.15GHz Memory 2.6GB, GDDR5@1.5GHz Comm. PCI-E GEN 2.0 Simulator Version GPGPU-Sim v3.2.2 [34] Configs Tesla C2050 Core 14 CUDA SMs@1.15GHz Memory GDDR5@1.5GHz L1D 16KB RF 128KB Const 8KB Shared 48KB L2D 786KB Table 2.1: Experimental environment Each of the selected applications were written in CUDA. We only modified the make- files to change the compilation flags for collecting specific data from GPU hardware that we will describe shortly. We ran these applications both on the native hardware as well as on a cycle accurate GPU simulator. Hardware measurements are performed on Tesla M2050 GPU, which has 14 CUDA SMs running at 1.15GHz. Meanwhile, simulation studies are performed on GPGPU-Sim [34] to collect detailed runtime statistics that were not possible in hardware measurements. As shown in Table 2.1, GPGPU-Sim is config- ured with Tesla C2050 parameters, which has the same architecture as M2050. 
The only difference between the two GPUs is that they have different heat sinks. We run multiple experiments with different input sets for each application, as shown in Table 2.2. The input sets differ in size and category; some of the input sets are obtained from DIMACS [33], some are obtained from the Florida Matrix Collection [51] and others come with the original application.

AGM: coAuthor (300K*300K), in2004 (13M*13M)
APSP: 1K (1K-V, 6K-E), 4K (4K-V, 24K-E)
BFS: 4096 (4K), 65536 (64K), 1MW (1M)
GCL: email (1K*5K), coAuthor (300K*300K), belgium (1.4M*1.4M)
CCL: logo (2.7MB), trojan (5.5MB)
GCU: flower (4.7MB), person (4.7MB)
GCO: hood [51] (215K-V, 5.2M-E), pwtk [51] (212K-V, 5.7M-E)
MST: rmat12 [33] (4K-V, 165K-E), fla [33] (1M-V, 2.6M-E)
PR: wiki (7K-V, 104K-E), edge14 (16K-V, 256K-E)
SP: 17K (4K-V, 17K-E), 42K (10K-V, 42K-E)
MIS: 128 (128*128), 512 (512*512)
SSSP: fla [33] (1M-V, 2.6M-E), rmat20 [33] (1.1M-V, 45M-E)
Table 2.2: 12 graph applications collected from various sources (V: vertices, E: edges)

2.3 Motivational Analysis

2.3.1 Kernel Execution Pattern

First of all, we measured the number of kernel function calls that were invoked by the CPU on the GPU when executing on native hardware. In Figure 2.1a and Figure 2.1b, the bars labeled KERNEL indicate the total number of kernel functions invoked during the execution of each graph and non-graph application. The average number of kernel invocations is nearly an order of magnitude higher in graph applications (about 300 invocations) compared to non-graph applications (about 25 invocations). Graph applications require frequent synchronization: for instance, after each superstep in the BSP model all vertex computations must return to the CPU for synchronization before starting the next superstep. Thus graph applications require frequent CPU interventions to provide synchronization capability for both BSP and MapReduce-style graph computations.

Figure 2.1: Kernel function invocation count for graph and non-graph applications ((a) graph applications, (b) non-graph applications; bars show KERNEL and PCI counts).

The amount of computation done per kernel invocation is significantly smaller in graph applications than in non-graph applications. The first two bars in Figure 2.2 show the total execution time spent while executing all the kernel invocations (labeled TOTAL KERNEL) and the average amount of time spent per kernel invocation (labeled EACH KERNEL). In graph applications the per-kernel execution time is only 24% of the per-kernel time spent in non-graph applications. Thus non-graph applications, at least the applications that we evaluated, execute relatively large functions on each kernel invocation from the CPU and require fewer CPU interventions.

One negative side effect of communicating frequently with the CPU is that in current systems the CPU and GPU communicate via the PCI interface. Thus even short messages require long latencies to communicate over PCI. In Figure 2.1a and Figure 2.1b, the bars labeled PCI show the number of cudaMemcpy function calls that use PCI to transfer data between the CPU and GPU. Graph applications interact with the CPU nearly 20X more frequently than non-graph applications.
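The kernel-invocation and cudaMemcpy counts reported above typically arise from a host-side driver loop of roughly the following shape. This is an illustrative sketch, not code taken from the benchmarks; superstep_kernel is the hypothetical kernel sketched earlier in Section 2.1.1, and the device buffers are assumed to have been allocated elsewhere with cudaMalloc.

```cuda
// Illustrative host-side driver loop (not from the benchmarks).
#include <cuda_runtime.h>
#include <utility>

void run_supersteps(int *d_row_ptr, int *d_col_idx, int *d_label_in,
                    int *d_label_out, int *d_changed, int num_vertices,
                    int num_blocks, int threads_per_block) {
    int h_changed = 1;
    while (h_changed) {
        h_changed = 0;
        // One small host-to-device transfer per superstep just to clear the flag.
        cudaMemcpy(d_changed, &h_changed, sizeof(int), cudaMemcpyHostToDevice);

        // One kernel launch per superstep; the launch boundary is the global barrier.
        superstep_kernel<<<num_blocks, threads_per_block>>>(d_row_ptr, d_col_idx,
                                                            d_label_in, d_label_out,
                                                            d_changed, num_vertices);

        // One small device-to-host transfer per superstep to test for termination;
        // the 4-byte copy still pays the full PCI-E round-trip latency.
        cudaMemcpy(&h_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);

        std::swap(d_label_in, d_label_out);   // re-deploy data for the next superstep
    }
}
```

Every trip around this loop launches a kernel and crosses the PCI-E bus twice for a tiny flag, which is exactly the pattern that inflates the KERNEL and PCI counts for graph applications.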
Graph applications tend to transfer data almost once per two kernel invocations, while the non-graph applications execute an average of ten kernels without any extra data transfer when multiple kernels are executed. The primary reason for the large communication overhead is that graph applications use the kernel invocation as a global synchronization. Whenever an SM finishes its processing on the assigned vertices, the next set of vertices to process can be determined only when all the other SMs finish the work assigned to them in the kernel, because there can be dependencies among the vertices processed in multiple SMs. However, as GPUs do not support any global synchronization mechanism across the SMs, graph applications are typically implemented to call a kernel function multiple times and use each kernel invocation as a global synchronization point.

Figure 2.2: Average kernel function execution times (TOTAL KERNEL, EACH KERNEL, TOTAL PCI and EACH PCI, for graph and non-graph applications).

In addition to synchronization overheads, once a kernel function is complete the output data needs to be properly re-deployed in the GPU memory so that the vertices processed in the next kernel can read their data appropriately. At the end of each kernel execution some vertices may simply stop further computation because they have reached a termination condition. However, these termination condition checks are done at the end of each superstep by making sure no other vertex has sent a message to the given vertex. Thus a vertex computation can only be terminated by the CPU after it has processed the results of the current superstep from all SMs. The net result of all these frequent CPU-GPU interactions is that the total time spent on PCI transfers is higher in graph applications, as can be seen in the Figure 2.2 bar labeled TOTAL PCI. Since graph applications invoke many more PCI transfers but each call transfers only a small amount of data, the time per PCI transfer is 10X smaller than in non-graph applications, as can be seen in Figure 2.2 (bar labeled EACH PCI).

2.3.2 Performance Bottlenecks

Figure 2.3: Breakdown of reasons for pipeline stalls ((a) graph applications, (b) non-graph applications; categories: pipeline idle, EXE structure hazard, short RAW hazard, long memory latency, control hazard or I-buffer empty, atomic operation, barrier synchronization, functional done).

In this section we focus on where the performance bottlenecks are while executing the selected applications on GPUs. For this purpose we monitored pipeline stalls in our GPGPU-Sim simulator and analyzed, for each pipeline stall, what the primary reason for the stall was. Figure 2.3 shows the breakdown of the various pipeline stall reasons. Note that the warp scheduling policies on GPUs allow multiple warps to be concurrently alive in various stages of the pipeline. Hence, in a given cycle, each warp could stall in a pipeline stage for a different reason: several of the warps could wait on long memory latency operations, while others could wait on barrier synchronization.
We weight the individual contribution of each pipeline stall reason by the number of warps stalled due to that reason. Figure 2.3(a) and Figure 2.3(b) plot the graph applications and non-graph applications individually. On average, long memory latency is the biggest bottleneck and causes over 70% of all pipeline stalls in graph applications. The execution structure hazard and short RAW data dependency are the second and third most prominent bottlenecks. The non-graph applications, on the other hand, exhibit a different distribution of pipeline stall reasons: the execution structure hazard is the biggest performance bottleneck. Graph applications by nature work with large datasets, and hence they are likely to experience more memory related stalls. Apart from the larger dataset size, the CPU has to transfer data to the GPU frequently and re-deploy vertex data, potentially at different locations, between two successive supersteps (or kernel calls). Hence, the data cached in one superstep is unlikely to be useful in the next superstep. As we will show later with cache miss statistics, graph applications indeed suffer from higher miss rates. Due to this ineffective cache usage, the impact of long memory latency is higher in the graph applications than in the non-graph applications. Nonetheless, it is clear that across both graph and non-graph applications the pipeline is stalled for a significant fraction of time. These stalls waste power, particularly static power, and also lead to poor utilization of resources.

2.3.3 SRAM Resource Sensitivity

We collected the demand on SRAM resources by graph and non-graph applications from our simulation infrastructure. The three largest SRAM resources on GPUs are the register file, shared memory and constant memory. As listed in Table 2.1, in the simulated GPU configuration, the sizes of the register file, shared memory, and constant memory are 128KB, 48KB, and 8KB, respectively. Figure 2.4 shows the utilization of the register file, shared memory, and constant memory per SM as a percentage of the total size.

Figure 2.4: Utilization of SRAM structures ((a) graph applications, (b) non-graph applications; register file, shared memory and constant memory utilization).

We used the compiler option -Xptxas="-v" to collect the usage of the three SRAM structures per thread in an application. Then, the total usage per SM is calculated by using the number of concurrent CTAs per SM and the total number of threads within a CTA.
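As a concrete illustration of that calculation, the sketch below shows the arithmetic behind the per-SM numbers in Figure 2.4. The input values here are hypothetical, not measurements from the benchmarks; the per-thread register count is the kind of figure ptxas reports when compiled with -Xptxas="-v".

```cuda
// Illustrative sketch of the per-SM SRAM utilization arithmetic (hypothetical values).
// nvcc -Xptxas="-v" reports, per kernel, the registers per thread and the static
// shared memory per CTA; the per-SM totals then follow from the launch geometry.
#include <cstdio>

int main() {
    const int regs_per_thread  = 21;    // from ptxas output (hypothetical)
    const int smem_per_cta     = 0;     // bytes of shared memory per CTA (hypothetical)
    const int threads_per_cta  = 256;
    const int ctas_per_sm      = 6;     // concurrent CTAs resident on one SM

    // Each register is 4 bytes; the simulated SM has a 128KB register file
    // and 48KB of shared memory (Table 2.1).
    const int reg_bytes_per_sm  = regs_per_thread * 4 * threads_per_cta * ctas_per_sm;
    const int smem_bytes_per_sm = smem_per_cta * ctas_per_sm;

    printf("register file: %.1f%% of 128KB\n", 100.0 * reg_bytes_per_sm / (128 * 1024));
    printf("shared memory: %.1f%% of 48KB\n",  100.0 * smem_bytes_per_sm / (48 * 1024));
    return 0;
}
```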
If an application wants to use the shared memory then it must first execute a special move instruction to move the data from global memory to shared memory. Once the data is in the shared memory then the application can issue a second load instruction to bring the data into the execution unit. If there is not enough reuse of data then moving data from global memory to shared memory and then to the execution lanes actually consumes more time than simply loading data directly from global memory. Thus in the absence of sufficient data reuse, shared memory access only increases the memory access time as well as the instruction count as the data in the global memory needs to be loaded to the shared memory first by a load instruction and then another load instruction should be executed to get the data from the shared memory to the register. Therefore, in the relatively short kernel functions that are used in graph applications, it is hard to effectively leverage the shared memory. Thus graph applications do not try to exploit the shorter latency shared memory and instead simply load data from global memory. This memory usage statistic also explains why the main performance bottleneck of graph applications is the long memory latency as discussed in the previous section. Constant memory is a region of global memory that is cached to the read-only con- stant cache. Given that GPU’s L1 cache size is relatively small, maintaining some re- peatedly accessed read-only data in constant cache helps to conserve memory bandwidth. However, constant cache is typically the smallest among the SRAM structures embedded in the GPU die as specified in Table 2.1. Therefore, to gain performance benefits from using constant cache, programmers should carefully decide which data to store in the constant memory. If the data structure is too big, it may not benefit from using constant memory due to frequent constant cache replacements. Such data maybe better accessed directly from global memory. In the large graph processing applications where giga bytes or tera bytes of data are processed, it is hard to fit the data structures in the small con- stant cache. Thus, graph application developers are less inclined to use constant memory. 36 Even in the non-graph applications developers do not seem to pay sufficient attention to carefully managing constant cache and hence the overall utilization of constant cache is quite poor. 0.98 1.85 0 2 4 6 8 10 AGM_coAuthor AGM_in2004 APSP_1k APSP_4k BFS_1MW BFS_4096 BFS_65536 CCL_logo CCL_trojan GCL_belgium GCL_coAuthor GCL_email GCO_hood GCO_pwtk GCU_flower GCU_person MIS_128 MIS_512 MST_fla MST_rmat12 PR_wiki PR_edge14 SP_17k SP_42k SSSP_fla SSSP_rmat20 Avg Normalized IPC None/768KB 32KB/1.5MB 64KB/3MB 4MB/192MB Figure 2.5: Normalized IPC w.r.t. cache size We also measured the performance impact of L1 and L2 caches. The default per SM L1 and per device L2 cache sizes are 16KB and 768KB, respectively. We also evaluated the impact of using different cache sizes, no L1 + 768KB L2, 32KB L1 + 1.5MB L2, 64KB L1 + 3MB L2, and finally an extreme data point of 4MB L1+196MB L2. Obvi- ously the last design option was explored as a near limit study of the importance of very large cache. As the performance impact of caches can depend on the input size, we mea- sured the performance impacts by varying the inputs for each application. For example, AGM coAuthor and AGM in2004 are the executions of AGM but using different inputs; the input size is listed in Table 2.2. 
Figure 2.5 shows the normalized IPC under various cache configurations that were listed above. The Y-axis is normalized IPC over the IPC of default cache configuration. Figure 2.6 shows the cache miss rate during the kernel execution under the default cache configuration but with varying input size. We show the individual result for graph applications, but simply plot the average of all the 9 non-graph applications in the last bar in the figure (labeled asNG on X-axis) for comparison. 37 0% 20% 40% 60% 80% 100% AGM_coAuthor AGM_in2004 APSP_1k APSP_4k BFS_1MW BFS_4096 BFS_65536 CCL_logo CCL_trojan GCL_belgium GCL_coAuthor GCL_email GCO_hood GCO_pwtk GCU_flower GCU_person MIS_128 MIS_512 MST_fla MST_rmat12 PR_wiki PR_edge14 SP_17k SP_42k SSSP_fla SSSP_rmat20 Avg NG L2 Cache Miss Rate L1D Cache Miss Rate Figure 2.6: Cache miss rate As can be seen in Figure 2.6, almost all the graph applications have fairly high L1 cache miss rates (average of 70% L1 accesses encounter misses). Given such an ex- tremely high cache miss rate, we conducted a study without any L1 cache while keeping the L2 cache size at 768KB per device; we measured the performance without using L1 cache as plotted in the first bar charts named None/768KB in Figure 2.5. Interestingly, the IPC difference between the default and zero L1 cache configurations is only 2%. This means that L1 cache is entirely ineffective for graph processing. We then measured the performance by doubling the size of both caches with L1 cache size reaching up to 4MB. The IPC improvement of larger cache size is quite small on most applications, except for applications such as SP and GCO that do take advantage of larger L1 cache, as can be seen in Figure 2.5. Even with a 4MB L1, that is plotted in the last bar named4MB/192MB, most of the applications derive small IPC increase. The reason for this ineffectiveness of caches can be inferred from the fact that between two kernel invocations, the CPU has to do memory transfer on GPU memory. Hence, each kernel invocation essentially looses any cache locality that was present at the end of prior kernel invocation. 38 0.61 0% 20% 40% 60% 80% 100% AGM APSP BFS CCL GCL GCO GCU MIS MST PR SP SSSP Avg (a) Graph application 0.10 0% 20% 40% 60% 80% 100% BIN DCT HW HIST HS LUD MUL RDC SCAN Avg (b) Non-graph application Figure 2.7: L1 cache misses as a fraction of total accesses to the GPU L1 cache and shared memory Interestingly, graph applications derive lower cache miss rate than non-graph appli- cations as plotted in the last two bar charts in Figure 2.6, which is inconsistent with the findings of this chapter. However, we found that non-graph applications’ active use of shared memory decreases the total accesses to L1 cache, thereby the cache miss rate be- comes relatively high. Recall that once a data is loaded from global memory to shared memory, the application reads the data from shared memory and never accesses global memory for the data. As a result, L1 cache encounters a cold miss while loading the data from global memory to shared memory at the first read on the data and never expe- rience hit on the corresponding cache line. Note that loading a data from global memory to shared memory is not treated specially by the GPU hardware but is implemented by using a load instruction that reads global memory and writes the loaded data back to a register and a store instruction that stores the register value to the shared memory address. Therefore, L1 cache is accessed while executing the load instruction. 
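The staging pattern described above looks roughly as follows in CUDA. This is a minimal sketch for illustration only, not code from the evaluated applications; it assumes a block size of at most 256 threads.

```cuda
// Minimal sketch of explicit global-to-shared staging (not from the benchmarks).
// The first read of in[] is a global load that goes through the L1 cache (a cold
// miss on first touch); the value is then stored into shared memory, and all
// further reads are served by shared memory without touching L1 again.
__global__ void staged_sum(const float *in, float *out, int n) {
    __shared__ float tile[256];            // assumes blockDim.x <= 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Global load (L1 access) into a register, then a store into shared memory.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Reuse happens out of shared memory; without such reuse, the extra staging
    // step only adds instructions and latency compared to loading directly from
    // global memory.
    float sum = 0.0f;
    for (int j = 0; j < blockDim.x; ++j)
        sum += tile[j];
    if (i < n) out[i] = sum;
}
```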
However, once a data is stored to the shared memory, L1 cache is no more accessed for the data. This fact becomes obvious once we plot the cache miss rate as a fraction of the total accesses to shared memory and L1 cache combined. Figure 2.7 plots this data for both graph and non-graph applications. It is clear that this metric shows a vastly lower cache miss rate for non-graph applications due to their overwhelming accesses to shared memory. 39 2.3.4 SIMT Lane Utilization We also compare the SIMT lane utilization of graph and non-graph applications. Fig- ure 2.8 shows the breakdown of SIMT lane utilization while running the applications. The bar labeled M0 8 plots the fraction of instructions that are executed on at most 8 SIMT lanes. Similarly, M9 16 shows the fraction of instructions that are executed on 9 to 16 SIMT lanes and so on. M32 denote the portion of instructions that fully utilize the available 32 SIMT lanes. In all the graph applications except APSP, as shown in Figure 2.8(a), the SIMT lane utilization varies considerably. For example, in SSSP, over 87% of instructions are executed by only 0 to 8 SIMT lanes, which means the remain- ing 24 SIMT lanes are idle during the 87% of the execution time. On the other hand, non-graph applications plotted in Figure 2.8(b) tend to use SIMT lanes more effectively. Five among nine non-graph applications used in the experiment used all the 32 available SIMT lanes 100% of the warp execution time. 0% 50% 100% AGM APSP BFS CCL GCL GCO GCU MIS MST PR SP SSSP Avg M0_8 M9_16 M17_24 M25_31 M32 (a) Graph application 0% 20% 40% 60% 80% 100% BIN DCT HW HIST HS LUD MUL RDC SCAN Avg M0_8 M9_16 M17_24 M25_31 M32 (b) Non-graph application Figure 2.8: SIMT lane utilization 40 The nature of graph applications leads them to have variable amounts of parallelism during their execution. The vertices in a graph have different degrees (i.e. the number of edges). One typical way of graph application implementation is to run a loop for each vertex to process one edge per iteration. If each vertex is processed by one SIMT lane so that multiple vertices are processed by a warp in parallel, then the number of iterations executed by each SIMT lane varies as the degree of each vertex varies. This leads to a significant diverged control flow. Thus the SIMT lane utilization varies significantly in graph applications. 2.3.5 Execution Frequency of Instruction Types 0% 20% 40% 60% 80% 100% AGM APSP BFS CCL GCL GCO GCU MIS MST PR SP SSSP Avg NG %Int Inst %FP Inst %LDST inst %SFU Inst Figure 2.9: Execution frequency of instruction types Figure 2.9 shows the instruction type breakdown among the executed instructions. In almost all the graph applications, the dominant instruction type is integer instruction. The memory instruction is the second most frequently executed instruction type. Similar pattern is also found from non-graph applications that is shown in the last bar chart named NG. Thus, the execution time differences between graph and non-graph applications are not influenced by the instruction mix, rather the memory subsystem plays a significant role in these differences. 41 2.3.6 Coarse and Fine-grain Load Balancing We measured load balancing in two ways: coarse and fine-grain. We measured coarse- grain load balancing as the number of CTAs assigned to each SM. For the fine-grain load balancing, we collected two metrics. The first metric is the execution time difference across CTAs. 
Since each CTA has a different amount of computation based on the number of vertices and edges processed, the execution time of CTAs can vary. The second fine-grain metric measures the execution time variance across warps within a CTA. The execution time variations of warps and CTAs can have a significant negative impact on performance due to GPU execution model constraints. A kernel can be terminated only when all the assigned CTAs finish their execution. Likewise, a CTA's execution can only finish when all the warps within the CTA finish their work. Therefore, performance is highly dependent on the few warps or CTAs that have the longest execution times. Hence, coarse and fine-grain load balancing is critical for performance on GPUs.

Figure 2.10 shows the degree of balance in the CTA distribution across the SMs. From the center of each circle, a line stretches in one of 14 directions to indicate that there is at least one CTA in one of the 14 SMs in the GPU. The length of the line indicates the number of CTAs assigned to the SM. If all 14 SMs are assigned the same number of CTAs (well balanced), the chart forms a perfect circle. Otherwise (unbalanced load distribution), the circle has a distorted shape. Due to space limitations, we only present the load balancing graphs of the last kernel executed.

The SM level imbalance shown in Figure 2.10 depends on input size and program characteristics. Let us assume there are m SMs and a maximum of n CTAs can be assigned to one SM. The default CTA scheduling policy is round robin. If a kernel to be scheduled fits in exactly m*n CTAs, there would be exactly n CTAs assigned per SM at the start of kernel execution. Such a perfect balance is seen in the GCO, MST and SP benchmarks, which have perfect circles. On the other hand, if a kernel to be scheduled has fewer than m*n CTAs, the CTAs are assigned unevenly across SMs. For example, RDC has 64 CTAs in total and can schedule a maximum of 6 CTAs per SM: it assigns 5 CTAs on 8 SMs and 4 CTAs on the remaining 6 SMs.

Figure 2.10: Coarse grained load distribution: # assigned CTAs across SMs ((a) graph applications, (b) non-graph applications).

The last case is when the number of CTAs in a kernel far exceeds the m*n CTAs that can be scheduled at the start of a kernel execution. For such a large kernel, CTAs are initiated continuously onto the SMs after a previously scheduled CTA finishes execution. In such a scenario, two factors play opposing roles in balancing CTA assignments. On one hand, as the number of CTAs increases, there is a higher likelihood of assigning a similar number of CTAs per SM. Typically the number of CTAs created per kernel is a function of the input size. Large inputs lead to more CTAs and hence the likelihood of balanced CTA assignments per SM also increases. For example, BFS shows more balanced circles as the input size increases from 4K (BFS 4096) to 64K (BFS 65536) and to 1M (BFS 1MW). Similarly, GCL also has more balanced circles when the input size is bigger (email is the smallest input set and belgium is the largest).
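The round-robin assignment described above can be reproduced with a few lines of host-side arithmetic. The sketch below is illustrative only; it recovers the RDC example of 5 CTAs on eight SMs and 4 CTAs on the remaining six.

```cuda
// Illustrative sketch of the initial round-robin CTA assignment described above.
// With 14 SMs, at most 6 concurrent CTAs per SM, and the 64 CTAs of RDC, this
// prints 5 CTAs for eight SMs and 4 CTAs for the remaining six.
#include <cstdio>

int main() {
    const int num_sms = 14, max_ctas_per_sm = 6, kernel_ctas = 64;
    int assigned[14] = {0};

    int remaining = kernel_ctas;
    for (int c = 0; remaining > 0 && c < num_sms * max_ctas_per_sm; ++c) {
        assigned[c % num_sms]++;    // round robin across SMs
        --remaining;                // CTAs beyond m*n wait until a slot frees up
    }
    for (int s = 0; s < num_sms; ++s)
        printf("SM %2d: %d CTAs\n", s, assigned[s]);
    return 0;
}
```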
Figure 2.11: Fine grained load distribution: (a) execution time variance across CTAs and (b) coefficient of execution time variation across warps within a CTA.

There is an opposing force to achieving balance as the number of CTAs increases. If different SMs complete their assigned CTAs after different amounts of time, the scheduler will assign more CTAs to an SM that executes faster. Thus the imbalance in CTA assignment increases as the execution time imbalance between CTAs increases. Figure 2.11(a) shows the execution time variance across CTAs as a box plot. The lowest point of the error bar indicates the minimum CTA execution time while the highest point indicates the maximum CTA execution time. The box goes from the first quartile to the median and then to the third quartile. All execution times are normalized to the respective median execution time. To calculate the variance across CTAs, we only included benchmarks that executed more than 4 CTAs in one kernel invocation. In general, a larger input size increases the execution time variation. For instance, in SSSP more nodes need to be searched along the longest path as we increase the input size; therefore, SSSP rmat20 has a more distorted shape than SSSP fla in Figure 2.10. Furthermore, applications that exhibit more warp divergence also have higher execution time variance at the CTA level. For instance, SSSP, AGM, and PR have the highest CTA execution time variation, while APSP has the smallest variation. This corresponds to the observation in Figure 2.8(a) that SSSP, AGM and PR are more divergent, while APSP is well parallelized with full warp utilization.

Figure 2.11(b) shows the variation of execution time across warps within a CTA. We present the coefficient of variation, c_v = σ/μ, where σ is the standard deviation and μ is the average execution time (averaged over all warps); normalizing the standard deviation by the average execution time allows the variation to be compared across different benchmarks. AGM and PR show more than 50% variation, which indicates that their standard deviation is more than half of the average execution time. On the other hand, some other benchmarks, like APSP, GCO, MIS and MST, do not show large variation. GCO, MIS and MST have only 1, 2 and 4 warps per CTA, respectively; therefore, the variation is not expected to be high. APSP, as mentioned previously, is implemented using an optimized matrix multiply operation; therefore, its execution time variation for warps within CTAs is also not high. Compared with non-graph applications, graph applications show higher variation across the warps, 25% versus 7%.

Figure 2.12: Performance w.r.t. scheduler (IPC under the GTO, 2LV and LRR warp schedulers).

2.3.7 Scheduler Sensitivity

Finally, we explored the impact of the warp scheduler on graph application performance.
Fig- ure 2.12 shows the instructions per cycle (IPC) when three different warp scheduling algorithms are used. As we cannot change the warp scheduler in the real machine, we used GPGPU-Sim for this experiment. GTO is a greedy algorithm that issues instruc- tions from one warp until the warp runs out of ready instructions. 2LV uses two level scheduler in which the warps in the ready queue issue instructions until they encounter the long latency memory instructions. Once a warp encounters a long latency memory instruction, then the warp is scheduled out to the pending queue. LRR is a simple round robin algorithm. Among the three algorithms, GTO derived slightly better performance but the performance difference is not significant. Due to poor memory performance and divergence issues, as explained in earlier sections, graph applications have significantly lower IPC than non-graph applications. 2.4 Improve Graph Processing Efficiency In this section, we discuss some potential hardware optimizations for efficient graph processing on GPUs. 46 2.4.1 Reduce Performance Bottleneck Based on the quantitative data shown, graph applications tend to execute kernel and data transfer functions more frequently than non-graph applications. The frequent kernel in- vocations lead to ineffective use of caches as well. Therefore, the performance overhead due to PCI calls as well as long latency memory operations is higher in the graph appli- cations than in the non-graph applications. There are two possible solutions to resolve this issue. First of all, the unified system memory that can be accessed by both CPU and GPU will be very helpful to reduce the performance overhead due to frequent data transfers between CPU and GPU. Recently, AMD announced support for unified memory access [11, 3]. AMD’s proposed hetero- geneous uniform memory access (hUMA) allows heterogeneous processor cores such as CPU and GPU to use the same physical memory. If the unified main memory is used, CPU and GPU can communicate with each other by using a simple memory copy op- eration rather than using PCI-E bus transmission. The memory copy can be conducted within the system memory and hence, the performance can be significantly improved. The second solution is to actively leverage the underutilized SRAM structures such as cache and shared memory for reducing the overhead of long latency memory operation. As our evaluation showed, cache and shared memory are not effectively leveraged in graph processing. Over 60% of the L1 cache accesses encounter misses. The miss rate is not significantly reduced even when larger cache is used. Also, only 8% of shared memory is used in graph applications. These results imply that the data reuse in graph applications is rare. Therefore, we believe the cache and shared memory need to be used efficiently to handle data with limited reuse. One possible way is to use the two SRAM structures as large buffers for data prefetching. Recent studies [94] showed a memory prefetching approach for graph applications in GPU. They reused spare registers to store the prefetched data. However, for graph applications it would be easier to store the prefetched data in the cache or shared memory than in register files. By applying similar 47 prediction algorithm used in the study [94], but using the cache and the shared memory, the overhead due to long latency memory operation can be reduced. 
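Returning to the first suggestion above, the thesis points to AMD's hUMA as one realization of unified system memory; on the CUDA side the same programming-model effect can be sketched with managed memory (cudaMallocManaged, available since CUDA 6). The fragment below is only an illustration of the idea, it is not supported on the Fermi-class GPUs evaluated in this chapter, and the touch kernel is a hypothetical stand-in for a real superstep kernel.

```cuda
// Illustrative sketch: with unified (managed) memory the per-superstep cudaMemcpy
// calls disappear, since both CPU and GPU dereference the same pointer. This is a
// CUDA analogue of the unified-memory direction discussed above, not the thesis design.
#include <cuda_runtime.h>

__global__ void touch(int *flag) { *flag = 0; }   // hypothetical stand-in kernel

int main() {
    int *changed;
    cudaMallocManaged(&changed, sizeof(int));     // visible to both CPU and GPU
    *changed = 1;
    while (*changed) {
        touch<<<1, 1>>>(changed);                 // a real code would launch the superstep kernel here
        cudaDeviceSynchronize();                  // the kernel boundary still acts as the global barrier
    }
    cudaFree(changed);
    return 0;
}
```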
2.4.2 Reduce Load Imbalance According to our evaluation, coarse-grain load distribution in many graph application is well balanced once the input data is large enough. However, the fine grained load distri- bution that is measured across the CTAs, warps, and SIMT lanes exhibits higher levels of imbalance. As briefly explained earlier, vertices in a graph have different degrees. Therefore, the amount of tasks that needs to be processed by each vertex is different and hence the load is imbalanced in graph applications. Given that the CTA execution time is determined by the longest warp execution time and kernel execution time is only deter- mined by the longest CTA execution time, such load imbalance can significantly degrade the overall performance. This problem can be statically resolved by the programmer’s effort. For example, if the programmer collects the vertices that have similar degrees and assigns them to the same CTA, the warp level load imbalance can be resolved. Due to the dynamic nature of graph processing it may be hard to find vertices that have similar degrees at every kernel invocation. Therefore, some sort of hardware support is necessary. The dynamic load monitoring and migration methods that are used in operating system domain [140] might be a solution. For example, if there is a CTA that processes high degree vertices and the kernel’s termination is delayed because of the CTA’s long execution, then migrating some of the warps assigned to the CTA to the other idle SMs might be helpful. Once the migration penalty is low enough, as the migrated warps can use the resources of the idle SM, the execution time can be balanced. 48 2.5 Related Works Che et al. [45] recently showed a preliminary characterization of graph applications on a real GPU machine. They implemented eight graph applications in OpenCL and analyzed the hardware behaviors such as cache hit ratio, execution time breakdown, speedup over CPU version execution, and SIMT lane utilization while running those applications on AMD Radeon HD 7950. In this chapter, we not only run applications on hardware but we also use cycle accurate simulation infrastructure to provide deeper insight into the hardware behavior. For example, we can measure the impact of having no L1 cache to show that L1 cache is entirely ineffective in graph applications. Using detailed simula- tions we can also measure load imbalance metric at the warp, CTA and SM level. The load balancing statistics provide useful insights to optimize the CTA distribution across all the SMs. Burtscher et al. [41] investigated performance impact of irregular GPU programs on NVIDIA Quadro 6000. They compiled eight irregular programs and compared the performance in several aspects with a set of regular programs. They basically measured two runtime-independent metrics, the control-flow irregularity and the memory-access irregularity at the warp level. The metrics are measured while varying the input size and optimizing the code itself. Most of the analysis provided by Burtscher is useful to optimize the application code and hence the purpose of the work is orthogonal to the focus of our research. Che et al. [46] also conducted CUDA application characterization on NVIDIA GeForce GTX 280. The evaluation is similar to [45] but the domain of the evaluated applications is more general than [45]. 
The target application domain of Che’s work [46] and our study is different and the focus of their characterization is software improvement, while this study focuses on understanding existing hardware bottlenecks in GPUs. 49 2.6 Chapter Summary Graph processing is a key component of many data analytics. There have been several studies to optimize graph applications on GPU platform. However, there has not been a study that focuses on how graph applications interact with GPU microarchitectural features. To provide insights to the GPU hardware designers for more efficient graph processing, we measured several (micro)architectural behaviors while running a set of graph applications. To understand the graph application’s unique characteristics, we also ran a set of non-graph applications and then compared the evaluation results. Based on the measurements, we conclude that resource utilization is quite poor across a wide range of application categories, and with the increasing prominence of new application categories such as graph applications this problem will only get worse. 50 Chapter 3 PATS: Pattern Aware Scheduling and Power Gating for GPUs 3.1 Chapter Overview In Chapter 1, we introduced the branch divergence problem when general purpose ap- plications are run on GPUs. In Chapter 2, we demonstrated the severity of branch di- vergence and resource underutilization problem in GPU. In the presence of the branch divergence, the power efficiency of GPUs can be compromised significantly. Recog- nizing the importance of sustaining power efficiency in the presence of divergence, a wide range of solutions [25, 26, 156, 158] have been proposed. Most of these solu- tions propose to power gate unused resources. In addition to power gating unused re- sources, many other orthogonal approaches to improve power efficiency have also been studied [64, 65, 99, 131, 133]. Some examples of these solutions range from reducing register file accesses through caching register content [64], 2-level warp schedulers to improve memory access behavior [133]. The focus of this chapter is two fold. First, we study the branch divergence behavior of various GPU workloads. The surprising result we observed is that in a 32 thread warp, out of possibly 2 32 of branch divergence patterns, each benchmark exhibits very few divergence patterns. In other words, many warps branch exactly the same as other warps 51 in the program. Our results, shown in Section 3.2, indicate that on average 60% of the warps in a variety of GPU workloads have fewer than five branch divergence patterns. The second part of this chapter then uses this critical knowledge to improve power gating efficiency in the presence of branch divergence. As shown in prior work [99], GPU execution units account for 20.1% of dynamic power consumption. In [26], the authors report that static leakage power accounts for 50% of total power consumption in integer units and 90% total power consumption in floating point units. Thus at least 10% of the total GPU power consumption is attributed to leakage power. The authors in [156] applied power gating at a coarser granularity of an entire SM, when it is not in use. Recent work [26] applied power gating at a finer granularity of gating either all the integer units and or all the floating point units. Floating point units can be shut off, when there are only integer instructions in the pipeline, and vice-versa. In this work, we exploit the branch divergence pattern similarity to shutdown unused SIMT lanes. 
Based on the observation that the same divergence pattern occurs repeatedly across multiple warps within a CTA and even across CTAs, we propose to co-schedule warps with similar divergence patterns. When a set of SIMT lanes are unused in one warp, we co-schedule warps with similar SIMT lane idleness behavior to create longer idle periods in SIMT lanes. This long idleness is essential to reduce the overheads of power gating. The rest of this chapter is organized as follows: Section 3.2 presents divergence pattern analysis of GPU workloads. Section 3.3 highlights the challenges of per lane power gating and motivates the need for a divergence pattern aware scheduling technique. Sections 3.4 discusses the proposed PATS techniques and the required microarchitectural support. Section 3.5 presents the simulation methodology and results. We discuss related work in section 3.6. 52 3.2 Explore the Divergence Patterns To study the branch divergence patterns we ran 19 benchmarks from several benchmark suites including Rodinia [46], Parboil [143], and ISPASS [34]. The benchmarks are listed in Table 3.1 and Table 3.2. The benchmark suites were used in the most recent GPU power gating work [26]: five of these benchmarks are purely integer benchmarks, four of them are floating point dominant and four of them are non-divergent benchmarks. We simulated the benchmarks using GPGPU-Sim v3.2.1 [34]. The default NVIDIA GTX480-like configuration provided with GPGPU-sim is used. The baseline is a fermi- like architecture containing 15 streaming multiprocessors (SMs). Each SM has 2 shader processor (SP) units, 1 load/store unit and 1 special function unit (SFU) to handle com- plex computations such as sine and cosine. There are 16 double-clocked CUDA cores in each of the SP units, and thus each SP unit can execute one warp containing 32 threads each cycle. Each CUDA core has a separate integer and floating point pipeline. The register file size is 128 KB per SM, and the maximum warps per SM is 48. The de- fault scheduler, unless stated otherwise, is greedy then oldest first scheduler. We use GPUWattch [99] and McPAT [102] for power estimations. Before we present the branch divergence pattern analysis, it is helpful to provide a quick overview of how active masks are used to control divergent execution within a GPU. Figure 3.1(a) illustrates that basic block A is executed by all the threads within a warp; we assume for illustration simplicity that there are four threads per warp. Hence, the active mask is set to ”1111” when A is executed. The last instruction in A is a branch and the branch can take two paths B or C. The first thread and third thread execute B, while others branch to C. Thus the active mask in B is ”1010” while the active mask in C is ”0101”. We call such divergent active mask a divergence pattern. This divergence can be handled either by software predication or using hardware stack. For a stack based architecture, a SIMT stack structure keeps an active mask at the top of the stack to mark active threads in the current warp [93]. Here, we call the first instruction in B and C as immediatedivergentpoint. 
53 Name R Pattern Count %DynDiv % Total backprop 1 10000000000000001000000000000000 720896 51.6% 6.2% (12.0%) 2 11111111111111110000000000000000 675863 48.4% 5.8% bfs 1 00000000000000000000000000000100 19392 1.15% 0.96% 2 00000000000100000000000000000000 19147 1.13% 0.95% (73.8%) 3 00000000000000010000000000000000 19009 1.12% 0.94% 4 00000000000000000000000000010000 18726 1.11% 0.93% 5 00000000000000000000000100000000 18646 1.10% 0.92% b+tree 1 11111111111111111111111111110000 1088043 54.6% 5.87% 2 10000000000000000000000000000000 206918 10.4% 1.12% (10.8%) 3 00000000000000100000000000000000 32530 1.63% 0.18% 4 00000001000000000000000000000000 26830 1.35% 0.14% 5 00000000100000000000000000000000 26662 1.34% 0.14% cutup 1 11111111111111110000000000000000 45254 97.1% 1.7% (1.7%) 2 00000000000000001111111111111111 1331 2.9% 0.05% gaussian 1 11111111111111110000000000000000 11452316 97.0% 96.6% 2 11110000000000000000000000000000 102305 0.87% 0.86% (99.6%) 3 11101110111011100000000000000000 56186 0.48% 0.47% 4 10001000100010000000000000000000 54678 0.46% 0.46% 5 11001100110011000000000000000000 54678 0.46% 0.46% heartwall 1 01111111111111111111111111111111 10150173 1.75% 0.72% 2 11111111111111111111111111111110 10136403 1.74% 0.72% (41.5%) 3 00111111111111111111111111111111 10107945 1.74% 0.72% 4 11111111111111111111111111111100 10100907 1.74 % 0.72% 5 00011111111111111111111111111111 9916848 1.71% 0.71% hotspot 1 00111111111111000011111111111100 788310 48.5% 22.4% 2 01111111111111100111111111111110 529994 36.7% 32.6% (46.2%) 3 00000000000000000111111111111110 81692 5.0% 2.3% 4 01111111111111100000000000000000 78076 4.8% 2.2% 5 00111111111111110011111111111111 50836 3.1% 1.4% lbm 1 11111111111111111111111000000000 5518624 32.0% 15.4% 2 01111111111111111111111111111111 4319520 25.1% 12.0% (48.0%) 3 11111111111111111111111100000000 3471008 20.1% 9.7% 4 00000000000000000000000100000000 488992 2.8% 1.4% 5 10000000000000000000000000000000 419424 2.4% 1.2% Table 3.1: 5 most common divergence patterns for 19 benchmarks. 
54 Name R Pattern Count %DynDiv % Total LPS 1 11110000000000000000000000000000 292700 32.6% 10.5% 2 01111111111111111111111111111111 205636 22.9% 7.4% (32.2%) 3 11100000000000000000000000000000 185636 20.7% 6.7% 4 11111111111100000000000000000000 72200 8.0% 2.6% 5 10101010101000000000000000000000 34500 3.8% 1.2% MUM 1 00000000000000000000000000000010 11196 1.04% 0.78% 2 00000000000000000001000000000000 10164 0.94% 0.71% (75.4%) 3 00000000000000000000000000000100 9334 0.87% 0.65% 4 00000000000000000000000100000000 9290 0.86% 0.65% 5 00000000010000000000000000000000 9218 0.86% 0.65% nw 1 11111111111111110000000000000000 8265664 46.3% 46.3% 2 11111111111111100000000000000000 638976 3.6% 3.6% (100%) 3 11111111111111000000000000000000 638976 3.6% 3.6% 4 11111111111110000000000000000000 638976 3.6% 3.6% 5 11111111111100000000000000000000 638976 3.6% 3.6% pathfinder 1 11111111111111110000000000000000 925 3.6% 0.36% 2 00000000000000000000000000000111 600 2.3% 0.24% (10.1%) 3 11100000000000000000000000000000 600 2.3% 0.24% 4 00000000000000000000000000000011 500 1.9% 0.20% 5 00011111111111111111111111111111 425 1.6% 0.17% srad 1 11111111111111101111111111111110 3538944 31.7% 4.7% 2 00000000000000010000000000000001 2621440 23.5% 3.5% (15.0%) 3 00000000000000000000000000000001 884736 7.9% 1.2% 4 00000000000000010000000000000000 851968 7.6 % 1.1% 5 00000000000000001111111111111110 458752 4.1% 0.6% WP 1 11111110111111101111111011111110 592802 23.6% 8.2% 2 11111111111111110000000000000000 339682 13.5% 4.7% (34.8%) 3 11111110111111100000000000000000 42289 1.69% 0.59% 4 00000001000000000000000000000000 27022 1.08 % 0.37% 5 00000000000000000000000010000000 18801 0.75% 0.26% *kmeans, lavaMD, LIB, mri-q, sgemm have 0% diverged instructions Table 3.2: 5 most common divergence patterns for 19 benchmarks (Cont.). 55 PC Iteration Active Thread Count InstructionCount Warps Count B 1 3 1 1 1 100 3 B 2 3 3 1 0 100 3 C 1 0 2 2 2 20 3 C 2 0 0 1 2 20 3 Warp 1 Warp 2 Warp 3 A B C D A B C D (a) Example Program (b) Divergence statistics table (c) Multiple warps diverge differently at same PC in different iterations. A/1111 B/1010 C/0101 D/1111 Figure 3.1: Illustration of divergence pattern behavior GPU applications use the notion of a kernel to execute the parallel component of the workload. Usually the kernel code is repeatedly executed by many warps. Furthermore, each warp may iteratively execute the same set of divergent points but with potentially different active masks at each iteration. Such an execution could result in a huge number of divergent patterns. Figure 3.1(c) shows an illustration where three different warps (warp1, warp2, warp3) running the same control flow code shown in Figure 3.1(a) for two iterations. In both iterations the basic block A and D are executed by all threads in all warps. But basic blocks B and C could be executed by different combinations of threads in each warp as illustrated in the figure. For instance, basic block B is executed with three different active masks ”1010”, ”1100”, ”1001” by the three warps in the first iteration, and ”1110”, ”1100”, ”1100” are the active masks of the three warps in the second iteration. In this example a total of five patterns are seen during execution of B. Hence, when there are 48 warps, each with a 32-bit active mask vector, that execute thousands of iterations of the same control flow sequences, there could be thousands of different branch divergence patterns. 
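The per-pattern counts reported in Tables 3.1 and 3.2 come from exactly this kind of tallying. The sketch below is illustrative host-side bookkeeping in the style of a simulator hook (the type and member names are ours, not GPGPU-Sim interfaces): assuming fully populated warps, every dynamic instruction issued under a divergent active mask is charged to that 32-bit mask, and sorting the resulting map by count yields the five most common patterns per benchmark.

// Sketch of the bookkeeping behind Tables 3.1/3.2 (names are illustrative,
// not actual GPGPU-Sim interfaces). Each divergent dynamic instruction is
// charged to its 32-bit active mask.
#include <cstdint>
#include <map>

struct DivergenceStats {
    std::map<uint32_t, uint64_t> pattern_count;  // active mask -> dynamic inst count
    uint64_t diverged_insts = 0;
    uint64_t total_insts    = 0;

    void record(uint32_t active_mask) {
        total_insts++;
        if (active_mask != 0xFFFFFFFFu) {        // not all 32 lanes active
            diverged_insts++;
            pattern_count[active_mask]++;
        }
    }
    // %DynDiv of a pattern = pattern_count[mask] / diverged_insts
    // %Total  of a pattern = pattern_count[mask] / total_insts
};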
56 In order to capture the variance in the number of active masks, we use a divergence statistics table. Figure 3.1(b) shows this table for the warp execution behavior seen in basic blocks B and C in the illustration in Figure 3.1(c). For every basic block that can diverge, there is one entry in this table. For every diverging basic block, there is one entry per each iteration that block is executed. The column titledAggregateActiveMask sums up the active masks of all warps across the CTAs currently executing in the SM that branch to the specific basic block. For instance, in our illustration, the first iteration of B was executed by three warps with active masks ”1010”, ”1100”, and ”1001”. Thus the aggregate active mask field stores ”3111”, which is simply the sum of all active masks for the given iteration of a basic block. This field implies SIMT lane one was used three times during one iteration of the basic block, while all other lanes were used once. The instruction count field in the table is simply the dynamic count of instructions executed in the given iteration of that basic block. The last column, warp count, counts the total number of warps with at least one active mask bit that executed the basic block in the corresponding row. After defining the basic terminology above, we now present a graphic representation of the divergence statistics table. The divergence pattern data for a subset of 8 bench- marks is shown in Figure 3.2. In each of this figure the X-axis of the graph ranges from 1 to 32, indicating the warp width. For each iteration of every diverging basic block there is one point on the Y-axis. The Y-axis is a sorted list of the starting addresses (PC) of each diverging block concatenated with the iteration number. For instance, consider a program with just three diverging basic blocks, and the starting PCs of these basic blocks are PC1, PC2 and PC3. Note that each of these diverging basic blocks may be executed by a subset of warps. Let us say the execution sequence of these diverging blocks is PC1, PC2, PC3, PC1, PC1. In this execution sequence, the diverging basic block starting at PC1 is executed three times; PC2 and PC3 were executed once. As mentioned before we simulated a Fermi- like configuration where at most 48 warps from one kernel can execute concurrently in 57 (a) backprop (b) bfs (c) hotspot (d) LPS (e) lbm (f) nw (g) pathfinder (h) srad Figure 3.2: Divergence patterns for divergent benchmarks a single SM. Thus at most 48 warps could have executed each of the diverging basic blocks. Referring to divergence statistics table in Figure 3.1(b), the maximum value for any given SIMT lane in the aggregate active mask field will be 48 during any one iteration of the three basic blocks, PC1, PC2 or PC3. The aggregate active mask vector of PC1’s first iteration is then plotted as a single horizontal line in Figure 3.2. We then plot the second and third instance of PC1’s aggregate active mask which are plotted as the second and third horizontal lines. Once PC1 is processed, we then plot PC2’s aggregate active mask followed by PC1’s aggregate active mask. Each row of the graph shows the value of the aggregate active mask vector in color scale. Red maps to 48, which is the maximum number of warps concurrently executing in a shader core in Fermi. Green maps to 24. Dark blue maps to zero, means that none of the 48 warps has an active thread in that SIMT lane for the given PC in a given iteration. 
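Each row of Figure 3.2 is derived from the aggregate active mask just described: a per-lane sum of the active-mask bits of every warp that executed a given iteration of a diverging basic block. A minimal sketch of that reduction is shown below (illustrative host-side code, with the 32-lane, 48-warp Fermi limits in mind); summing the three masks "1010", "1100" and "1001" from Figure 3.1(c) lane by lane gives 3 for the lane used by all three warps and 1 for each of the others, i.e., the "3111" aggregate discussed above.

// Sketch: per-lane aggregation of warp active masks for one (PC, iteration)
// row of Figure 3.2. Each lane's entry counts how many of the co-resident
// warps (at most 48 on the simulated Fermi-like SM) had that lane active.
#include <array>
#include <cstdint>
#include <vector>

std::array<int, 32> aggregate_active_mask(const std::vector<uint32_t>& warp_masks)
{
    std::array<int, 32> lane_use{};                 // zero-initialized
    for (uint32_t mask : warp_masks)
        for (int lane = 0; lane < 32; ++lane)
            lane_use[lane] += (mask >> lane) & 1;   // 1 if this warp used the lane
    return lane_use;
}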
We further differentiate the importance of each basic block by using the total instruction count, as defined earlier, as the weight of each line that is plotted. If PC1 has 100 instructions and PC2 has 20 instructions, the line for the active mask of PC1 will be five times thicker than the line for PC2.

Figure 3.3: Divergence control flow of pathfinder

As shown in Figure 3.2, many benchmarks exhibit visually interesting divergence patterns. Except for bfs, the other benchmarks have divergence patterns that are neatly blocked, with zeros (dark blue) clustered together. This pattern implies that the same branch divergence behavior is encountered across multiple iterations of the divergent block. But even more critically, the divergence similarity is also quite persistent across different divergent basic blocks. To understand the reason for this strong divergence similarity across multiple iterations of the same basic block and across different basic blocks in many benchmarks, we analyzed the source code of each benchmark and identified the potential divergence points and the conditions used within branch instructions at the divergence point. The most common reason for this divergence similarity occurs when a branch condition is dependent on a thread-specific static value. The example code in Figure 3.3 shows the branch condition in the innermost loop of the pathfinder [46] benchmark. Whether the if branch is taken or not depends on the value of the x component of the thread id (variable tx). For each iteration of the outermost loop, the innermost loop iterates as many times as the outermost loop's index (variable t). This condition is true for all the warps executing the kernel. Once inside the inner loop, a warp diverges simply based on the thread id, as shown in the if condition within the inner loop. Each horizontal line in Figure 3.2 for pathfinder has an aggregate active mask pattern of shrinking ones, as fewer threads in a warp satisfy the if condition after each outer loop iteration. But the weight of each line (thickness) is the same since the total instruction count executed on each iteration is identical. Thus the range of qualified thread ids decreases with the pyramid height on both sides in each innermost loop iteration, and hence the pathfinder graph in Figure 3.2 has triangle patterns. The benchmark nw also has a similar branching behavior, where the innermost loop branch is dependent on the thread id and the innermost loop iteration count is set by the outermost loop. Hence, whenever the branch condition is dependent on a statically determined value, such as the thread id, the branch divergence patterns will remain stable across all warps. A similar observation was made in [130].

Figure 3.4: Divergence control flow of bfs

On the other hand, as shown in Figure 3.4, kernel #1 of the bfs benchmark has a branch condition that is dependent on the value of the graph node visited by that thread in that iteration. Therefore, as shown in Figure 3.2b, the divergence pattern of the bfs benchmark is unpredictable within each warp as well as across different iterations. We show the five most common divergence patterns for each benchmark in Table 3.1. The column titled Count shows the total number of instructions (dynamic count) executed under each pattern. The column titled %DynDiv shows the instructions (dynamic count) that executed under that divergence pattern as a fraction of all the instructions that diverged.
The column titled %Total shows the instructions (dynamic count) that executed under that divergence pattern as a fraction of all the instructions, both diverged and non-diverged, executed in that application. The percentage under the name of each application shows the fraction of total diverged instructions over all the instructions executed. For backprop there are only two divergence patterns throughout the entire application, while for bfs there is no dominant divergence pattern since each pattern accounts for roughly 1% of the total diverged instructions. In our evaluations we noticed that bfs has 2000 divergence patterns, and most of these divergence patterns have many leading zeros and just a few ones. Since there is no else condition in the bfs branch code shown in Figure 3.4, there are no complementary divergence patterns with leading ones. Even though pathfinder has a well-defined structure to its divergence patterns, as explained earlier, the height of the pyramid determines the number of divergence patterns that can be seen. If the pyramid is short there will be few divergence patterns, and vice versa. In our execution the pyramid is quite tall and hence there are many patterns, each accounting for only a fraction of the total diverged instructions. Many benchmarks exhibit five or more divergence patterns, but the top two divergence patterns account for more than 50% of the diverged instructions. On average, the five most common patterns account for 60.3% of total diverged instructions.

3.3 Challenges in Per Lane Power Gating in GPUs

There are many potential ways to exploit the predictable branch divergence behavior shown in the previous section. One such use is power gating. We propose to explore the possibility of power gating unused SIMT lanes by exploiting the repeated branch divergence patterns. With technology scaling, static power is becoming increasingly prominent [38]. In [26] the authors state that static power accounts for more than 50% of the total power consumption in a GPU. Power gating is a well-known technique for reducing static power consumption, but it has to be used with caution. Otherwise, the power gating overheads can lead to a negative performance impact as well as increased power consumption. As is shown in [75], the overheads include the break-even time and the wakeup delay. The energy overhead of entering sleep as well as the overhead of waking up has to be compensated before power gating can provide any power savings. The longer a unit is power gated, the more static energy it saves, and eventually the static energy savings cross over the power overheads of using the sleep transistor. The number of cycles spent in the power-gated state at which this crossover occurs is called the break-even time. When the gated unit is needed for computation, there is a cost to wake up the unit since it has to be powered on to reach full Vdd; this is called the wakeup delay. A typical value for the break-even time is about 9-19 cycles [75], and a typical value for the wakeup delay is about 3-9 cycles, for logic blocks that are similar in size to the execution units in GPUs. In the remainder of this chapter, by default we use a break-even time of 14 cycles and a wakeup delay of 3 cycles. We also include our energy saving and performance results for 9-19 cycles of break-even time and 3-15 cycles of wake-up delay in the sensitivity analysis section of the chapter. Applying power gating per SIMT lane poses new challenges.
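Before turning to those challenges, the break-even condition itself can be stated directly: gating an idle SIMT lane only pays off if the leakage energy saved while the lane is off exceeds the energy spent switching the sleep transistor. The sketch below is a minimal model of that accounting, using the default parameters above (14-cycle break-even time, 3-cycle wakeup delay) and an illustrative, normalized per-cycle leakage constant; it is not derived from GPUWattch numbers.

// Sketch: energy accounting for gating one SIMT lane over an idle window.
// By definition, the sleep/wake switching overhead equals BREAK_EVEN_CYCLES
// worth of leakage, so gating is profitable only for longer idle windows.
// The wakeup delay does not burn extra leakage in this model; it shows up
// as a performance penalty if a warp needs the lane before it is back on.
const int    BREAK_EVEN_CYCLES = 14;   // default used in this chapter
const int    WAKEUP_CYCLES     = 3;    // cycles before a gated lane is usable again
const double LEAK_PER_CYCLE    = 1.0;  // normalized leakage energy of one lane

double net_energy_saved(int idle_cycles)
{
    return (idle_cycles - BREAK_EVEN_CYCLES) * LEAK_PER_CYCLE;  // negative if too short
}

bool worth_gating(int predicted_idle_cycles)
{
    // A short idle window of a few cycles loses energy; an idle window of
    // hundreds of cycles is almost entirely savings.
    return predicted_idle_cycles > BREAK_EVEN_CYCLES;
}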
First, in order to ef- fectively apply power gating at the granularity of a SIMT lane, it is necessary to keep the SIMT lane idle for long enough periods. However, the typical idleness window of a SIMT lane is significantly shorter than the required break-even time. In fact, our results show that, for the hotspot benchmark, 72% of idle periods for any single integer unit is less than five cycles. The primary reason for the short idle periods is the scheduling deci- sions made by the default two-level scheduler [120], which schedules all ready warps in a round robin fashion. Therefore warps with different divergence patterns are interleaved with each other. In particular if a warp with a divergent behavior is co-scheduled with a warp that fully utilizes all the SIMT lanes, then the idle period of each SIMT lane is constantly interrupted by the warp that needs to use all the SIMT lanes. However, in [26] the authors proposed to power gate all the integer or floating point pipelines within each SM as a whole. Thus they needed to pay the wakeup penalty only when they issue an instruction that requires the gated unit type (INT or FP). But with SIMT lane power gating of either an integer or floating point unit, even if a sched- uled warp requires just one gated SIMT lane resource then the entire warp must pay 62 the penalty of wakeup. Thus the probability of paying the wakeup penalty increases significantly with per lane power gating. Blackout is a technique proposed in [26] to reduce the negative effects of power gating. The authors proposed to force a unit to stay in sleep mode for at least the break- even time. When this technique is applied for the entire cluster of INT or FP units then the scheduler simply forces all the warps that require that cluster to wait for the blackout period to complete. Given that there are plenty of warps with either INT or FP instructions the scheduler can still find sufficient number of warps of a given type to schedule even if the other instruction type resource is in the blackout state. Thus blackout in effect clusters the warps that need the same resource together. But this technique when applied to a SIMT lane power gating may cause significant performance penalties. When blackout is applied at the SIMT lane, the warp scheduler has to essentially find other warps that also don’t use that SIMT lane within the active warp queue. Unfortunately, there are warps with a fully utilized active mask that are interspersed with divergent warps in the ready queue. Our results show that on average there is 3.4% performance overhead when blackout is applied on per lane power gating, as the warp scheduler is unable to issue a warp that needs a lane in the blackout state. In the worst case scenario, it causes 14% performance overhead in the MUM benchmark. Many lanes are forced into the blackout state whenever the lane idleness surpasses the idle detect window length. Then all warps that need that SIMT lane are blocked until the end of the blackout period, leading to the performance degradation. Therefore, to tackle the challenges listed here, there is a critical need to develop a new power gating approach for per lane power gating to work well. In the next section, we describe our three proposed techniques, Pattern Aware Two-level Scheduler(PATS), Deterministic Look Ahead Rule (LAR), Pattern and Instruction Type Aware scheduler (PATS++). We show that these techniques are able to target the challenges effectively with minimal overhead. 
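For reference, the per-lane blackout constraint discussed above can be written as a simple issue-time check. The pseudologic below is our own illustrative sketch of the mechanism (it is not the implementation from [26]), and it makes explicit why a divergent warp mix stalls: any ready warp whose active mask touches a lane that is still inside its forced blackout window cannot issue until the corresponding timer expires.

// Sketch: per-lane blackout check on the warp issue path (illustrative only).
// blackout_left[lane] holds the remaining forced-sleep cycles for that lane;
// it is set to the break-even time when the lane is power gated.
#include <cstdint>

const int NUM_LANES = 32;
int blackout_left[NUM_LANES] = {0};

bool can_issue(uint32_t active_mask)
{
    for (int lane = 0; lane < NUM_LANES; ++lane)
        if (((active_mask >> lane) & 1) && blackout_left[lane] > 0)
            return false;            // warp needs a lane still in forced blackout
    return true;                     // every required lane is awake or wakeable
}

void blackout_tick()                 // called once per scheduler cycle
{
    for (int lane = 0; lane < NUM_LANES; ++lane)
        if (blackout_left[lane] > 0) --blackout_left[lane];
}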
3.4 Pattern Aware Two-level Scheduler (PATS)

Figure 3.5: Effect of warp scheduler on idle cycles. (a) Warps scheduled in an interleaved fashion by the two-level scheduler. (b) Warps of the same pattern prioritized by the pattern aware scheduler.

We propose PATS to prioritize scheduling warps with similar divergence patterns. The prioritized scheduling technique can prolong the length of the idle periods of any given SIMT lane. Figure 3.5 illustrates the key idea of the pattern aware scheduler. For the purpose of illustration, we assume there are 8 threads per warp (rather than the 32 threads used in our results) and hence there are 8 SIMT lanes to support the execution of each warp. In Figure 3.5a and Figure 3.5b, a warp is scheduled for execution on 8 SIMD lanes. Again for demonstration purposes, let us assume that the break-even time is three cycles, with zero idle detect time and zero wakeup delay. In Figure 3.5a, the default scheduler is not cognizant of the divergence patterns. Hence, different patterns interleave with each other in the scheduler. Although many lanes are intermittently active, none of the lanes has a long enough idle window to offset the break-even time of three cycles. Figure 3.5b shows the behavior with PATS. PATS uses the same set of warps as before, but prioritizes issuing warps with the same divergence patterns. Thus warps with the same divergence are scheduled as a cluster, creating longer idle windows of four cycles for the idle SIMT lanes and thereby increasing the opportunity to power gate.

Figure 3.6: Actual execution trace of the INT units collected from the hotspot benchmark. (a) Two-level scheduler. (b) Pattern aware scheduler.

Figure 3.6a and Figure 3.6b show an actual execution trace of the 32 integer units' usage collected from the hotspot benchmark. The X-axis represents the 32 lanes from left to right and the Y-axis represents the time progression from top to bottom. In Figure 3.6a, the default two-level round-robin scheduler in GPGPU-Sim v3.2.1 is used. In Figure 3.6b we use PATS. As explained earlier, we assume that each SM has two shader processors (SPs), each with 16 double-clocked SIMT lanes, and each SIMT lane has a separate INT and FP pipeline. Two warps can be issued to the two SPs in any given cycle. Any full warp is assigned to one SP by simply scheduling the odd and even SIMT lanes one after the other in two half clocks. Thus the divergence pattern for a warp is converted from a 32-bit value to a 16-bit value by OR-ing odd bits with even bits. We then concatenate the active masks of the two instructions that were issued in the same cycle as a 32-bit value in this illustration. An active mask bit value of zero is represented as blue and the red color represents an active mask bit value of 1. Although the default two-level scheduler is able to find some stretches of idleness, the SIMT lane idleness is interrupted by fully active warps. In fact, in Figure 3.6a the longest stretch of idleness is 200 cycles for one SIMT lane and most SIMT lanes have fewer than 20 idle cycles. The PATS approach, shown in Figure 3.6b, has extended periods of idleness for many SIMT lanes. For instance, the longest blue blocks in the middle SIMT lanes are as long as 497 continuous idle cycles.
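The scheduling change behind Figure 3.5 amounts to a different comparison rule in the warp selector: among ready warps, those whose divergence pattern matches the pattern currently being drained are issued first, so their idle lanes stay idle back-to-back. The sketch below is an illustrative selection loop, not the scheduler hardware; the per-warp pattern id is the 3-bit tag introduced in the next subsection, and the age-based fallback stands in for whatever baseline policy (e.g., greedy-then-oldest or two-level) PATS is built on.

// Sketch: pattern-aware warp selection (illustrative, not the actual hardware).
// Prefer ready warps tagged with the pattern id currently being discharged;
// otherwise fall back to the baseline age-based order.
#include <cstdint>
#include <vector>

struct ReadyWarp {
    int     warp_id;
    uint8_t pattern_id;   // 3-bit tag assigned at the divergence point
    int     age;          // smaller = older; baseline tie-breaker
};

int select_warp(const std::vector<ReadyWarp>& ready, int discharge_pattern)
{
    int best = -1;
    for (int i = 0; i < (int)ready.size(); ++i) {
        if (best < 0) { best = i; continue; }
        bool cur_match  = ready[i].pattern_id    == discharge_pattern;
        bool best_match = ready[best].pattern_id == discharge_pattern;
        if (cur_match != best_match) {            // matching pattern wins outright
            if (cur_match) best = i;
            continue;
        }
        if (ready[i].age < ready[best].age) best = i;   // baseline ordering
    }
    return best < 0 ? -1 : ready[best].warp_id;
}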
3.4.1 Design Issues of Pattern Aware Two-level Scheduler (PATS) In this section, we describe the micro-architectural modifications needed in a GPU pipeline to implement PATS. PATS can be built on top of any existing warp scheduler, such as greedy then oldest scheduler [133] or two-level round robin scheduler [120]. To con- cretely present the implementation we built PATS on top of the two-level scheduler. We extend the default two-level scheduler with three enhancements: (1) Pattern table, (2) Per pattern warp subset and (3) Warp charger. Pattern table: We augment the warp scheduler with a 7-entry pattern table. Each entry is a 16-bit vector that holds the most common divergence patterns. Note that, as mentioned earlier, each warp is scheduled in two half cycles on 16 double clocked CUDA cores in our implementation. If one was to use 32 CUDA cores without double clocking, then the entry size will be 32 bits. In addition, each entry has a 6-bit warp count field that counts the number of warps with the same divergence pattern, and 1-bit valid field. Note that at most 48 warps can be resident in each SM, and hence, the warp count field is six bits. All entries are initialized as invalid at the beginning. The pattern table is updated only at the beginning and at the end of each divergence. In current GPU designs each time a warp encounters a divergence or re-convergence point, the SIMT stack is updated. As described in prior work [93], at the divergence point a new SIMT stack entry is pushed on top of the existing SIMT stack entries and at the re-convergence point the SIMT stack entry at the top is popped. We propose an addition to the SIMT stack update process. Every time an entry is pushed on top of the SIMT stack the active mask vector that is being pushed on the SIMT stack is read out and it is matched against the top five active mask entries of the pattern table. If there is a hit in the pattern table, then the 6-bit warp count is incremented. Every time the SIMT stack entry is popped the active mask that is being removed is also matched with the top five pattern table entries. On a hit the warp 66 count field is decremented. If the warp count field reaches zero that implies there are no active warps currently executing with the same active mask. Hence, the entry whose warp count reaches zero is invalidated. If there is no hit in the pattern table, then the active mask that is being pushed on top of the SIMT stack is allocated an entry in the pattern table. First, the insertion process tries to find an invalid entry in one of the top five pattern table entries. If no invalid entry is available, it implies that there are already five divergence patterns in the machine with at least one warp with that pattern still active. Rather than perform a complex replacement algorithm, we use the last two entries in the pattern table to find an approximate match. Entries 6 and 7 in the pattern table are special entries with at least four leading zeros (entry 6) or four trailing zeros (entry 7). The basic intuition is that if a pattern can not be exactly matched with the top five patterns then we try to see if the pattern has leading or trailing zeros. This is based on our empirical observation that many patterns in fact have either trailing or leading zeros. Rather than match perfectly, we try to match the current active mask approximately when the pattern table has five valid entries already. 
If the current active mask has no trailing or leading zeros, we simply treat it as a random pattern and we do not store that entry in the pattern table. Each active mask entry that is pushed on the top of the stack is assigned a pattern number. In our implementation we simply use the pattern table index where the pattern was matched as the pattern id. For instance, if the active mask vector matched entry 4, then that active mask is tagged as pattern number 4. Note that this process of searching for patterns and assigning the pattern id is not on the critical path. The active mask is pushed onto the SIMT stack and the matching process is conducted off the critical path. Once the active mask is associated with a pattern type (which is the pattern table index in our case), there is no need to repeat the pattern assignment until the next divergence point. We also designed a simple optimization to reduce the number of patterns that are considered for insertion into the pattern table. This optimization is based on the observation that if the divergent code is short and executes only a few instructions before diverging again or reaching a re-convergence point, then there is no reason to insert this short-lived divergence pattern into the pattern table. Hence, every time a new divergence point is reached we simply subtract the current PC from the re-convergence PC (RPC). If there is a separation of at least 100 bytes between PC and RPC, which roughly translates into 12 instructions, then we consider that divergent active mask pattern for insertion into the pattern table. Otherwise, we simply ignore that pattern. This optimization effectively reduces the number of searches and insertions in the pattern table for short-lived patterns. Note that RPC and PC are both already available in the existing SIMT stack hardware and hence this distance can be easily computed. Furthermore, note that this optimization is not perfect, since the distance between RPC and PC is not the true metric for measuring the size of a divergent path. We would need to measure the distance between the PC and the jump PC within an if-then-else condition to truly measure the size of the if and else paths. Nonetheless, this rough optimization works well by reusing existing data fields in the SIMT stack.

Per pattern type active warps subset: Since our goal is to schedule warps with the same divergence patterns together, we first need to identify warps with similar divergence patterns. We augment the SIMT stack with a 3-bit pattern id, which essentially stores the pattern table index where the pattern is currently matched for each warp. This pattern id is then used by the scheduler to decide which warps to issue. This decision is made in conjunction with the warp charger logic, which we describe next.

Warp charger: The last hardware enhancement proposed in our work is the warp charger. The warp charger has the responsibility of gathering a large number of warps with the same pattern type so that the scheduler can issue them in a clustered fashion.

Figure 3.7: Illustration of the warp charger state machine (states: idle detect, charging, discharge; plotted as the active warp count of pattern k over cycles)

As described in Figure 3.7, at the beginning the warp charger is in the idle detect state. The warp charger then simply picks the first valid entry in the pattern table, decides to charge that divergence pattern, and enters the charge state. In the charge state it informs the scheduler of the pattern id it has decided to charge.
Then the scheduler gives lowest priority to that pattern id when it is scheduling warps. Thus it holds off issuing warps with that pattern id thereby allowing other warps to issue. The warp charger stores the pattern id of the pattern that is being charged (which is the pattern table index of the first valid entry) in a 3-bit register. Then every time a warp enters the active queue in a two-level scheduler, it checks the pattern id of the entering warp with the pattern id it has marked for charging. If there is a match it increments a charge counter. Once the counter reaches a threshold, which is set at eight in our current implementation, then the warp charger enters the discharge state. In the discharge state the warp charger informs the scheduler that pattern id that it wants to discharge. The scheduler then issues only warps with the discharge pattern id. Note that the warp charge counter is incremented only when a warp enters the active queue. Hence, during the discharge process all warps that are being discharged will be already available in the active queue of the two-level scheduler. If the scheduler can’t find enough warps with the discharge pattern id in any given cycle, it stalls issuing warps rather than issue a different divergence pattern. This approach may appear too 69 aggressive in providing power gating at the expense of performance. But note that the discharger only informs the scheduler when it sees a minimum of eight warps in the active queue with the same divergence pattern. Since each warp may execute many instructions cumulatively the length of idle cycles for any given SIMT lane in the current divergence pattern can be significantly enhanced. Hence, in practice warp discharger increases the effectiveness of power gating by allowing the power gating hardware to safely shutdown all SIMT lanes that are going to be unused during the discharge process. The discharge process stops when the warp count of that divergence pattern reaches zero. Then the warp charger selects the first valid pattern id from the pattern table to charge and informs the scheduler to give lowest priority to the selected divergence pattern. Note that our warp charger works well to deal even with benchmarks with highly random warp divergence patterns. For benchmarks, such as bfs, that have many random warp divergence patterns the warp charger may pick one pattern for charging. But if that pattern does not appear sufficiently often in the active queue the warp charger may never enter the discharge state. Note that in charge state the scheduler only gives lower priority to the selected divergence pattern. But if there are no other warps to issue the scheduler will ignore the warp charge request and continue to issue the charging patterns. Thus the warp charger provides an excellent opportunity to improve power gating efficiency without negatively impacting performance. 3.4.2 Gating Penalty Avoidance Using Deterministic Lookahead In this section we describe how to avoid unnecessary sleep and wakeup penalty for power gating a SIMT lane. We propose a new technique called deterministic look ahead rule (LAR). Our key idea is inspired by determinist clock gating [101], which points out that for many stages in the pipeline, the circuit block’s usage is deterministic for several cycles into the future. Hence, clock gating can be disabled and enabled with near-zero penalty by exploiting this determinism. 
Our key observation is that as soon as a warp is issued to the operand collector stage, this warp will arrive at the execution unit in the immediate future. Note that the operand collector stage is simply a pipeline stage used in GPUs to read the register input operands after a warp has already been scheduled for execution. Since register reads can take multiple cycles in GPUs, the operand collector stage buffers the input operands and waits for all input operand reads to complete before executing the instruction. The typical time a warp stays in the operand collector is 2-10 cycles. Given this knowledge, it is prudent not to activate power gating for a SIMT lane that is going to be used by any warp that has entered the operand collector stage. Thus our power gating hardware will not start to power gate any SIMT lane as long as there is at least one warp in the operand collector stage that requires the SIMT lane being considered for power gating.

Figure 3.8: Illustration of the deterministic look ahead rule (LAR) in the SIMT pipeline (fetch, decode, issue, operand collector, INT/FP/SFU/LDST execution units, writeback), with per-lane LAR counters incremented at operand collector entry and decremented at exit, feeding the power gating (PG) unit.

Figure 3.8 shows a SIMT pipeline and the implementation of LAR. A per-lane counter is used to track the pending warps in the operand collector. Whenever a warp is issued to the operand collector buffer, the counter of each active lane is incremented by one. When the warp leaves the operand collector, the counter of each corresponding lane is decremented by one.

3.4.3 Pattern and Instruction Aware Scheduler (PATS++)

PATS as described in the previous section simply uses pattern similarity to prioritize warp scheduling for each SIMT lane. Hence, the assumption is that we apply power gating to the entire SIMT lane. Recall that in our baseline architecture each CUDA core has a separate INT and FP pipeline. We use a baseline with separate INT and FP pipelines since modern hardware designs separate them due to the vastly different representations of INT and FP numbers. We enhance PATS to take advantage of the availability of separate INT and FP units. In this section we design PATS++, which is essentially PATS that also takes into account the instruction type (INT or FP) when making a warp scheduling decision. Thus PATS++ combines pattern similarity information with instruction type information. The 3-bit pattern id information is enhanced with a 2-bit instruction type field (INT/FP/SFU/LDST). The scheduler then prioritizes a warp with the same pattern for scheduling, but when there are multiple warps with the same pattern available it also prioritizes issuing the warp that has the same instruction type as the previously issued warp. Each time a warp enters the active warp queue, PATS++ checks whether it is an INT or FP instruction and increments a 6-bit counter for either the INT or FP type. Initially the warp scheduler selects the INT type as the priority. But if the INT counter goes down to zero it simply switches to prioritize the FP type, and toggles back to the INT type when the FP counter goes to zero. We assume that the per-lane power gating technique is implemented as shown in [75]. The approach uses idle detect logic and ready instruction detect logic. The idle detect logic can be implemented as a counter that is incremented every time an idle cycle is detected and cleared whenever a ready instruction is detected.
Whenever the counter hits the idle detect threshold, the power gating logic will trigger the power gating signal for that specific SIMT lane.

Scheduler: PATS requires each warp entry in the scheduler to store an additional 3-bit pattern id.
We use a break-even time of 14 cycles and wakeup delay of 3 cycles. On average this default approach saves 33.0% of the static energy for INT units and 45.4% for FP units. Note that some divergent benchmarks like backprop and MUM, the energy savings are virtually non-existent due to significantly increased transitions between wakeup and sleep states. For the same rea- son, cutcp even has negative energy savings.These negative savings are due to frequent interruptions to a gated SIMT lane due to short idle cycles. The second bar labeledGates shows power savings of gates approach as described in [26]. In this approach an entire cluster of INT or FP units is power gated and they also use blackout and adaptive idle detect approaches, which were described earlier in 74 33.0% 52.1% -‐30% -‐10% 10% 30% 50% 70% 90% b+tree backprop bfs cutcp gaussian heartwall hotspot kmeans lavaMD lbm LIB LPS mri-‐q MUM nw pathfinder sgemm srad WP avg Integer Energy Saving Per Lane Gates Pats Pats++ (a) Integer Unit 45.4% 66.4% 0% 20% 40% 60% 80% 100% b+tree backprop bfs cutcp gaussian heartwall hotspot kmeans lavaMD lbm LIB LPS mri-‐q MUM nw pathfinder sgemm srad WP avg Fp Energy Saving Per Lane Gates Pats Pats++ (b) Float Point Unit Figure 3.9: Static energy impact of proposed techniques. Section 3.3. Gates is most beneficial for benchmarks with a good instruction mix, like cutcp, LIB and mri-q. Those benchmarks have lots of integer instruction as well as floating point instructions. On the other hand, as is shown in Figure 3.9b, b+tree, bfs, lavaMD, MUM, nw, pathfinder don’t have any FP instructions. In this case, gates is unable to find other instruction types to prevent issuing INT instructions. As a result, static energy savings for INT units are limited with gates. The third bar (labeledPATS) shows the savings when we apply PATS. PATS pro- vides 12% and 6% more energy savings on divergent benchmarks for INT and FP units compared to gates. PATS improves energy savings on many divergent benchmarks with 75 dominant patterns, like backprop, lbm and srad. Even for benchmarks without strongly dominant divergence patterns, like bfs and MUM, PATS can still achieve energy savings. Note that in gates the authors used blackout that forces a unit to stay gated even if the idle window is small. PATS does not use blackout; thus for cutup and mri-q, which have many short idle cycles, the unit cycled frequently resulting in negative savings with PATS. It is clear that PATS would not work well on non-divergent benchmarks, and therefore kmeans and lavaMD , which are non-divergent workloads, do not benefit from PATS. The 90% energy saving of kmeans is mainly contributed by lots of long idle cycles. The last bar labeledPATS + + shows the power savings with our enhanced PATS that gives priority to instruction type in addition to the pattern id. Clearly, PATS++ provides the best power savings compared to any of the previously proposed techniques. Only LPS and pathfinder degrades slightly due to the performance overhead. Figure3.9b shows the energy savings of floating point units. b+tree, bfs, lavaMD, MUM, nw show almost 100% savings, because those are integer only benchmarks. We excludes those for calculating average energy savings in the FP charts. For other benchmarks, we can see similar trends as INT. PATS and PATS++ both can outperform prior techniques. On average, PATS++ improves energy savings by 14% for both INT and FP units. Overall, PATS++ saves 52.1% of the INT unit static energy and 66.4% of FP unit static energy. 
Execution unit leakage accounts for roughly 10% of the total GPU energy consumption. Therefore, PATS++ would gain 5%6% overall GPU energy savings. Note, there’s no dominant component in GPU consumes most of the energy. Execution Unit, Register File, and DRAM each takes around 20% energy consumption. To improve GPU energy efficiency, we have to improve each component. This chapter focus on execution unit. With technology scaling down, as static energy consumption increases the contribution of PATS++ can be even more significant. 76 0.6% 0.7 0.8 0.9 1 1.1 1.2 b+tree backprop bfs cutcp gaussian heartwall hotspot kmeans lavaMD lbm LIB LPS mri-‐q MUM nw pathfinder sgemm srad WP geomean Normalized ExecuQon Time Per Lane Gates Pats Pats++ Figure 3.10: Performance impact 3.5.3 Performance Impact Figure 3.10 shows the performance impacts of all the techniques mentioned above. The Y-axis is the execution time normalized to the default scheduler without power gating. Per lane has an average of 2% performance overhead. In the worst case, it can incur about 8% of performance loss for MUM. The performance loss is primarily due to power gating of SIMT lanes, which are just gated but then immediately required for execution by a fully active warp. Gates suffers about 1% performance loss and benchmarks with limited instruction mix cause the worst case performance degradation. PATS alone degrades performance by 2% but PATS++ reduces the overhead to 0.6%. The way Gates, PATS and PATS++ change the scheduling order of warps may occasionally lead to increase in the performance for some benchmarks. Most of these improvements were due to memory access re-ordering that lead to slightly improved cache hit rates. 3.5.4 Sensitivity to Power Gating Parameters Figure 3.11b and Figure 3.11a show the performance and energy savings sensitivity to power gating parameters for both default per lane and PATS++. We increase the break- even time and wake-up delay in this study. Overall when the break-even time increased 77 20% 40% 60% 80% 9 14 19 9 14 19 Integer Fp Sta4c Energy Saving PATS++ Per Lane (a) Break-even Time 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 3 6 9 15 Wakeup Cycles Normalized Exec Time PATS++ Per Lane 0% 20% 40% 60% 80% 3 6 9 15 3 6 9 15 Integer Fp Sta6c Energy Savings (b) Wake-up Delay (be=14) Figure 3.11: Sensitivity of power gating parameters to 19 cycles, PATS++ still provides 50% of static energy saving in integer units and 65% of floating point units. The wakeup delay increase has the most significant impact on both performance and energy savings. However, our proposed PATS++ has significantly slower degradation than the default technique of power gating per lane without consider- ing the divergence pattern. 3.5.5 Implementation Complexity Header/footer transistor for gating must be sized based on the amount of switching cur- rent. Thus per-lane gating transistor is smaller than the per-cluster gating transistor. The total gating transistor area, whether one uses per lane or per cluster, is roughly the same. The primary cost is the control logic overhead which is accounted for in our results. We use NCSU PDK 45nm library to estimate the area and power overhead for various regis- ters, counters and logic blocks used in PATS and PATS++. We also extract the power and area of an SM from GPUWattch [99]. An SM occupies 48:1mm 2 and consumes 1.92 W of dynamic power and 1.61 W of leakage power. The total area overhead of PATS++ is 6269:1um 2 , resulting in 0.01% area overhead. 
The dynamic power is 0.015 W and the leakage power is 6.6e-5 W, accounting for 0.78% of the dynamic power and 0.004% leakage power overhead. 78 3.6 Related Work Handling SIMD Divergence: Handling SIMD divergence has been focus of many prior works in GPUs. Fung et al, first proposed dynamic warp formation [93] to dy- namically regroup threads with different home lanes executing the same instruction into a new warp. Thread block compaction [61] was then proposed to exploit the observa- tion that divergence within a thread block occurs with high temporal locality. Recent research [130] shows that many idle lanes are aligned because branch conditions are de- pendent on programmable values such as thread ids, therefore requiring shuffling lanes to create better warp compaction. Intra-warp compaction [149], also targeted eliminating the aligned idleness of SIMT lanes to improve compaction opportunities. On the other hand, we exploit the aligned SIMT lane idleness to improve power gating efficiency. Predictive Power Management Techniques: Power Management usually incurs some performance and power overhead. Various techniques have been proposed to im- prove prediction accuracy and reduce power penalty. Idle detect is one simple but effec- tive prediction method proposed in power gating [75], based on the observation that if a unit is idle in the previous time window, it probably would stay in idle state. In [78], a global phase history table is used to track different execution phases in microprocessor for dynamic power management. In deterministic clock gating [101], the author utilized deterministic information available a few cycles ahead to realize zero overhead clock gating. In our research, we utilize these prior approaches to power gate SIMT lanes but we then enhance and adapt them within the execution context. Power gating: Power gating techniques has been widely applied in processors [75, 111], caches [58], and NOCs [48]. Prior work on GPU power gating focused on gating a coarser granularity of SM cores [156] or an entire SP cluster based on instruction types [26]. In our work, we propose lane-level power gating and we compare with [26] to show that PATS++ can exploit lane level idleness for improving power efficiency further. GPU power saving: Various researchers have looked at orthogonal approaches to GPU power efficiency. Several works [25, 64, 171] proposed to save register file power. 79 Leng [99] saves dynamic power of the execution units and register file by clock gating and DVFS. However, our research mainly targets reducing static power by using lane- level power gating. GPU schedulers: GPU scheduler has been extensively studied to improve gpu per- formance. Two-level scheduler [64] improved energy efficiency of scheduler by keeping only a small subset of active warps in the active queue. Narasiman [120] has proposed another Two-level scheduler that divides large warps into smaller fetch group. Jog [86] proposed warp scheduler which could enable efficient prefetching policies. Rogers [133] proposed a cache locality based warp scheduler to improve performance. PATS can be built on top any scheduler although our implementation used two-level scheduer. Power aware schedulers: Power aware schedulers for CPUs and multicore sys- tems [35, 135] have been studied extensively. Previous works focused on dynamic run- time power management and DVFS. Our research utilized the special control flow diver- gence patterns to reduce GPU static power. 
3.7 Chapter Summary GPUs provide a power efficient execution platform for many massively parallel through- put oriented applications. But when general purpose applications are ported to run on GPUs these applications suffer from inefficient resource utilization due to branch and memory divergence. This chapter focuses on two aspects of this problem. First, we study the branch divergence behavior of various GPU workloads and show that branch divergence patterns exhibit strong bias and only a few patterns dominate the execution in many workloads. We then exploit this knowledge to effectively power gate SIMT lanes within a GPU. We propose pattern aware warp scheduler that increases the idleness time of a SIMT lane. We design and evaluate PATS and an enhanced PATS++ approaches to show that the proposed approaches outperform prior GPU power gating techniques. 80 Chapter 4 Warped-Slicer: Intra-SM Slicing for Efficient Concurrent Kernel Execution on GPUs 4.1 Chapter Overview As we discussed in Chapter 1, each new generation of GPUs has delivered more powerful theoretical throughput empowered by ever-increasing amount of execution resources [15, 19, 20, 18]. Traditional graphics-oriented applications are successful in exploiting the resource availability to improve throughput. With the advent of new programming mod- els, such as OpenCL [117] and CUDA [14], general purpose applications are also re- lying on GPUs to derive the benefits of power-efficient throughput computing. How- ever, as we showed in Chapter 2 resource demands across general purpose applications can vary significantly, leading to the widely-studied issue of GPU resource underutiliza- tion [84, 123, 25, 81, 26, 97, 173, 67, 28, 103, 164]. The previous chapter proposed to power gate the idle execution lane to improve en- ergy efficiency. In this chapter, we seek to resolve the resource underutilization issue in the context of multiprogramming. We explore various intra-SM slicing strategies that slices resources within each SM to concurrently run multiple kernels on the SM. Intra- SM slicing can resolve concurrency and intra-SM resource underutilization. Our results show that there is not one intra-SM slicing strategy that derives the best performance for 81 all application pairs. Based on several micro-architectural characterizations that the ap- plication pairs show under various intra-SM slicing strategies, we built a profiling-based dynamic intra-SM slicing strategy. Using the execution statistics that are collected dur- ing a short profiling phase, the proposed method dynamically selects an appropriate SM slicing strategy for the given application pair. Intra-SM slicing improved performance by 23% over the baseline multiprogramming approach with minimal hardware overhead. 4.1.1 GPU Multiprogramming GPU Multiprogramming enables kernels from diverse applications to be concurrently executed on a GPU. Several new design features of recent generations of GPUs are en- couraging this trend. The HSA foundation co-led by AMD introduced a queue-based multiprogramming approach for heterogeneous architectures that include GPUs [60, 134]. NVIDIA also introduced concurrent kernel execution (CKE) that allows multiple kernels of an appli- cation to share a GPU. For instance,Hyper-Q was introduced by NVIDIA for the Kepler architecture which enables kernels to be launched to the GPU via 32 parallel hardware queue streams [19]. 
These hardware mechanisms use a Left-Over policy that assigns as many resources as possible for one kernel and then accommodates another kernel if there remain sufficient resources [50]. These simple policies enable concurrent execution of kernels only opportunistically. Inspired by the support for concurrent execution, several researchers have proposed microarchitectural and software-driven approaches to concurrently execute kernels from different applications more aggressively [84, 123, 173, 28]. These studies showed that concurrent kernel execution can be beneficial for improving resource utilization espe- cially when the kernels have complementary characteristics. For example, when a compute- intensive kernel and a memory-intensive kernel from different applications share a GPU, 82 both pipeline and memory bandwidth are well utilized without compromising either ker- nel’s performance. Several models [84, 173, 103] have been proposed to find the opti- mal pair of kernels that can be executed concurrently for better performance and energy efficiency. Software-driven approaches have used kernel resizing to maximize the op- portunity for concurrent execution, even within the current Left-Over policy. Kernel slicing [173] partitions a big kernel into smaller kernels so that any single kernel does not consume all available resources. Elastic kernel [123] runs a function that dynami- cally adjusts kernel size by using resource availability information. Refactoring kernels and rewriting application code to improve concurrency have shown that there are signifi- cant performance advantages with concurrent execution. However, it may not be feasible or desirable to modify and recompile every application for improving concurrency. To reap the benefits of concurrent kernel execution more broadly, in this chapter we focus on hardware-driven multiprogramming approaches, where kernels from diverse applications can be automatically launched to the GPU without software modifications. One hardware-driven multiprogramming approach is spatial multitasking [28], a mi- croarchitectural solution that splits the streaming multiprocessors (SMs) in a GPU into at least two groups and allows each SM group to execute a different kernel. Unlike the Left- Over policy, which allows multiple kernels to run only if there is enough space, spatial multitasking enables at least two different kernels to concurrently run on a GPU with- out prioritizing just one kernel over the other. As different applications have their own dedicated set of SMs, we refer to this approach as inter-SM slicing. Inter-SM slicing is a simple way to preserve concurrency with minimal design changes. While spatial mul- titasking enables better utilization of GPU-wide resources such as memory bandwidth, they do not address the resource underutilization issue within an SM. For example, if a concurrent thread array (CTA) within a kernel requires 21% of the shared memory then only four CTAs can be launched on the SM which leaves 16% of the shared memory to be wasted. Thus, if the available resource in an SM is not an integer multiple of the required resource of a CTA, there will be resource fragmentation. 83 Inspired by these challenges, this chapter explores another approach to resolve re- source underutilization issue within an SM while preserving concurrency. We propose Warped-Slicer, which is a technique that enables efficient sharing of resources within an SM across different kernels. 
For example, two compute-intensive kernels can be con- currently run on an SM without compromising their performance if each of the kernels has computation intensity in different kinds of instructions such as an ALU-intensive application and an SFU-intensive application. Each kernel may have different resource demand and performance behavior, and we try to minimize the performance impact suf- fered by each kernel when multiple kernels are assigned to the same SM. Lee et al. [97] discussed the potential benefits if one were to support intra-SM slicing. Since the focus of their paper is to design thread block scheduling policies, they did not explore microar- chitectural design challenges of concurrently executing multiple kernels on the SM and what are the best policies for resource assignment between the two kernels. We first present a scalable intra-SM resource allocation algorithm across any number of kernels. The goal of this algorithm is to allocate resource to kernels so as to maximize resource usage while simultaneously minimizing the performance loss seen by any given kernel due to concurrent execution. This algorithm is similar to the water-filling algo- rithm [124] that is used in communication systems for equitable distribution of resources. We present the algorithm assuming we have oracle knowledge of each application’s per- formance versus resource demands. We then show that we can approximate the oracle knowledge by doing short on-line profiling runs to collect these statistics. Next, we de- scribe how the profiling can be done efficiently to identify when to use inter-SM slicing and when to activate intra-SM slicing. Through extensive evaluation, we show that the proposed dynamic partitioning technique significantly improves the overall performance by 23%, fairness by 26% and energy by 16% over the baseline Left-Over policy. The rest of this chapter is organized as follows: Section 4.2 describes the simula- tion methodology and the motivation analysis. Section 4.3 proposes a GPU intra-SM 84 resource partitioning strategy for multiple kernels which profile and identifies the per- formance sweet spot rather than partition the resources blindly. Section 4.4 presents the GPU water-filling algorithm, the online profiling technique and a simple performance prediction model to determine optimal resource partitioning for given workloads. Sec- tion 4.5 presents the evaluations. We discuss related work in section 4.6. 4.2 Methodology and Motivation 4.2.1 Methodology In this section, we first show motivational data regarding how resource utilization varies across different application categories and how different applications face different hur- dles to reduce their stall times. We used GPGPU-Sim v3.2.2 [34] in our evaluation and our configuration parameters are described in Table 4.1. The GPGPU-Sim front-end is extensively modified to allow multiple processes to concurrently share the same execu- tion back-end. Parameters Value Compute Units 16, 1400MHz, SIMT Width = 16x2 Resources / Core max 1536 Threads, 32768 Registers max 8 CTAs, 48KB Shared Memory Warp Schedulers 2 per SM, default gto L1 Data Cache 16KB 4-way 64MSHR L2 Cache 128KB/Memory Channel, 8-way Memory Model 6 MCs, FR-FCFS, 924MHz GDDR5 Timing t CL =12,t RP =12,t RC =40, t RAS =28,t RCD =12,t RRD =6 Table 4.1: Baseline configuration 85 Application Abbr. Inst. Reg. Shm. 
Application | Abbr. | Inst. | Reg. | Shm. | ALU | SFU | LS | Griddim | Blkdim | L2 MPKI | Type | Profile%
Blackscholes [14] | BLK | 0.9B | 95% | 0% | 48% | 73% | 84% | 480 | 128 | 51.3 | Memory | 0.7%
Breadth First Search [46] | BFS | 0.6B | 71% | 0% | 14% | 6% | 46% | 1954 | 512 | 84.4 | Memory | 5%
DXT Compression [14] | DXT | 1.2B | 56% | 33% | 47% | 11% | 21% | 10752 | 64 | 0.03 | Compute | 0.25%
Hotspot [46] | HOT | 0.7B | 84% | 19% | 41% | 22% | 75% | 7396 | 256 | 5.8 | Compute | 0.36%
Image Denoising [14] | IMG | 1.7B | 43% | 0% | 81% | 30% | 11% | 2040 | 64 | 0.3 | Compute | 0.14%
K-Nearest Neighbor [46] | KNN | 0.4B | 37% | 0% | 14% | 26% | 42% | 2673 | 256 | 100.0 | Memory | 6%
Lattice-Boltzmann [143] | LBM | 0.2B | 98% | 0% | 7% | 1% | 100% | 18000 | 120 | 166.6 | Memory | 0.25%
Matrix Multiply [143] | MM | 0.6B | 86% | 5% | 52% | 1% | 34% | 528 | 128 | 1.7 | Compute | 0.25%
Matrix Vector Product [143] | MVP | 0.2B | 74% | 0% | 9% | 7% | 96% | 765 | 192 | 89.7 | Cache | 0.9%
Neural Network [34] | NN | 0.9B | 94% | 0% | 43% | 22% | 89% | 54000 | 169 | 3.7 | Cache | 0.25%
Table 4.2: Resource utilization fluctuates across 10 GPU applications (arithmetic mean of all cores across total cycles).

We studied a wide range of GPU applications from the image processing, math, data mining, scientific computing and finance domains. These applications are from the CUDA SDK [14], Rodinia [46], Parboil [143] and ISPASS [34] benchmark sets, as summarized in Table 4.2. To quantify the resource utilization of each benchmark, we ran each benchmark in isolation for two million cycles without any multiprogramming.

4.2.2 Motivational Analysis

Table 4.2 shows the resource utilization of the target applications. We chose large input sizes to avoid GPU resource underutilization due to insufficient input. The grid and block dimensions used in each application are shown in the table, labeled Griddim and Blkdim, respectively. The Inst. column shows the total number of instructions executed during the two million cycles for each benchmark. We measured the register (labeled Reg) and shared memory (Shm) demand of the applications. This information can be obtained at compile time without any simulations. We also measured the average utilization of the functional units (ALU, special function units SFU, and load/store units LS) while executing each application. As shown, applications have diverse resource usage. KNN uses 37% of registers while LBM utilizes 98%. IMG barely uses LS resources, while MVP consumes 96% of LS resources. On average, 70% of registers, 6% of shared memory, 31% of ALU units, 11% of SFUs and 49% of LD/ST units are utilized.
The Type column classifies the benchmarks as memory or compute intensive based on whether the L2 misses per kilo warp instructions executed (labeled L2 MPKI) is high (≥ 30) or low. We chose 30 as the threshold because there is a large gap (10-50) between the high-L2-MPKI benchmarks and the low-L2-MPKI benchmarks. The Cache type denotes the L1 Cache Sensitive applications described in Section 4.4. The Profile% column will be discussed later.

Figure 4.1: Fraction of total cycles (of all cores) during which warps cannot be issued due to different reasons.

We measured the fraction of cycles when an application is stalled because no new warps can be issued due to a variety of structural and functional reasons. Figure 4.1 shows the fraction of the total execution cycles during which no warps are executed due to various stalls. Long memory latency stalls and execution stage resource stalls (the required functional unit is unavailable) together waste 40% of GPU cycles.
There are also short RAW stalls (read-after-write dependencies). The i-buffer empty stalls occur when the warps are waiting for the next instruction to be fetched. These results broadly confirm the motivation data presented in Chapter 1. But as shown in this chapter, not all applications suffer the same set of bottlenecks. For instance, DXT is mostly waiting for the instruction fetch, while BFS is waiting for responses from memory.
The fact that different applications demand different resources (as shown in Table 4.2) and different applications are stalled by different constraints (as shown in Figure 4.1) suggests that there is a potential for improving performance by combining application pairs that have differing resource needs and stall reasons. For instance, we can choose DXT and BFS to co-locate in the same SM. Furthermore, concurrent kernel execution in the same SM can get around the design-imposed limits on how many thread blocks can be launched from a given kernel. For example, the maximum number of CTAs allocated to an SM is limited by the total number of available registers, the shared memory size, the available warps, and the maximum CTA count allowed by the GPU. A majority of kernels are limited by only one of these limits [170]. When a given resource (e.g., the register file) is underutilized by one kernel because it has reached the usage limit on a different resource (e.g., the shared memory), it is possible to co-locate another kernel that places very little demand on the shared memory but can make use of the unused registers.
In this chapter, we propose to identify the best multiprogramming approach for a given set of kernel types, without any software modifications. Proper resource partitioning across multiple co-located kernels within an SM is a challenging problem. One of the challenges is that application performance does not necessarily improve proportionally to the amount of resources assigned to it. The relationship between performance and resource allocation is mostly non-linear and sometimes even non-convex, as shown later in Figure 4.3a. Therefore, designing an efficient algorithm for GPUs to optimize SM resource allocation under various constraints can be a challenging task. To the best of our knowledge, none of these challenges, policies and performance prediction models have been studied in depth in prior works. There has been a plethora of work in the CPU space to enable efficient concurrent application execution by equitable sharing of resources such as the last level cache and memory bandwidth [141, 43, 42, 91, 128, 49, 118, 157, 74]. But in a GPU, the size of the register file far exceeds the size of the cache and the number of execution lanes is at least an order of magnitude larger than in a CPU. Hence, GPUs present unique challenges for intra-SM sharing, which must be addressed.

4.3 Intra-SM Slicing

Figure 4.2: Illustration of proposed storage resource allocation strategies for improving resource fragmentation (panels: FCFS allocation, Left-Over allocation, even partitioning, uneven fixed partitioning, and re-partitioning for the third kernel).

In this section, we first discuss some intra-SM slicing approaches that we consider for evaluation in this chapter. We illustrate these approaches in Figure 4.2.
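To make the fragmentation argument concrete before walking through each strategy, the short C++ sketch below counts how many CTAs of two kernels fit into a fixed shared-memory budget under an even split versus one possible uneven split. Only the 48 KB total mirrors the baseline configuration in Table 4.1; the per-CTA footprints (10 KB for kernel A, 9 KB for kernel B) and the particular uneven split are illustrative assumptions, not measurements from the benchmarks above.

```cpp
#include <cstdio>

// Illustrative per-CTA shared-memory footprints; only the 48 KB total matches Table 4.1.
constexpr unsigned kShmemTotal   = 48 * 1024;
constexpr unsigned kShmemPerCtaA = 10 * 1024;
constexpr unsigned kShmemPerCtaB = 9 * 1024;

struct Fit { unsigned ctas; unsigned wasted; };

// How many CTAs fit in a given share, and how many bytes are left fragmented.
static Fit fit(unsigned share, unsigned perCta) {
    return Fit{share / perCta, share % perCta};
}

int main() {
    // Even partitioning: each kernel is limited to half of the shared memory.
    Fit evenA = fit(kShmemTotal / 2, kShmemPerCtaA);   // 2 CTAs, 4 KB wasted
    Fit evenB = fit(kShmemTotal / 2, kShmemPerCtaB);   // 2 CTAs, 6 KB wasted

    // One uneven partition (30 KB / 18 KB): more CTAs, no fragmentation.
    Fit unevenA = fit(30 * 1024, kShmemPerCtaA);       // 3 CTAs, 0 wasted
    Fit unevenB = fit(18 * 1024, kShmemPerCtaB);       // 2 CTAs, 0 wasted

    std::printf("even  : %u + %u CTAs, %u bytes fragmented\n",
                evenA.ctas, evenB.ctas, evenA.wasted + evenB.wasted);
    std::printf("uneven: %u + %u CTAs, %u bytes fragmented\n",
                unevenA.ctas, unevenB.ctas, unevenA.wasted + unevenB.wasted);
    return 0;
}
```

Even with these toy numbers, the uneven split packs five CTAs with no waste while the even split packs only four and fragments 10 KB; this is precisely the effect that the uneven, per-kernel partitioning described below is designed to exploit.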
The resource allocation strategies we discuss here are equally applicable to the register file, shared memory and threads; for simplicity, we use shared memory in this section as an example. We assume that two kernels named A and B are running together on the same SM and each of kernel A's CTAs requests only 50% of the shared memory that is required by a CTA of kernel B. The outermost box graphically represents the total shared memory in an SM. The shared memory required by a CTA from kernel A is represented as a rectangle while the shared memory required by a CTA from kernel B is represented as a larger square.
In Figure 4.2a, shared memory is allocated in a First-Come-First-Serve (FCFS) manner. For example, as shown in the figure, if CTAs of kernels A and B are assigned to an SM in an interleaved manner, then kernel A's and kernel B's shared memory allocations are interspersed in the shared memory. When a CTA from kernel A finishes (a gray-colored rectangle), the deallocated shared memory region is not large enough to fit a CTA from kernel B. As a result, all the shared memory originally assigned to A will be fragmented after kernel A terminates and those regions cannot be used for the newly arriving CTAs of kernel B.
The second strategy to consider is the concurrent kernel execution approach that uses the Left-Over allocation strategy, shown in Figure 4.2b. Under Left-Over, kernel A is given all the shared memory it needs, and only when it does not need any more resources is the remaining memory assigned to kernel B. Only when two adjacent CTAs of kernel A finish can a new CTA from B take the resources used by those two CTAs. Note that when the first CTA from A finishes execution, one has to wait until the second CTA from A finishes before a CTA from B can be assigned.
The third strategy to run two kernels is to apply even spatial partitioning [28]. Spatial multitasking was previously proposed for assigning different sets of SMs to different kernels. We use the same approach for intra-SM slicing: we evenly split the resources across the two kernels, so half of the register file and shared memory is given to each kernel. As shown in Figure 4.2c, kernels A and B are each assigned half of the shared memory region from the beginning of the execution; the left half of the shared memory is dedicated to kernel A and the right half is reserved for kernel B. Whenever a CTA from A terminates, a new CTA from A is assigned. However, the even split may limit the shared memory usage. For example, although there are remaining resources on the right half that could accommodate CTAs of kernel A, kernel A cannot use those resources.
The last strategy, which is proposed in this chapter, is Warped-Slicer. At the beginning, when two kernels start running on the GPU, we determine the best partition of the register and shared memory resources. To maximize the resource utilization, the partition can assign more registers to kernel A and provide more shared memory to kernel B. After the initial partition is done, a CTA from kernel A can only replace another CTA from kernel A.
Figure 4.2e illustrates how the proposed Warped-Slicer policy can be easily extended to more than two kernels. When a third kernel arrives, we launch a new resource repartitioning phase for the three kernels. Then the GPU runtime reallocates some of the currently used resources for the third kernel C (the gray-colored rectangle). From that point on, kernel A and kernel B will issue no more CTAs to use the marked resources.
Kernel C will then start to execute once the assigned resources are freed by A and B.
In this chapter, we evaluate the strategies described in this section in depth. As we show later in our results section, Warped-Slicer is significantly better than even spatial partitioning. However, Warped-Slicer requires a way to estimate the resource allocations across multiple kernels that maximize the resource usage and improve the cumulative performance. In the next two sections, we present two performance prediction models for achieving this goal.

4.4 Intra-SM Resource Partitioning Using Water-Filling

In this section, we present an analytical method for calculating the resource partitioning that maximizes performance. We present the analytical model assuming we have full knowledge of each application's performance versus resource demands. Such oracle knowledge may be gained, for instance, by running each application with varying amounts of resources and measuring the performance. In the next section, we show how to realistically collect simple microarchitectural statistics to replace the oracle knowledge.

Figure 4.3: (a) Performance vs. increasing CTA occupancy in one SM, (b) identifying the performance sweet spot.

Before presenting the details of the algorithm, we show an approach for classifying applications based on their performance scalability with thread-level parallelism. This classification is used by the partitioning algorithm later. Figure 4.3a shows how the application performance varies when the number of CTAs assigned to an SM increases. The X-axis shows the number of CTAs allocated as a fraction of the maximum allowed CTAs for a given benchmark. In our experiments, a maximum of eight CTAs can be assigned to an SM. However, some benchmarks may need more resources than an SM provides to execute eight CTAs; in that case, the maximum allowed CTA count can be fewer than eight. The Y-axis shows the IPC of the application normalized to the best IPC the application achieves. The behavior diverges across different applications [70, 90, 97, 136]. The resulting graphs can be empirically classified into the following categories (a simple programmatic version of this classification is sketched after the list).
• Compute Intensive-Non Saturating: The performance continues to increase as more CTAs are assigned to an SM. HOT falls into this category.
• Compute Intensive-Saturating: The performance continues to increase with CTAs but then it saturates. This behavior is shown by benchmarks such as IMG. The performance saturation could be due to pipeline stalls on RAW dependencies.
• Memory Intensive: These benchmarks, such as BLK, also exhibit increasing performance with CTA count but they saturate rather quickly. If an application is memory intensive (the number of L2 cache misses as a fraction of the total number of instructions executed is large), then the performance saturates much more quickly than for applications in the compute intensive-saturating category.
• L1 Cache Sensitive: L1 cache sensitive applications continue to increase their performance with CTA count up to the point when the L1 cache is filled up. At that point, adding more CTAs to the SM results in L1 cache thrashing and performance degradation. This is the case with both the NN and MVP benchmarks.
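The categorization above could also be derived automatically from the profiled IPC-versus-CTA curve. The C++ sketch below shows one way to do so; the function name and the 0.95/1.02 thresholds are illustrative assumptions, while the MPKI threshold of 30 follows the memory/compute split used in Table 4.2.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Classify a kernel's scaling behavior from its normalized IPC at 1..N CTAs per SM
// (each value normalized to the best point of the curve). The 0.95 and 1.02
// thresholds are illustrative guesses; only the L2 MPKI cut-off of 30 follows the
// memory/compute classification of Table 4.2.
std::string classify_scaling(const std::vector<double>& ipc, double l2_mpki) {
    const std::size_t n = ipc.size();
    const std::size_t best = static_cast<std::size_t>(
        std::distance(ipc.begin(), std::max_element(ipc.begin(), ipc.end())));

    // Performance falls off after its peak: additional CTAs thrash the L1 cache.
    if (best + 1 < n && ipc.back() < 0.95 * ipc[best])
        return "L1 Cache Sensitive";

    // Still clearly improving at the maximum CTA count.
    if (best == n - 1 && n >= 2 && ipc[n - 1] > 1.02 * ipc[n - 2])
        return "Compute Intensive - Non Saturating";

    // Otherwise the curve saturates; memory-intensive kernels saturate early.
    return (l2_mpki >= 30.0) ? "Memory Intensive"
                             : "Compute Intensive - Saturating";
}
```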
The goal is to find the resource distribution across the two applications so as to achieve optimal performance when the two applications are combined to run on the same SM. Since GPUs allocate resources at the CTA level, resource distribution can be translated into how many CTAs from each application are assigned to an SM. Figure 4.3b illustrates visually how many CTAs are assigned to each of the two applications. In this illustration, we select two applications: IMG, which is a Compute Intensive-Saturating application, and NN, which is in the Cache Sensitive application category. We plot IMG's resource occupancy versus performance on the primary X-axis. We then plot a mirrored image of the NN plot on the secondary X-axis. The NN graph shows how its performance varies as resource occupancy decreases from 100% to 0%. By plotting the two graphs in this manner, we see that the total use of resources from the two applications is always equal to 100% at any given X-axis point. This figure clearly illustrates why even partitioning of the SM resources to these two applications is sub-optimal; the performance of NN is maximized, but IMG suffers a massive 30% performance loss compared to the peak achievable IPC. On the other hand, if we select 60% of the resources for IMG and 40% for NN, then IMG and NN each suffer only a 10% performance loss compared to the peak performance achieved when each application is executed sequentially. Thus, we can maximize the benefits of running the two applications concurrently.
Based on the intuition provided above, we propose an optimization model that relies on the performance and resource utilization data to find the best concurrent execution approach. We find that there exists a sweet spot where the performance degradation of each application is minimized when running both applications concurrently. The sweet spot partitioning is captured by the following optimization function:

max min_i P(i, T_i)   subject to   Σ_{i=1}^{K} R_{T_i} ≤ R_tot      (4.1)

where P(i, T_i) is the performance of application i, normalized to its maximum achievable performance, when T_i CTAs are assigned to the application, and K is the number of applications sharing the SM. R_{T_i} is the resource requirement of T_i CTAs. The sum of all the resource requirements should be less than the total resources available in an SM (R_tot). Thus, the optimization tries to find the minimum performance loss across all the applications assigned to an SM, subject to the constraint that the total resource usage does not exceed the available SM resources.

Algorithm 1: Water-Filling Partitioning Algorithm, Part 1
R_L = R_tot                          ▷ R_L is the total resource left
▷ P_{i,j} stores the performance of kernel i with j CTAs
▷ Q_{i,d} stores the maximum performance achievable with at most j CTAs; elements of the same value are not stored
▷ M_i stores the associated number of CTAs that lead to Q_i
▷ Initialize the vectors for all kernels
for i = 1 ... K do                   ▷ K is the max number of kernels
    max = 0; d = 0                   ▷ d is the index into Q and M
    for j = 1 ... N do               ▷ N is the max number of CTAs
        if P_{i,j} > max then
            max = Q_{i,d} = P_{i,j}; M_{i,d} = j; d++
        end if
    end for
    T_i = 1; g_i = 1                 ▷ T_i is the number of CTAs assigned to kernel i; g_i points to the current resource allocation in M and Q; initially a minimum of 1 CTA is allocated to each kernel
    R_L = R_L − R_i
end for

The detailed algorithmic implementation is shown in Algorithm 1 and Algorithm 2. We use two vectors, Q_i and M_i. Q_i stores the incremental best performance achieved by running an increasing number of CTAs from kernel i.
M_i maintains the number of CTAs that achieves the performance stored in Q_i, and g_i is the index pointing to the current allocation of resources in M_i and Q_i. Initially, each kernel is assigned one CTA. Then, in each iteration, we identify the kernel that loses the most performance compared to its peak achievable performance, and we assign it the minimum number of additional CTAs that improves its performance. We use T_i CTAs for kernel i as the best SM partition strategy. T_i is iteratively updated to find the optimal number of CTAs that minimizes the performance loss due to concurrent execution across all applications. K is the number of applications sharing the SM and N is the maximum concurrent number of CTAs allowed by an SM.

Algorithm 2: Water-Filling Partitioning Algorithm, Part 2
while R_L >= 0 do
    find = false; mp = MAX           ▷ mp: minimum performance
    for i = 1 ... K do
        if not Full(i) and Q_{i,g_i} < mp then
            find = true; mp = Q_{i,g_i}; S = i    ▷ S is the selected kernel with minimum performance, to receive the next CTAs
        end if
    end for
    if not find then break
    end if
    dT = M_{S,g_S+1} − M_{S,g_S}     ▷ dT: minimum number of CTAs required for an incremental performance increase
    ▷ R_S is the resource required to allocate one CTA of the selected kernel
    if R_L >= R_S × dT then
        R_L = R_L − R_S × dT; g_S++; T_S = T_S + dT
    else
        Full(S) = true               ▷ No more resources should be allocated to kernel S
    end if
end while

The time and space complexities of Algorithm 1 are both O(KN), which is superior to a brute-force implementation, whose complexity is O(N^K). Note that the model described above is inspired by the water-filling algorithm [124], which is used extensively in communication systems for distributing resources such as bandwidth across multiple competing users. However, the water-filling algorithm [124] uses continuous functions while our proposed solution is discrete.
One potential issue with this algorithm is that it only tries to minimize the performance loss across various CTA combinations. If a performance loss upper bound is not set, some applications may lose too much performance due to concurrent execution. Therefore, we disband the co-location of multiple kernels in the same SM when the performance loss exceeds a threshold; in that case, we simply fall back on spatial multitasking. We set the performance loss threshold of any single kernel to 1/K × 120% if K kernels are concurrently sharing an SM. As we will show later in Section 4.5, only two pairs of applications chose spatial multitasking over intra-SM partitioning. And even with a higher threshold value, the majority of the application pairs gained significant performance benefits from intra-SM partitioning.

4.4.1 Profiling Strategy

Recall that the water-filling algorithm's description in the previous section relies on the availability of performance versus the number of CTAs for each kernel. However, in practice, the impact of CTA count on performance is not available a priori. Hence, in this section we present a simple hardware-based dynamic profiling strategy to estimate the performance versus CTA allocation for each kernel. In essence, we compute Q_i and M_i for each application based on a short runtime profile rather than the entire application run. One of the unique aspects of a GPU design is that there is a plethora of identical SMs. We utilize these parallel SMs to measure the performance impact of varying CTA count for each of the K kernels that are being co-located in an SM.
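Before describing how the profile itself is gathered, the following host-side C++ sketch transcribes Algorithms 1 and 2 so the partitioning step is concrete. The data layout (a KernelState struct, P given as a dense matrix, a single scalar resource requirement R per CTA) is a simplification for illustration; the hardware implementation tracks registers, shared memory and thread slots separately, and the check for a kernel that has already reached its best profiled point is an added detail not spelled out in the pseudocode.

```cpp
#include <cstddef>
#include <vector>

// Per-kernel state: the monotone performance "steps" (Q, M) built from the
// profile, plus the running allocation state used by the water-filling loop.
struct KernelState {
    std::vector<double> Q;  // strictly increasing best performance values
    std::vector<int>    M;  // CTA count that first achieves each Q entry
    std::size_t g = 0;      // index of the step currently reached
    int  T = 1;             // CTAs assigned so far (each kernel starts with 1)
    bool full = false;      // no more resources may be given to this kernel
};

// P[i][j-1]: profiled normalized IPC of kernel i when running j CTAs (j = 1..N).
// R[i]     : resource demand of one CTA of kernel i (e.g., shared-memory bytes).
// Rtot     : total amount of that resource in one SM.
// Returns the number of CTAs assigned to each kernel.
std::vector<int> water_fill(const std::vector<std::vector<double>>& P,
                            const std::vector<double>& R, double Rtot) {
    const std::size_t K = P.size();
    std::vector<KernelState> k(K);
    double left = Rtot;

    // Part 1: build (Q, M) per kernel and give every kernel one CTA up front.
    for (std::size_t i = 0; i < K; ++i) {
        double best = 0.0;
        for (std::size_t j = 0; j < P[i].size(); ++j) {
            if (P[i][j] > best) {                 // keep only improving points
                best = P[i][j];
                k[i].Q.push_back(best);
                k[i].M.push_back(static_cast<int>(j) + 1);
            }
        }
        left -= R[i];                             // one CTA per kernel initially
    }

    // Part 2: repeatedly grow the kernel that currently performs worst.
    while (left >= 0.0) {
        std::size_t S = K;
        double minPerf = 1e300;
        for (std::size_t i = 0; i < K; ++i) {
            bool atLastStep = (k[i].g + 1 >= k[i].Q.size());
            if (!k[i].full && !atLastStep && k[i].Q[k[i].g] < minPerf) {
                minPerf = k[i].Q[k[i].g];
                S = i;
            }
        }
        if (S == K) break;                        // nobody can be improved further

        int dT = k[S].M[k[S].g + 1] - k[S].M[k[S].g];  // CTAs to reach next step
        if (left >= R[S] * dT) {
            left -= R[S] * dT;
            ++k[S].g;
            k[S].T += dT;
        } else {
            k[S].full = true;                     // next step does not fit
        }
    }

    std::vector<int> T(K);
    for (std::size_t i = 0; i < K; ++i) T[i] = k[i].T;
    return T;
}
```

With the Q and M vectors built from the short samples described next, the loop adds CTAs to whichever kernel is currently farthest below its own peak, which is the discrete analogue of pouring water into the lowest part of the vessel.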
We use two kernels as an illustration to describe our profiling approach. However, our profiling technique is applicable to any number (K) of kernels. As shown in Figure 4.4, we first divide the available SMs equally between the two kernels. We then assign a sequentially increasing number of CTAs from each of the two kernels to its allocated SMs. For example, from kernel 1 we assign 1 CTA to SM0, 2 CTAs to SM1, 3 CTAs to SM2, and so on. Similarly, from kernel 2 we assign 1 CTA, 2 CTAs, 3 CTAs, each to a different set of SMs. For instance, if a GPU has 16 SMs and each SM can accommodate 8 CTAs, then kernel 1 and kernel 2 will each use 8 SMs during the profile phase, and each SM will run anywhere from 1 to 8 CTAs. Note that this approach is scalable to multiple kernels by simply time-sharing one SM to run with a different number of CTAs sequentially and then collecting the Q_i and M_i for each kernel.
We then employ a sampling phase to measure the IPC of each SM as it executes a given number of CTAs in isolation for 5K cycles. While the L1 cache miss rate will not be impacted by application executions on other cores, the L2 and memory accesses are shared across all SMs. The SM with more CTAs tends to consume higher memory bandwidth. As a result, the sampling phase may not accurately measure the performance of each application when running with a given number of CTAs on a GPU in isolation. To solve this problem, we design a scaling factor, inspired by recent work [85], to offset this memory bandwidth contention penalty. We found the scaling factor to be highly effective in predicting performance while accounting for differing bandwidth demands across the SMs. Recently, Jog and Kayiran observed that the IPC, DRAM bandwidth and L2 MPKI have the following relationship for memory-intensive applications on a GPU [85]:

IPC ∝ BW / MPKI      (4.2)

Based on Equation 4.2, we design a weight factor to offset the imbalance problem as follows:

IPC_scaled = IPC_sampled × factor
factor = 1 + δ · α_mem      (4.3)
δ = (B_scaled × MPKI_sampled) / (B_sampled × MPKI_scaled) − 1

where IPC_sampled, B_sampled and MPKI_sampled are the IPC, bandwidth and L2 cache miss rate measured during the sampling period. IPC_scaled is the projected IPC based on the adjusted new bandwidth B_scaled and the new L2 cache miss rate MPKI_scaled. α_mem is the portion of the total sampling cycles in which pipeline stalls are caused by long memory latency.
Empirically, we observed that L2 MPKI changes minimally with the number of CTAs. Intuitively, MPKI measures misses normalized to a thousand instructions; hence, irrespective of the number of CTAs, MPKI appears to be fairly stable in our empirical observations. Hence, δ is directly proportional to the bandwidth usage factors. The amount of bandwidth consumed is proportional to the number of CTAs assigned to an SM.
As a result, for each SM we can simplify the computation as follows: CTA i CTA avg 1 (4.4) SM 0 SM 1 SM 2 SM 3 SM 4 SM 5 SM 6 SM 7 CTA 0 CTA 5 CTA 1 CTA 2 CTA 3 CTA 0 CTA 4 CTA 5 CTA 1 CTA 2 CTA 3 Kernel 1 Kernel 2 Kernel Aware Thread-block Scheduler Sampler SM 0 SM 1 SM 2 SM 3 SM 4 SM 5 SM 6 SM 7 CTA 0 CTA 9 CTA 7 CTA 0 CTA 4 CTA 5 Kernel 1 Kernel 2 Kernel Aware Thread-block Scheduler Sampler Case1: Choose Intra-SM Slicing at (2,2) SM 0 SM 1 SM 2 SM 3 SM 4 SM 5 SM 6 SM 7 Kernel 1 Kernel 2 Kernel Aware Thread-block Scheduler Sampler Case2: Choose Spatial Multitasking 0 2 3 CTA 1 CTA 8 CTA 2 CTA 3 CTA 1 CTA 7 CTA 2 CTA 3 CTA 4 CTA 6 CTA 7 CTA 8 CTA 9 CTA 6 CTA 7 CTA 8 CTA 9 CTA 1 CTA 4 CTA 12 CTA 14 CTA 10 CTA 11 CTA 13 CTA 0 CTA 1 CTA 4 CTA 12 CTA 14 CTA 2 CTA 5 CTA 7 CTA 15 CTA 3 CTA 6 CTA 8 CTA 9 CTA 0 CTA 10 CTA 11 CTA 13 CTA 2 CTA 5 CTA 7 CTA 15 CTA 3 CTA 6 CTA 8 CTA 9 CTA 4 CTA 5 CTA 6 CTA 6 CTA 8 CTA 9 CTA 10 CTA 11 CTA 12 CTA 13 CTA 10 CTA 14 CTA 15 CTA 11 CTA 12 CTA 13 CTA 14 CTA 15 1 Identify Sweet Point Figure 4.4: Illustration of the proposed profiling strategy used in Warped-Slicer. 98 The entire process flow for using dynamic profiling information to generate the CTA distributions is shown in Figure 4.4. Since profiling is done only for a small fraction of cycles over the entire kernel execution window the overhead of profiling is negligible. The Profile% column in Table 4.2 shows the profiling overhead. The resource partition- ing algorithm isO(KN), which is extremely fast in computing the resource distribution. In all our results we show the net performance after accounting for the profiling overhead. To quantify the impact of using profiled data, rather than the full application run, we compared the number of CTAs assigned to each of the two kernels using the resource partitioning algorithm using both theIPC scaled obtained from sampling and the true IPC obtained from full application runs in isolation. The number of CTAs that were assigned to each of the two kernels is within, at most, one CTA for more than 90% of the kernel pairs. The detailed distribution of CTA allocations across all possible pairs of kernels that we studied in this chapter are shown in Table 4.3. 4.4.2 Dealing with Phase Behavior Figure 4.5: Sampling the program characteristics using a 5K cycles of sampling window. Note that our proposed approach already handles any phase changes between dif- ferent kernels by profiling every new kernel at its launch time. One remaining concern with profiling is that there is an implicit assumption that the performance data collected 99 will stay stable for the entire kernel duration for that kernel. This concern can be re- solved as follows: First, IPC will be monitored during co-execution of a kernel and if the IPC changes significantly and the change is sustained over a long duration (say, at least the length of profile run of 5K cycles), then a new sampling phase may be initi- ated at that point in execution. During this sampling interval, higher priority is assigned to the sampled kernel while holding back other kernels. While this approach does not provide complete isolation, it may be sufficient to rebuildQ i andM i for a kernel with reasonable accuracy. Once the vectors are available we can re-run the resource parti- tioning algorithm to determine the new resource distribution. But more critically, we looked at significant and sustained IPC changes in kernel execution across several GPU benchmarks. 
The α_mem and average IPC collected per SM during the first 5K cycles are highlighted and compared to a much larger 50K-cycle execution window for several benchmarks in Figure 4.5. Evidently, the sampling window can provide a fairly accurate characterization of the entire kernel execution.

4.5 Evaluation

In this section, we evaluate our Warped-Slicer (represented as Dynamic in our figures) and compare it with the three multiprogramming alternatives described earlier: Left-Over partitioning, even partitioning, and spatial multitasking (Spatial) [28, 148, 103]. Note that spatial multitasking is an inter-SM partitioning scheme while the other schemes are intra-SM partitioning schemes. To generate multi-programmed workloads, we created three categories of benchmark pairs by pairing compute, cache and memory application types (see Table 4.2). The three categories are Compute + Cache, Compute + Memory, and Compute + Compute. For each category, we generate all combinations of benchmarks from the two constituent application types. Thus, a total of 30 benchmark pairs were generated. While executing these diverse pairs of benchmarks, one challenge is to make sure that any given application pair executes the same amount of work across the different multiprogramming approaches evaluated in this chapter. To achieve this goal we use the following approach. Recall that for collecting the individual application statistics in Table 4.2 we ran each benchmark for two million cycles. We recorded the total number of instructions executed during those two million cycles for each benchmark in the Inst. column. When running a benchmark pair, we run each benchmark until it reaches that recorded instruction count. Once a benchmark finishes the target instruction count, that benchmark's simulation is halted and its assigned GPU resources are released. The slower benchmark may then consume all the available resources to reach its own instruction target. The total simulation time is treated as the execution time for the application pair. This way, each application simulates the same number of instructions under all configurations.
For Warped-Slicer, we set the profiling phase to be 5K cycles long. We wait 20K cycles for the GPU to warm up before starting the first profiling phase. At the end of the profiling phase, the partitioning algorithm reads the profile data and decides the number of CTAs each application is going to get. We report the profiling cycles relative to the average kernel execution time of each benchmark in the Profile% column of Table 4.2. As shown there, the profiling overhead is minimal for most applications.

4.5.1 Performance

Figure 4.6: Performance results of all 30 pairs of applications: (a) Compute + Cache, (b) Compute + Memory, (c) Compute + Compute. The results are normalized to the baseline Left-Over policy. GMEAN shows the overall geometric mean performance across the three workload categories.

Figure 4.6 shows the IPC of the various multiprogramming approaches normalized to the IPC of the Left-Over policy. The average IPC of concurrently executed kernels is calculated by dividing the sum of all kernels' instruction counts by the execution time until all kernels finish. Note that the Left-Over policy performs very similarly to the sequential execution of the two applications, since the second application will not start execution until the first application is done issuing all of its CTAs.
Overall, the IPC of the Left-Over policy when running the two applications is 3.2% higher than the average IPC of the two applications running sequentially. The Oracle approach is the highest performance we obtained for the application pair among all multiprogramming approaches discussed in this chapter (Left-Over, Spatial and intra-SM slicing). To identify the best results for intra-SM slicing, we exhaustively ran all possible CTA combinations.
On average, all multiprogramming algorithms derived better performance than the baseline Left-Over policy. The proposed Warped-Slicer approach (Dynamic) outperformed the other algorithms and is close to the Oracle results for most applications. Warped-Slicer partitions resources according to the workload's performance-versus-CTA-count relationship as measured during the profiling phase. Spatial multitasking achieved only minimal performance improvement over Left-Over. Spatial partitions resources only across SM boundaries. As explained in the earlier sections, inter-SM slicing can cause resource underutilization within an SM, which cannot be handled by splitting the workload across SMs. As expected, the two intra-SM slicing approaches, Even and Warped-Slicer, derived better performance than Spatial. Warped-Slicer derived an average of 23% performance improvement over the Left-Over policy, which is widely used in current GPUs, 14% better than even partitioning of an SM and 17% better than Spatial.
Table 4.3 shows how differently Warped-Slicer partitions resources compared to the Even approach. When intra-SM multiprogramming is chosen for a workload, the numbers in the parentheses indicate the number of CTAs run by each of the two applications; each application is assigned (required resource per CTA) × (number of CTAs). The workloads that run inter-SM multiprogramming are marked as Spatial, and each application is equally assigned eight SMs. Each application pair has two numbers listed: the first number corresponds to the number of CTAs assigned to the first application and the second number corresponds to the number of CTAs assigned to the second application.

Compute + Cache (Workload | Dyn | Even):
DXT_MVP | (7,1) | (4,4)
DXT_NN | (4,4) | (4,4)
HOT_MVP | (3,1) | (1,4)
HOT_NN | (2,4) | (1,4)
IMG_MVP | (7,1) | (4,4)
IMG_NN | (3,5) | (4,4)
MM_MVP | (5,1) | (2,4)
MM_NN | (3,5) | (2,4)
Compute + Memory (Workload | Dyn | Even):
DXT_BFS | (6,2) | (4,1)
DXT_BLK | (4,4) | (4,4)
DXT_KNN | (4,4) | (4,3)
DXT_LBM | (5,3) | (4,3)
HOT_BFS | Spatial | (1,1)
HOT_BLK | (2,4) | (1,4)
HOT_KNN | (2,4) | (1,3)
HOT_LBM | (3,1) | (1,3)
IMG_BFS | (6,2) | (4,1)
IMG_BLK | (5,3) | (4,4)
IMG_KNN | (4,4) | (4,3)
IMG_LBM | (7,1) | (4,3)
MM_BFS | Spatial | (2,1)
MM_BLK | (3,4) | (2,4)
MM_KNN | (4,4) | (2,3)
MM_LBM | (5,1) | (2,3)
Compute + Compute (Workload | Dyn | Even):
DXT_IMG | (4,4) | (4,4)
HOT_DXT | (2,6) | (1,4)
HOT_IMG | (2,6) | (1,4)
MM_DXT | (3,5) | (2,4)
MM_HOT | (2,2) | (2,1)
MM_IMG | (3,5) | (2,4)
Table 4.3: Resource partitioning when the Warped-Slicer (Dyn) and Even multiprogramming algorithms are used.

In many cases, Warped-Slicer assigns more CTAs than Even. For example, in HOT_DXT of the Compute + Compute category, Warped-Slicer runs two HOT CTAs and six DXT CTAs whereas Even assigns only one HOT CTA and four DXT CTAs. Since each application can use up to 50% of the intra-SM resources in the Even approach, resources may be fragmented if the assigned 50% share is not an integer multiple of the resources required by each application.
On the other hand, Warped-Slicer finds the optimal resource assignment where performance degradation and resource fragmentation are minimal for both applications. In some cases, Warped-Slicer assigns the same total number of CTAs as the Even partition, as can be seen in MM_MVP. For this workload, the total number of CTAs assigned by Even and Warped-Slicer is the same. However, Warped-Slicer assigns fewer CTAs to MVP because MVP is a cache-intensive application; more CTAs can cause cache thrashing, and fewer CTAs improve overall performance. The Even approach, on the other hand, assigns the same number of CTAs to both applications, which leads to worse performance than Warped-Slicer.

4.5.2 Resource Utilization

Figure 4.7: Assorted resource utilization of the proposed Warped-Slicer normalized to the even partitioning policy.

Figure 4.7 shows how Warped-Slicer increases ALU and SFU pipeline utilization, register file utilization and shared memory utilization over the Even partitioning approach. The ALU utilization metric is the fraction of cycles during which an ALU pipeline is occupied out of the total execution cycles. Overall, Warped-Slicer derives over 15% higher utilization of all the GPU resources. This is because Warped-Slicer chooses the optimal resource partitioning that minimizes resource underutilization. On the other hand, Even allows each application to use up to half of each resource, regardless of each application's resource demand. Therefore, the resource utilization is lower due to fragmentation.

4.5.3 Cache Misses

When running multiple applications on an SM, the resource utilization can be improved as shown in the previous sections. However, as each application accesses different regions of the memory space, cache contention might increase. In Figure 4.8, we measure the miss rates in the L1 and L2 caches. We observe different behaviors across the application categories. For Compute + Non-Cache applications, as expected, the L1 cache miss rate is lowest under Left-Over and highest under Even and Warped-Slicer. Interestingly, for Compute + Cache applications, Warped-Slicer achieves the lowest miss rate, which is 2% lower than Even and 17% lower than Left-Over. This is because Warped-Slicer runs fewer cache-intensive CTAs concurrently, which leads to lower L1 cache contention. The L2 cache miss rate shows a slightly different trend because L2 caches are shared across multiple SMs and hence data from different applications can contend in the L2 cache even under the Spatial approach. Consequently, Spatial derived a higher L2 cache miss rate than Left-Over. Warped-Slicer derives the highest L2 miss rate in the Compute + Cache category because its total L2 accesses are reduced significantly, by 43%, due to the lower L1 cache miss rate.

Figure 4.8: Cache miss rates.

4.5.4 Stall Cycles

Figure 4.9: Breakdown of total stall cycles.

Multiprogramming not only enhances pipeline utilization but also reduces the stall cycles caused by resource contention. By running compute-intensive and memory-intensive applications together, the memory congestion can be relieved because memory accesses are generated primarily by only half of the concurrently running warps while the other half of the warps stress the computational resources.
Figure 4.9 shows the various stall cycles under the multiprogramming approaches. As expected, long-latency memory stalls are reduced the most with Warped-Slicer. Note that the stall cycles are also reduced with Spatial multitasking, which does not share SMs, because when memory-intensive and compute-intensive applications are multiprogrammed, the memory-intensive applications are assigned to only half of the total SMs and hence the memory congestion is still relieved. Overall, Warped-Slicer encounters 15% fewer accumulated total stall cycles than Left-Over.

4.5.5 Multiple Kernels Sharing an SM

In this section, we show that our proposed scheme works with more than two kernels assigned to concurrently execute on an SM. As described in our algorithmic description, the proposed approach is general and does not depend on the number of concurrent kernels being considered for execution. We evaluated all the combinations of three applications that contain at least one compute application. BFS and HOT are not included because of their large CTA size, which prevents more than two kernels from being executed. Figure 4.10 shows all the combinations of memory/cache applications working with two compute applications, with the last bar showing the overall performance improvement over all combinations. Warped-Slicer consistently outperforms even partitioning, on average by 21%.

4.5.6 Fairness Metrics

Figure 4.11a shows the minimum speedup across the various configurations evaluated. Compared with Even partitioning, the proposed scheme improves fairness (as measured by the minimum speedup metric) by 14% with 2 kernels and 23% with 3 kernels.

Figure 4.10: Performance result when combining three applications in an SM.

Figure 4.11: Comparison of fairness improvement (normalized to the Left-Over policy) and ANTT reduction between multiprogramming policies: (a) fairness (minimum speedup), (b) average normalized turn-around time.

Figure 4.11b shows the average normalized turnaround time, which is another important fairness metric. Warped-Slicer improves this metric significantly over even partitioning: it improves by 15% when 3 kernels are running on the SM.

4.5.7 Power and Energy Analysis

We use GPUWattch [99] and McPAT [102] for power evaluation. Compared with the baseline Left-Over policy, Warped-Slicer increases average dynamic power by 3.1% due to the increased resource utilization. Overall, however, Warped-Slicer saves 16% of the baseline energy consumption through the significantly reduced total execution time.

4.5.8 Sensitivity Analysis

Figure 4.12: Performance sensitivity to profiling parameters and warp schedulers: (a) sensitivity to profiling length and algorithm delay, (b) sensitivity to warp schedulers.

We also investigated how the length of the profiling phase influences the prediction correctness.
We ran all 30 pairs of applications with Warped-Slicer while varying the profiling length from 5K and 10K cycles up to the total number of cycles of an entire CTA execution. Figure 4.12a shows the IPC under the various profiling lengths, normalized to the IPC when using a 5K-cycle profiling length. Across all the application pairs, the IPC variation is at most 2% with varying profiling lengths.
We then investigated how the algorithm execution delay influences the overall performance. We again ran all 30 pairs of applications with an additional delay, varying from 1K to 5K to 10K cycles, between finishing sampling and starting the new partition. We found that overall the IPC change is less than 1.5%. The time for computing the resource partitioning does not block the warps from executing on the SM. While the decision is being made, the CTAs already issued in the sampling phase can still execute on the machine. Therefore, even when the additional algorithm delay increases to 10K cycles, the performance loss is still minimal.
Figure 4.12b studies the performance impact of different warp schedulers. The IPC and speedup of Warped-Slicer are not impacted by the underlying warp schedulers that were studied: the greedy-then-oldest scheduler and the round-robin scheduler.
We also examined the impact of less contended SM resources by evaluating a system with a 256KB register file, 96KB shared memory, 32 maximum CTAs and 64 maximum warps. Warped-Slicer still significantly improves the performance and fairness of the baseline policy, both by 26%. Since software written for future GPUs will utilize more and more registers and shared memory resources, we believe that our schemes, which target resource contention and efficient partitioning, will be increasingly important for future generations of GPUs.

4.5.9 Implementation Overhead

We synthesized the various profiling counters and the global sampling logic required by Algorithm 1 using the NCSU PDK 45nm library [7]. The set of counters for sampling occupies 714 um^2 per SM, and the global counters and logic for Algorithm 1 occupy 0.04 mm^2. We also extract the power and area of 16 SMs from GPUWattch [99]: 704 mm^2, consuming 37.7W of dynamic power and 34.6W of leakage power. The total area overhead of our proposed approach for 16 SMs is 0.05 mm^2, resulting in only 0.01% area overhead. The total dynamic power is 54mW and the leakage power is 0.27mW, accounting for 0.14% dynamic power and 0.001% leakage power overhead.

4.6 Related Work

Workload Selection and Task Scheduling in CPUs: Several approaches have addressed the workload selection and assignment problems in CPUs. Snavely et al. [141, 142] first proposed the SOS scheduler, which uses profile-based knowledge to select the best symbiotic co-runners. Several other approaches [56, 42, 129, 54, 44, 137] proposed CPU performance models to construct optimal workload assignments. For optimizing thread scheduling, on-line characterization techniques have been explored for simultaneous multithreaded (SMT) processors. For instance, Choi and Yeung [49] proposed on-line resource partitioning based on performance monitoring. In SMT-enabled CPUs, there are only a few threads; hence, continuous performance monitoring to decide resource partitioning is acceptable.
Our approach monitors the performance versus resource allocation using a small profile run to characterize the full application behav- ior, which is inspired by existing leader-follower style sampling techniques in CPU do- main [128]. Our sampling approach is similar to [49], but different in that we sample with varying number of CTA counts concurrently on different SMs in a GPU thereby collecting the required profiling metrics in a single short profiling phase. For partitioning on-chip resources, existing works in SMT primarily focused on cache partitioning [91, 128, 157, 43, 74]. In CPUs, the fraction of physical registers and cache can be easily adjusted by controlling the number of in-flight instructions of each thread. However, in GPUs, such dynamic control is not available since each CTA gets all its resources at once. Once a CTA is assigned to an SM, register file and shared memory must be allocated statically and cannot be released until the CTA is completed. Thus, allocation-time resource scheduling is more important in GPUs and hence we focused more on register file and shared memory partitioning rather than caches. In addition, GPUs have a much smaller L1 cache with lower hit rate, since GPU’s L1 cache is shared by thousands of threads. Multiprogramming on GPUs: Several software-centric GPU multiprogramming approaches have been studied. Jiao et al. [84] proposed power-performance models to identify the optimal kernel pairs that can achieve better performance per watt. Their models statically estimate energy efficiency of any pair of applications so that the optimal kernel pair can be multiprogrammed together. Pai et al. [123] proposed elastic kernel design that adjusts each kernel’s resource usage dynamically subject to the available hardware resources. They modified application code to run a special function that adjusts resource usage based on current resource utilization. Zhong and He [173] proposed a dynamic kernel slicing that partitions a kernel into several smaller kernels so that multiple kernels can more efficiently share the resources. Adriaens et al. [28] proposed spatial multitasking, which runs multiple applications on different sets of SMs. Ukidave et 110 al. [148] explored several adaptive spatial multiprogramming approaches. Aguilera et al. [29] pointed out the unfairness of the spatial multitasking and examined several task assignment methods for more fair resource usage and better throughput by distributing workloads across SMs. Gregg et al. [67] developed a run-time kernel scheduler that merges two OpenCL kernels so that the kernels can run on a single-kernel-capable GPU. These software-centric multiprogramming methods improved concurrency and per- formance significantly by refactoring kernels and rewriting application code. In many situations, it may not be feasible to modify every application for improving concurrency. Also, once a kernel is sliced, the sliced kernel’s size cannot be adjusted in the run time, which might cause another inefficiency. Our study proposes a hardware mechanism that does not require any application code modification. We also provide a novel method to determine the best multiprogramming strategy in the run time, which is not applicable to the software-centric approaches. Recently, several studies show a new trend of hardware support for multiprogram- ming. The HSA Foundation [60] standardized hardware and software specifications to run multiple applications on heterogeneous system architectures. 
In the specification, they also cover the execution of multiple applications in the same GPU, which uses multiple simultaneous application queues which are similar in spirit to the NVIDIA’s Hyper-Q. Still, there is not a detailed explanation of how the applications are assigned to execution cores. Wang et al. [154] proposed a dynamic CTA launching method for irregular applica- tions. They observed that CTA-level parallelism is better than kernel-level parallelism for resource utilization, especially in irregular applications. They proposed a runtime platform that supports dynamic CTA invocation. This work is orthogonal to our ap- proach. We focus more on efficient multiprogramming strategy rather than improving application parallelism. We used a performance model to determine the optimal resource partitioning between two different applications and each application’s kernel parameters. 111 Preemptive scheduling on GPUs: Preemptive scheduling context switches one ker- nel with another kernel to enable multiprogramming.The main challenge of preemptive scheduling is the high overhead of context switching. Tanasic et al. [145] explored classic context switching and draining. Classic context switching stops running kernel to save the current context to the memory and then a new kernel is brought in. Park et al. [125] added another preemptive scheduling algorithm, flushing. Flushing detects idempotent kernels, which generate exactly the same result regardless how many times the kernel is executed. Then, to run an urgent kernel, they drop a running idempotent kernel and yield corresponding resources to the urgent kernel. Later, the dropped idempotent kernel is re-executed from the beginning. Recently, Yang et. al. [163] proposed a partial context switching technique to allow fine-grain sharing by multiple kernels within each SM. This approach tries to resolve the long context switching delay suffered during preemption. These studies are orthogonal with our study because our study focuses more on multiple kernels’ concurrent execution rather than temporal GPU sharing. Dynamic execution parameter adjustment: Kayiran et al. [90] explored dynamic workload characterization to adjust the number of CTAs on the fly, which derives better overall performance. Lee et al. [97] proposed a profiling-based single kernel execution optimization for GPU. Lee et al. [95] proposed a dynamic voltage scaling for better throughput under a power budget. They periodically check voltage, frequency and SM core count to adjust the three parameters in the next epoch. Sethia and Mahlke [136] proposed a hardware runtime system that dynamically monitors and adjust several pa- rameters, such as the number of CTAs, core frequency, and memory frequency. The proposed runtime system can be configured either to improve energy efficiency by throt- tling underutilized resources or to improve performance by boosting core frequency and adjusting the number of CTAs. These four studies used single kernel execution environ- ment as their baseline. On the other hand, our approach determines the optimal resource partitioning to enable multiple kernels concurrently run on the same SM. 112 4.7 Chapter Summary In this Chapter, we present a novel approach for efficiently partitioning resources within an SM across multiple kernels in GPU. The algorithm we proposed follows the water- filling algorithm in communication networks, but we apply it to efficient resource sharing problem within an SM. 
We demonstrate how to implement this algorithm in practice using a short profile run to collect the statistics required for executing the algorithm. We then evaluated our proposed design on a wide range of GPU kernels and showed that our proposed approach improves performance by 23% over a baseline Left-Over multiprogramming approach.

Chapter 5
GPUGuard: Eliminating Covert Channel Attacks on GPUs

5.1 Chapter Overview

In the previous chapter, we demonstrated that running concurrent kernels on the same SM can significantly reduce resource underutilization by sharing the register file, shared memory, cache and execution units. While concurrent kernel execution improves resource utilization, it also brings security concerns to the forefront. When multiple kernels access the same type of resource, one kernel can secretly gain knowledge about the other kernel, based on how and when resources are utilized, through various timing attacks. In this chapter, we consider the problem of hardening GPUs against these implicit communication channels. Our solution first develops approaches for detecting patterns of contention that are consistent with covert channel measurements. Once contention is detected, we explore temporal, inter-SM and intra-SM partitioning for closing the channel. We propose GPUGuard, a detection-based intra-SM defense scheme that can reliably close the covert channels. Our results show that GPUGuard can reliably detect contention with high accuracy. Compared to a baseline temporal partitioning, GPUGuard achieves 54% speedup (63% on Kepler) with minimum hardware overhead, showing that it is possible to exploit multiprogramming while securing GPUs against these attacks.

5.1.1 Timing Attacks on GPUs

Microarchitectural covert channel timing attacks exploit timing variations caused by contention on microarchitectural hardware resources. This contention creates an indirect information flow channel between the kernels that share the resource. This channel can be used intentionally to communicate secretly, as in the case of covert channel attacks. Alternatively, one malicious kernel may observe the contention behavior to extract secret information from a victim kernel if the contention can be correlated to sensitive data; this type of attack is called a side channel attack. As an example scenario, a trojan application can create contention on a shared resource, for instance by replacing the contents of a cache set, to encode a '1', and leave the resource idle to encode a '0'. The spy application, on the other side, accesses the cache and measures its access time to decode the transferred bit. Similarly, a trojan application can create contention by excessively using the execution units, warp scheduler and instruction fetch units to encode a '1', and leave those resources idle to encode a '0'. The spy application, on the other side, accesses those shared resources and measures its execution time to decode the transferred bit.
Preventing covert and side channels is a difficult problem. Side channel attacks are difficult to launch but easier to defend against, since the victim is not colluding with the attacker as in the case of covert channel attacks. In particular, in a covert channel attack, the trojan can collaborate with the spy to bypass defenses that prohibit communication between the two parties. Thus, defenses against covert channel attacks are more general and difficult than defenses against side-channel attacks.
For example, randomizing the sharing of cache sets may defeat a side channel attack, but the colluding covert channel programs may probe the cache to discover contention and defeat the remapping. In the context of GPUs, there are two approaches for reducing information leakage through shared resources: (1) temporal partitioning, where the kernels are separated in time; and (2) spatial partitioning, where the kernels are allocated separate resources.

5.1.2 Temporal Partitioning

Figure 5.1: Illustration of the temporal partitioning technique on GPUs.

Enforcing temporal partitioning by allowing only one kernel at a time to have exclusive access to the GPU is the most straightforward way to defend against all of these timing attacks, but it can lead to significant performance overhead. In Figure 5.1, the time domain is evenly sliced into multiple time slots, and each kernel can only execute in its own time slots, no matter how long the kernel is. As in the example, the spy kernels (blue) can only execute in the odd time slots, while the trojan kernels (yellow) can only execute in the even time slots. GPUs must rely on kernel-level preemption to enforce this scheduling: when K1 reaches the end of its assigned slot, the GPU needs to save the kernel context, preempt the kernel and then schedule the next kernel K2 to run on the GPU. After K2 has used up its slot, K1 resumes execution from its saved context. In this case, the execution of the kernels is completely isolated between the two security domains. As shown, the kernel waiting times t1 and t2 are now independent of the execution time of the other kernel. As the spy kernels and the trojan kernels are not co-running on the same GPU, kernel execution time and access latency are independent of each other. Therefore, temporally partitioning the GPU can effectively eliminate all the timing attacks mentioned above.
Context switches for kernel preemption on GPUs are considered more expensive than on CPUs [125, 163]. Modern GPUs support up to 2048 active threads per SM [23] and each thread has access to its own registers in addition to a shared scratchpad memory. In the latest Pascal architecture, threads in a single SM can access a 256KB register file and 96KB of shared memory. Saving these large contexts not only increases the context switching latency but also reduces the system throughput, as no work can be performed in the system during context switching. In our experiments, we found that the preemption can lead to over 100% performance penalty.

5.1.3 Spatial Partitioning

Figure 5.2: Illustration of existing spatial partitioning techniques on GPUs: (a) spatial partitioning, (b) spatial partitioning with dynamic load balancing.

Spatial multitasking [28] has been proposed to partition GPU resources across multiple kernels. The SMs are partitioned into one subset that runs compute-intensive workloads and another that runs memory-intensive workloads. This technique can be easily adapted to create multiple security domains on the same GPU. Figure 5.2a shows an example of creating two security domains on the GPU using spatial partitioning. We assume that there are 16 SMs and that the trojan application occupies SMs 0-7 while the spy application occupies SMs 8-15. Because the kernels are separated on different SMs, no timing channels can be established through the execution units or L1 caches inside an SM.
5.1.3 Spatial Partitioning

Figure 5.2: Illustration of existing spatial partitioning techniques on GPUs: (a) Spatial Partitioning and (b) Spatial Partitioning with Dynamic Load Balancing (axes: Time vs. SM number; spy kernels K1, K3, K5 run on SMs 0-7 and trojan kernels K2, K4 run on SMs 8-15).

Spatial Multitasking [28] has been proposed to partition GPU resources across multiple kernels. SMs are partitioned into one subset that runs compute intensive workloads and another that runs memory intensive workloads. This technique can be easily adapted to create multiple security domains on the same GPU. Figure 5.2a shows an example of creating two security domains on the GPU using spatial partitioning. We assume that there are 16 SMs, the trojan application occupies SMs 0-7 and the spy application occupies SMs 8-15. Because the kernels are separated on different SMs, no security channels can be established through execution units or L1 caches inside an SM. Therefore, kernel execution time and access latency are independent between kernels. Meanwhile, since the two kernels execute in parallel, the kernel waiting time is also not affected by the other kernel.

Spatial partitioning is more efficient than temporal partitioning since it does not require kernel level preemption. However, one drawback of this technique is that it does not eliminate timing attacks through global resources such as the L2 cache, memory channels and the interconnection network.

An alternative approach introduced in the recent NVIDIA Pascal architecture [23] is called Asynchronous Computing with Dynamic Load Balancing. On top of spatial partitioning, dynamic load balancing allows a kernel to harness available resources if they become idle, as shown in Figure 5.2b. While this mechanism can improve GPU resource utilization, it may pose security vulnerabilities. The execution times of K3 and K5 now differ from the execution time of K1, which leaks the information that K4 is shorter. Therefore, in the rest of the chapter, we assume the dynamic load balancing feature is disabled for spatial partitioning.

Given the drawbacks of temporal and spatial partitioning, in the next section we introduce GPUGuard, our proposed scheme to reduce information leakage through shared resource usage. This chapter makes the following contributions. We propose GPUGuard, which uses a detection scheme to dynamically decide between temporal protection and intra-SM protection. We first build a machine learning model for threat detection and classification with high accuracy. We then design an intra-SM security protection architecture for defending against timing attacks on resources shared inside an SM. The defense scheme is shown to reliably close intra-SM timing channels.

The rest of this chapter is organized as follows: Section 5.2 describes the decision tree based machine learning model for threat detection and classification. Section 5.3 proposes a GPU intra-SM security protection architecture. Section 5.4 presents the evaluations. We discuss related work in Section 5.5.

5.2 GPUGuard Design

In this section we describe GPUGuard, a holistic protection framework for GPUs to detect and defend against timing attacks. GPUGuard consists of a detection component and a defense component. We present each of these components in the next two subsections.

5.2.1 Timing Attack Detection

Figure 5.3: Overall design of GPUGuard: kernels from four concurrent applications are monitored by the runtime threat detection and classification unit and, when an insecure execution is detected, the kernels are rescheduled into isolated security domains SD1, SD2 and SD3.

Figure 5.3 illustrates the overall design of the GPUGuard framework. For illustration purposes we assume that there are four applications concurrently running on the GPU, including two regular applications, a trojan application and a spy application. Each application launches kernels to the same shared GPU, and they may be assigned to execute on the same SM as well. The GPUGuard detector continuously monitors the execution status, resource utilization and various other performance counters (e.g., cache miss rates) for the different active kernels. A selection of features is computed periodically using the collected statistics. These features are then classified by a machine learning classifier to detect whether there are suspect timing channels between any two concurrent kernels.
The classifier not only detects suspected timing channel presence, but it can also further determine which shared resources are being used to establish the covert timing channels.

5.2.1.1 Decision Tree Classifier

We use decision trees as our classification algorithm. Decision trees [127] are a type of supervised learning algorithm in which the classification model is built by breaking down a dataset into progressively smaller subsets while developing an associated tree structure that can be followed to classify new instances. The most important advantages of decision trees over other classification models for our purposes are: (1) the model is human readable; (2) relevant and irrelevant attributes are easily isolated through information gain; and (3) they are well suited to our setting, where we are only interested in the threat category rather than the precise value of a target metric. The decision tree structure has decision nodes, with two or more branches, and leaf nodes, representing a classification outcome. An algorithm to build the tree, called ID3, was presented by Quinlan [127]; it uses entropy and information gain to identify appropriate decision points, and employs a top-down greedy search through the space of possible branches with no backtracking. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute. This process is then repeated until a leaf node is reached [116]. We use this algorithm to build the decision tree based classifier for GPUGuard.

5.2.1.2 Feature Dataset Extraction

The decision tree model is built using a training input set consisting of a large collection of features that correspond to various resource utilization statistics. In our experiments we started with a total of 234 features that correspond to a broad range of resource utilization statistics. Table 5.1 lists all the features collected as inputs to the decision tree model. We then selected 40 normal applications from a wide variety of benchmarks, including Rodinia [46], ISPASS [34], Parboil [143] and the NVIDIA CUDA SDK [10]. These represent normal application kernels without any malicious intent.
Instruction Features: # of SP-INT, SP-FP, SFU and Load/Store operations issued in the issue stage; # of decoded ALU, SFU, ALU-SFU, INT, FP, Load and Store operations; # of decoded Branch, Barrier, Memory barrier, Call, Ret and Atomic operations; # of decoded INT MUL/DIV, FP MUL/DIV, SP-FP and FP sqrt/log/sin/exp operations; Pipe-SP/SFU/MEM (INT/FP/SFU/MEM instructions processed in the decode stage); # of decoded Tex/non-Tex operations; cache related instruction opcodes such as LD OP and LDU OP; Total stall: # of cycles the warp scheduler issues no warp to execute; Total issue: # of cycles the warp scheduler issues a warp to execute.
Functional Unit Features: SP-INT, SP-FP, SFU and Load/Store utilization at the execute stage.
Memory and Cache Features: Lmem stall: # of cycles the warp scheduler is stalled by long memory latency; L1D, L1C, L1I, L1T and L2 accesses/misses/evictions; L1C, L1D, L1I and L1T accesses per set; L1D read and write hits/misses per set; L2 read and write hits/misses per bank.
Table 5.1: All collected features used to create the dataset.

We then hand-coded 20 pairs (Spy and Trojan) of malicious applications as attack benchmarks that try to communicate data between the Trojan and the Spy by using timing information on the shared resources. For the attack benchmarks, we considered different types of covert channels, including inter-SM and intra-SM hardware resource sharing, constant cache covert channels similar to those presented in [119], as well as new attacks that create covert channels on other shared resources such as global memory and the L2 cache through atomic operations, double precision functional units and special functional units shared within SMs. We also include a prime-and-probe side channel attack on the constant cache and run it in conjunction with normal programs. We split both the normal applications and the attacks into separate training and testing sets.

Each of the normal applications and the spy-trojan pairs is run on the GPGPU-Sim simulator [34] with a Kepler-like configuration to collect the 234 features mentioned above. During execution the simulator generates utilization counts for the 234 features every thousand cycles. Thus, every thousand cycles we generate a 234-entry feature vector.

We label each vector based on the attack type on a given resource. We have four classes of covert channel attacks in addition to the normal execution mode in our dataset, as listed in Table 5.2. Instruction and data cache attacks occur if the two kernels communicate either through instruction cache or data cache access timing. While a number of global memory attacks may be feasible in theory, in our experiments we noticed that some attacks (such as through L2 cache access timing) are not easy in practice due to extreme noise. However, we note that one global memory attack using atomic operations created a covert channel with relative ease. Hence, the global memory attack class below is primarily due to atomic operations. The double precision and SFU attacks, as implied by the name, are covert channels formed by observing the timing of accesses to these resources in the pipeline.

Class label   Security Type
0             Normal Application
1             Instruction and Data Cache Attack
2             Global Memory Attack with Atomics
3             Double Precision Units Attack
4             Special Functional Units (SFU) Attack
Table 5.2: Class labels for different applications.
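To make the classification pipeline concrete, the sketch below shows how a labeled dataset of per-interval feature vectors of this kind could be used to train a decision tree, prune unimportant features and produce windowed predictions (anticipating the feature selection and voting described in the next subsection). The use of scikit-learn, the entropy splitting criterion as a stand-in for ID3's information gain, the importance threshold and the file names are all illustrative assumptions, not the exact toolchain used in this work.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# One 234-entry feature vector per 1000-cycle interval; labels follow Table 5.2.
# The .npy files are hypothetical placeholders for the simulator statistics dumps.
X_train = np.load("train_features.npy")    # shape: (num_intervals, 234)
y_train = np.load("train_labels.npy")      # values in {0, 1, 2, 3, 4}

# Fit a decision tree using information-gain style (entropy) splits.
clf = DecisionTreeClassifier(criterion="entropy")
clf.fit(X_train, y_train)

# Keep only features whose importance exceeds a small threshold.
important = np.flatnonzero(clf.feature_importances_ > 0.01)
print(f"{len(important)} of {X_train.shape[1]} features retained")

# Retrain on the pruned feature set and classify each test interval.
clf_small = DecisionTreeClassifier(criterion="entropy")
clf_small.fit(X_train[:, important], y_train)
X_test = np.load("test_features.npy")
per_interval = clf_small.predict(X_test[:, important])

# Majority vote over a window of 10 consecutive intervals (10,000 cycles).
window = 10
per_window = [np.bincount(per_interval[i:i + window]).argmax()
              for i in range(0, len(per_interval) - window + 1, window)]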
5.2.1.3 Feature Selection

The 58,730 feature vectors were divided based on whether they came from the training or the testing dataset. The data in each of the sets was obtained from running only the benchmarks and attacks that belong to that set. The decision tree model was trained using the training input set and uses the attack type tag that is associated with each of the training vectors. Once the training is complete, the output is a decision tree that also indicates the weight of each feature in determining the attack type. While 234 features were initially selected, once the decision tree model was trained we were able to prune unimportant features. To remove unimportant features from our dataset, we fit a binary classification decision tree for multiclass classification and use this model to compute an importance factor for each feature based on the training subset. Through this selection, we were able to reduce the 234 features to just 24 important features; 210 features were eliminated as unimportant.

The goal of GPUGuard is to detect covert channel attacks as efficiently as possible in real time. Hence, pruning the feature set to just 24 counters makes it highly effective to collect and test each feature vector against the trained decision tree model. Since the attack benchmarks in our dataset include those that create contention on constant caches, functional units, and the L2 and global memory through atomic operations, the most important features are related to the statistics of these resources. The selected important features are listed in bold font in Table 5.1. Note that many of these features can already be monitored on existing GPU architectures using built-in performance counters. Only a few additional features that we collect are currently not available for monitoring, which we assume GPU vendors will be able to add in future GPUs.

We then classify the test set based on our decision tree model using two-fold cross validation. Based on the classification result for each instance (1000 cycles), we consider a window of 10 instances to evaluate the accuracy of our model. The predicted class of each window is defined as the most frequent value among the predicted classes of the instances in the window. Section 5.4 evaluates the accuracy of our online classification based detection.

5.2.2 Timing Attack Defense: Security Domain Hierarchy

The second component of GPUGuard is to defend against covert channel attacks once the attack is detected and classified using the decision tree model described in the previous subsection. If the GPUGuard classifier suspects a timing channel between two kernels, the defense mechanism will reschedule the suspected kernels into isolated security domains. For example, in Figure 5.3, GPUGuard identified that Kernel 3 and Kernel 4 are suspicious. The GPU now creates three isolated security domains (SD1, SD2 and SD3) and issues Kernel 1 and Kernel 2 to SD1, Kernel 3 to SD2, and Kernel 4 to SD3. In this way, the timing channel between Kernel 3 and Kernel 4 is closed.

Figure 5.4: GPUGuard selects a security domain level based on the specific attack type, ranging from the coarse-grained GPU level, to the SM level, to the fine-grained resource level (our contributions are highlighted).

GPUs comprise a scalable number of identical SMs and each SM has a number of parallel execution lanes.
The presence of multiple parallel execution lanes provides us with an opportunity to create even finer-grain security domains where two kernels may still execute on the same SM but there is no information leakage. Figure 5.4 shows how a GPU security domain can be established from the coarser grain GPU level down to the finest grain resource level. At the highest level, the entire GPU may be assigned a single security domain. The next level is to spatially group the SMs into multiple security domains. As shown in Figure 5.4, we can partition the four SMs into two security domains SD1 and SD2, each containing two SMs. However, such approaches do not allow for resource sharing across kernels. Hence, GPUGuard uses intra-SM partitioning: partitioning the parallel execution lanes or utilizing other underutilized resources inside an SM to create multiple security domains. As an example, assuming that we have four execution lanes in the special functional units inside an SM, we can assign the first two lanes to SD1 and the remaining two lanes to SD2. This can be achieved through security aware warp folding, which we will introduce shortly. Note that many GPU workloads have shown significant warp level divergence [166, 81], and lane level partitioning in many cases improves resource utilization.

The crux of dissecting intra-SM resources into security domains is to find non-overlapping resource subsets that can successfully execute the kernels. This approach is similar to a dissection puzzle called Tangram. We refer to GPUGuard's defense mechanism alone as the Tangram unit. Figure 5.5 presents the microarchitecture of Tangram.

Figure 5.5: Intra-SM security protection for eliminating timing attacks (shaded units are our modifications). The streaming multiprocessor contains a shared frontend (L1 I-cache, L1 fetch arbiter, decode, per-warp I-buffers, scoreboards, SIMT stack, warp scheduler and select/issue logic), a 32-bank register file feeding four 8-wide datapath slices (lanes 0-7, 8-15, 16-23 and 24-31) through 1-to-4 channel de-multiplexers and 4-to-1 channel multiplexers into a common writeback stage, memory units (texture cache, shared memory, L1 D-cache and constant cache) behind a request traffic controller connected to the interconnect and global memory, and the Tangram Security Unit, which holds the Security Domain Table (WarpID, SDID), the Tangram Table (SDID, IF, WarpSched, Datapath, Cache) and control logic, and is driven by the global GPUGuard unit. Numbered callouts 1-11 in the figure are referenced in the text.

5.3 Tangram: Intra-SM Security Domains for Eliminating Timing Attacks

5.3.1 Intra-SM Security Domains in Tangram

Recall that the GPUGuard classifier indicates which type of covert channel may be present between two malicious kernels. The goal of the intra-SM resource slicing is to prevent the four types of covert channels, namely the cache attack, the global memory attack through atomic operations, the double precision unit attack and the special functional unit attack. As such, Tangram uses four types of resource slicing to prevent these four attacks. Tangram relies on creating fine grained temporal/spatial partitioning for pipeline stages and memory units inside an SM, such that there is no leakage across the partitions.
The partitioning includes sliced data pipelines ( 1 ), a memory request traffic controller ( 2 ), and rate limited scheduling in the warp scheduler ( 3 ) and the L1 fetch arbiter ( 4 ). The maximum number of security domains for each resource is set to four. Hence, at most four suspicious kernels can be isolated in a given SM. This is a reasonable limitation since more than four concurrent kernels inside the same SM show diminishing returns in performance benefits in our evaluations.

Datapath Slicing Based Security Domains: Our baseline has 32 execution lanes. Datapath Slicing splits the 32 execution lanes into four slices, each with eight lanes. Each slice is allocated to a single kernel, thereby preventing one kernel from observing the SFU and double precision unit usage of another kernel. Note that datapath slicing does not change the number of threads inside a warp or the number of register file banks in the SM. Datapath Slicing folds the 32 threads in a warp into four quarter-warps, which are then issued in succession. The threads in a warp are shifted in successive cycles to align with the slice in a linear fashion: threads 0-7 are mapped to lanes 0-7 in the first cycle, threads 8-15 are mapped to lanes 0-7 in the second cycle, and so on.

Datapath slicing allows concurrent warps from different security domains to be executed simultaneously on isolated datapath slices. Each warp is folded into multiple sub-warps, which are issued onto the execution pipelines in succession. With four slices, each warp executing on a slice needs four consecutive cycles, incurring a three-cycle delay for a given warp. While a sliced pipeline delays the execution of each warp, the total throughput of the GPU is similar to, or in some cases even better than, a unified 32-lane pipeline. In fact, when there are control divergence and pipeline bubbles, we see that a sliced pipeline can help fill the idle resources and improve performance in many cases. Prior works such as Warp Folding [27] and Variable Warp Sizing [132] have also suggested splitting the 32 threads in a warp into smaller warp sizes to exploit opportunities from the control flow divergence that is prevalent in general purpose workloads. While these prior works focus on performance and energy improvement through warp subdivision, we focus on using sliced data paths to create timing-isolated security domains.

Variable Warp Sizing relies on a gang table to split and reform warps, and Warp Folding only allows a 32-wide pipeline to execute a single warp instruction in any cycle. However, Tangram allows four warps to execute concurrently. Hence, Tangram relies on some additional hardware support for executing multiple sub-warps concurrently. Tangram adds a set of 32 registers immediately before and after each data pipeline slice ( 11 ), so that the registers can be consumed and updated over multiple successive cycles. In addition, two shuffling logic units ( 8 and 9 ) are used to select which slice of the data pipeline the registers are going to be forwarded to. The decision is based on the warp id ( 5 ), its corresponding security domain id and the resource assigned to that domain ( 6 ) in the Tangram Security Unit, which is described shortly. The proposed design does not require any modification to the interconnection network between the register banks and the data pipeline. In fact, the interconnection will still forward the registers into the same original 32-wide registers of the pipeline.
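To make the folding concrete, the small sketch below enumerates the quarter-warp schedule that a sliced datapath implements: in each cycle, eight threads of the folded warp occupy the eight lanes of the slice assigned to the warp's security domain. Pinning each security domain to exactly one slice is an illustrative simplification of the 4-bit slice assignment described later, not the hardware logic itself.

WARP_SIZE = 32
SLICE_WIDTH = 8
NUM_SLICES = WARP_SIZE // SLICE_WIDTH   # four 8-lane slices

def fold_warp(security_domain):
    """Yield (cycle, thread_id, slice_id, lane) tuples for one folded warp.

    Threads 0-7 issue in cycle 0, threads 8-15 in cycle 1, and so on, all on
    the single slice assumed to be assigned to this security domain, while a
    warp from another domain can occupy a different slice in the same cycle.
    """
    slice_id = security_domain % NUM_SLICES   # simplified slice assignment
    for cycle in range(NUM_SLICES):
        for lane in range(SLICE_WIDTH):
            thread_id = cycle * SLICE_WIDTH + lane
            yield cycle, thread_id, slice_id, lane

# Example: print the mapping of the first quarter-warp of a warp in SD2.
for cycle, tid, sid, lane in fold_warp(security_domain=2):
    if cycle == 0:
        print(f"cycle {cycle}: thread {tid} -> slice {sid}, lane {lane}")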
Only after this original 32-wide pipeline register do we add a 1-to-4 channel demultiplexer ( 8 ) to shift the register contents into the additional add-on 32-wide register of a particular slice. The write-back process is similar: we make no modifications after the original write-back register of a particular unit. The results of a slice are stored locally in the add-on registers and shifted out to the original 32-wide write-back register through the 4-to-1 multiplexer ( 9 ). After each sub-warp finishes execution, the caching register simply needs to shift left by 8 bytes to feed the next sub-warp.

Instruction Fetch Arbitration Across Security Domains: In addition to separating contention for the datapath resources, Tangram also provides isolation for channels that use the caches and the memory system. In this chapter, we focus on the cache and memory units inside the SM. For global memory attacks, primarily through atomic operations, we simply fall back on temporal partitioning of the SM. In particular, we context switch the two malicious kernels (without perturbing the normal kernels). Since context switching is an expensive operation, as stated earlier, we only use this option for tackling covert channels formed through atomic operations.

Tangram prevents instruction cache attacks using instruction fetch arbitration. A malicious kernel may intentionally saturate the instruction fetch bandwidth, which can saturate easily. In order to defend against instruction cache attacks, Tangram alters the control unit ( 4 ) in the L1 fetch arbiter so that it successively fetches from different security domains in a round-robin manner. Therefore, each security domain gets fair access to the L1 I-cache.

Data Cache Re-purposing and Bypassing: On the data front, each SM can access shared memory, the L1 data cache, the constant cache and the texture cache. Among those units, the shared memory is preassigned at the beginning of each kernel; therefore, shared memory allocations across kernels are isolated. On the other hand, previous experiments show that the constant and texture caches are vulnerable [119] because of their comparably smaller size, and using them across multiple kernels can create high bandwidth covert channels.

Cache partitioning is an efficient mechanism on CPUs to guarantee fairness and defend against timing attacks. However, many recent GPU works have shown that many GPU workloads suffer from high L1 miss rates and under-utilized cache resources, making cache partitioning potentially expensive. Based on these observations, we advocate for cache redirection instead of partitioning.

To mitigate covert channels from constant cache accesses, Tangram dynamically re-routes traffic from the constant cache to the L1-D cache. For this purpose, Tangram first detects the attack type as a constant data cache attack, and then the constant values accessed by one malicious kernel are moved to the L1-D cache. Once an attack is detected, Tangram watches for constant data load operations from one of the malicious kernels. The load address is looked up in the constant cache. If there is a hit in the constant cache, Tangram marks it as a miss and then places a miss fill request to bring the constant data into the L1-D cache. For this purpose, Tangram uses the constant data address to look up the L1-D cache and find a victim cache line. The victim cache line is evicted and the constant data is then stored in that line. Thus, some of the L1-D cache space is used to store constant data for either the spy or the trojan kernel, and the covert channel through the constant cache is eliminated.
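The routing decision can be summarized as a small piece of pseudologic; this is an illustrative sketch under simplifying assumptions (a single attack class covering cache attacks, a string-encoded memory space, and the L1-D bypass path described in the next paragraph), not the actual traffic controller design.

# Attack classes from Table 5.2.
NORMAL, CACHE_ATTACK, ATOMIC_ATTACK, DP_ATTACK, SFU_ATTACK = range(5)

def route_request(attack_class, kernel_is_suspect, mem_space):
    """Return where a memory request should be serviced.

    mem_space is a hypothetical encoding: 'const', 'global', 'shared', 'texture'.
    """
    if attack_class != CACHE_ATTACK or not kernel_is_suspect:
        return "default"      # normal path: constant cache or L1-D cache
    if mem_space == "const":
        return "L1D"          # re-purpose the L1-D cache for constant data
    if mem_space == "global":
        return "bypass"       # treat the suspect kernel's L1-D accesses as non-cacheable
    return "default"

# Example: a constant load from a suspected kernel during a cache attack.
print(route_request(CACHE_ATTACK, True, "const"))   # -> "L1D"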
Finally, covert channels formed through the L1 data cache itself are eliminated through cache bypassing. In fact, previous studies show that the GPU L1-D cache miss rate is so high that performance often improves when the L1-D cache is bypassed [100, 165, 146]. Therefore, we selectively bypass the L1-D cache requests if the attacks are instead detected on the L1-D cache. Note that it is not possible to re-purpose the read-only constant cache for the potentially read/write operations of a regular L1-D cache. Therefore, we simply pick either the spy or the trojan kernel and mark all its load/store operations as non-cacheable.

Rate Limiting Warp Scheduler: The GPU warp scheduler selects which warps will be issued to execute in the next cycle from a pool of all the active warps in the SM. The warps from multiple security domains can compete for the scheduling bandwidth and mount timing attacks. Our baseline warp scheduler selects the next available warps to issue in last-issued-first and then oldest order. When all the warps from a kernel are stalled, all the scheduling cycles are given to the next kernel. On the other hand, if the warps from a kernel are always ready to execute, that kernel can consume all the scheduling bandwidth and starve the other kernel. This interference in scheduling can be manipulated for timing attacks. Therefore, we enhance the warp scheduler with a rate limiter ( 3 ), so that scheduling cycles are fairly distributed. For example, if one warp scheduler can issue up to two warps in each cycle and there are two security domains, we ensure that only one warp from each security domain can be issued. In the case of four security domains, one warp from each security domain can be issued only every other cycle.

5.3.2 Tangram Security Unit

The various schemes described above defend against different types of attacks. Based on the attack classification, the coordination across the various schemes is handled by the Tangram Security Unit (TSU). When all the kernels are executing normally, the TSU keeps all kernels in a single security domain and none of the resource partitioning schemes described above are activated. However, when the security classifier detects an attack, the warp ids of the two colluding kernels are sent to the TSU. The TSU then activates the resource splitting across security domains. Each kernel is then assigned a security domain id and all warps in that kernel execute within that security domain. The TSU keeps track of the mapping of warp ids to security domain ids in a Security Domain Table (SDT) ( 5 ). The TSU also consists of a Tangram Table ( 6 ), which tracks what resources are assigned to each security domain. The resource assignment scheme ( 7 ) inside the TSU assigns the resource pieces to security domains so that kernels start to execute in isolated security domains.

Each entry in the SDT contains an 8-bit warp id number and a 3-bit security domain number ( 5 ). As long as the GPUGuard unit considers the running kernels to be safe, all the warps stay assigned to security domain zero. Once suspicious activity is detected, the GPUGuard unit immediately notifies the TSU to upgrade the security domain numbers for all the warps: the two suspected kernels are mapped to SD2 and SD3, while the remaining kernels are assigned to SD1.
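As a rough illustration of this bookkeeping, the sketch below models the Security Domain Table and the domain upgrade that happens on a detection event. The data structure shape (a simple per-warp array and a kernel-to-warps map) is an illustrative simplification, not a hardware specification.

class SecurityDomainTable:
    """Maps each warp id to a security domain id; all warps start in domain 0."""

    def __init__(self, num_warps):
        self.domain_of = [0] * num_warps

    def on_detection(self, warps_of_kernel, suspect_a, suspect_b):
        """Upgrade security domains when the classifier flags two kernels.

        warps_of_kernel: hypothetical dict mapping kernel id -> list of warp ids.
        The two suspected kernels are isolated in SD2 and SD3; every other
        kernel is moved to SD1, which closes the suspected channel.
        """
        for kernel, warps in warps_of_kernel.items():
            if kernel == suspect_a:
                domain = 2
            elif kernel == suspect_b:
                domain = 3
            else:
                domain = 1
            for warp in warps:
                self.domain_of[warp] = domain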
As the two new security domains are allocated, the Tangram Table is updated. Each entry in the Tangram Table consists of a 3-bit security domain id, a 3-bit instruction fetch token, a 3-bit warp scheduling token, a 4-bit datapath slice number and a 16-bit cache utilization mode ( 6 ). The instruction fetch token and the warp scheduling token determine how many of the total scheduling slots in a small window the particular security domain is going to take. For example, if the warp scheduling tokens for SD1, SD2 and SD3 are one, one and two, respectively, and the warp scheduler can issue two warps in each cycle, then in a two-cycle window SD1 can issue one warp, SD2 can issue one warp and SD3 can issue two warps. To ensure fair access for the different security domains, all the tokens are initially set to one. The datapath slice number has 4 bits, each corresponding to a datapath slice (recall that we have four datapath slices). The corresponding warp instruction can be issued to datapath slice 0 if bit 0 is set. Similarly, if bit 1 is unset, the warp cannot be issued to datapath slice 1. The last field in the entry is the cache access mode, which is used to provide fine grained security protection of the caches. The cache traffic control logic ( 2 ) decides how to direct the cache requests based on the warp id and the cache mode in the Tangram Table. The cache mode contains 16 bits, 4 bits for each type of L1 storage (i.e., shared memory, L1 D-cache, constant cache and texture cache). Each of the four bits determines which unit the request is actually directed to. If all four bits are zero, the request traffic control logic decides that the request should go to global memory. In this way, cache accesses can always be redirected to other under-utilized resources or to global memory to guarantee execution isolation.

5.4 Evaluation

5.4.1 Methodology

We used GPGPU-Sim v3.2.2 [34], a cycle accurate timing simulator, in our evaluation, and our configuration parameters are described in Table 5.3. We studied 20 GPU covert channel attack applications obtained from the authors of [119]. The benchmarks cover atomic operation attacks, constant cache attacks, and attacks on floating point and double precision execution units and special functional units. For each experimental run of multiple kernel executions, we kept the total number of executed kernels and the simulated instructions the same and compare the execution time speedup under the different architectural modifications.

Parameters         Value
Compute Units      16, 1400MHz, SIMT Width = 32
Resources / Core   max 1536 Threads, 32768 Registers, max 8 CTAs, 48KB Shared Memory
Warp Schedulers    1 per SM, max issue 2 warps per cycle, default GTO
L1 Caches          16KB 4-way L1D$, 4KB 4-way L1C$, 12KB 24-way L1T$
L2 Cache           128KB/Memory Channel, 8-way
Memory Model       6 MCs, FR-FCFS, 924MHz GDDR5
Timing             tCL=12, tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=6
Table 5.3: Baseline configuration.

        P:0    P:1    P:2    P:3    P:4
A:0    2591      5     86     20     64
A:1       0    869      0      0      0
A:2       0      0   1017      0      0
A:3       0     71     13    106    432
A:4       0      0      0      0    599
Table 5.4: Confusion matrix (P: Predicted, A: Actual).

Detection Accuracy: Table 5.4 shows the total confusion matrix, which visualizes the performance of our machine learning based classification. The first column indicates the actual attack type and the first row shows the predicted attack type.
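The false positive and false negative rates quoted below can be recomputed directly from Table 5.4; the following short sketch does exactly that, with the matrix values transcribed from the table.

import numpy as np

# Rows: actual class A:0-A:4, columns: predicted class P:0-P:4 (Table 5.4).
cm = np.array([[2591,    5,   86,   20,   64],
               [   0,  869,    0,    0,    0],
               [   0,    0, 1017,    0,    0],
               [   0,   71,   13,  106,  432],
               [   0,    0,    0,    0,  599]])

normal_total = cm[0].sum()
false_positive_rate = (normal_total - cm[0, 0]) / normal_total  # normal flagged as attack
attack_total = cm[1:].sum()
false_negative_rate = cm[1:, 0].sum() / attack_total            # attack classified as normal

print(f"FPR = {false_positive_rate:.1%}, FNR = {false_negative_rate:.1%}")
# Matches the ~6.3% false positive rate and 0% false negative rate reported below.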
Note that the values in Table 5.4 are prediction results for each window of 10,000 cycles (10 instances). As shown by the strong diagonal, the predicted and actual attacks match extremely closely in most cases, except for the DP attacks, which are misclassified as other attacks at a high rate. We believe that with some additional performance counters we can more effectively classify DP attacks in the future.

The classification accuracy is also measured along the true/false positive/negative dimensions. In our results, 6.3% of regular applications were misclassified as malicious applications (false positives), and 0% of malicious applications were misclassified as regular applications (false negatives). False positive cases essentially pay a performance penalty, while false negative cases evade our detection and defense mechanisms. Note that the true negative rate of the classification is calculated as 93.6% and the true positive rate is 100%.

Performance Impact of Defense Schemes: Note that the performance impact is somewhat irrelevant if the two kernels are in fact maliciously forming a covert channel. Hence, the performance impact is relevant only for those few false positive cases when a normal kernel is inaccurately classified as malicious. As stated earlier, such false positive cases are rare. Nonetheless, we show the performance impact that such a false positive classification is going to suffer. We evaluated the performance impact of multiple defense schemes: temporal partitioning (TP) and spatial partitioning (SP) against GPUGuard. In temporal partitioning, each kernel is assigned an execution window of 50K cycles in a round robin manner. At the end of the 50K cycle window, we preempt the current kernel and switch to the kernels in the next security domain. Note that, as stated earlier, GPUGuard also uses temporal partitioning, but only for global memory attacks using atomics. For spatial partitioning, only one kernel is assigned to any given SM and each kernel takes over half of the total number of SMs. However, spatial partitioning does not prevent global memory covert channels. We evaluated datapath slicing, cache redirection, and fair warp scheduling and instruction fetch in our Tangram defense. The execution unit pipelines are sliced evenly between the kernels and the warp scheduling and instruction fetch slots are evenly assigned. If there is no instruction to issue from the assigned kernel, the scheduler will not issue any instructions in that slot (hence some performance loss if a kernel is falsely classified as malicious). However, compared to spatial partitioning, this turns out to be a more efficient mechanism overall. We turned on cache redirection for constant cache attacks. While one of the kernels continues to have exclusive use of the constant cache, the other kernel uses the L1 data cache for its constant cache requests. If the attacker detects our protection and then changes to L1 data cache attacks, we bypass the L1 data cache requests.

Figure 5.6: Performance impact of Tangram and GPUGuard. The results are normalized to temporal partitioning (bars: SP, GPUGuard, No Protection; y-axis: normalized speedup).

Figure 5.6 shows the performance benefits of our proposed GPUGuard methods on 20 different covert channel attacks compared to the TP baseline. Detailed descriptions of these benchmarks can be found in [119]. CCA1, CCA2, CCO1 and CCO2 are attacks on the L1 constant caches inside an SM on multiple cache sets (CCA) and on a single cache set (CCO).
The numbers 1-2 denote different spy-trojan pairs that use different ways to communicate and use different input sets. SP gained a 2.57x speedup compared to TP, while at the same time achieving zero information leakage on all four attack benchmarks. As the two kernels run on disjoint sets of SMs, establishing timing channels on the L1 constant cache between the two kernels is impossible with SP. Tangram provides the same level of defense as SP but achieves 11% better performance than SP through its clever reuse of the constant and data cache structures.

The execution unit attacks we included are ADD and MUL on the SP units and SQRT on the special functional units. To quantify the performance impact of control divergence, the benchmark labels ending in odd numbers have no control divergence, while the even benchmarks have 25%-50% control divergence. For the execution unit attacks, SP defends the attack by separating the kernels at the SM level, while GPUGuard isolates the execution interference using intra-SM resource slicing. SP improves the performance over TP by 34%, since it avoids kernel preemption. For benchmarks without control divergence, such as ADD1 and ADD3, the performance of GPUGuard is very similar to SP. However, for benchmarks with control divergence, GPUGuard improves performance by over 40% compared to SP, and even by 125% for the SQRT2 benchmark. However, GPUGuard suffers from the additional pipeline delays for MUL benchmarks without control divergence (MUL1 and MUL3). In our baseline configuration, the initiation interval of the MUL application is as long as 16 cycles, and the longer latency caused by warp folding leads to some performance penalties. Overall, however, GPUGuard achieves 12% better performance than SP for execution unit attacks.

ATM1-ATM4 are attacks dominated by atomic operations. The numbers 1-4 denote different spy-trojan pairs that use different ways to communicate through atomic operations and use different input sets. This type of attack targets global memory. Spatial partitioning is entirely incapable of protecting against this attack, while GPUGuard falls back to temporal partitioning for this attack type. Thus, GPUGuard's performance is identical to temporal partitioning for these benchmarks.

Overall, the geometric mean of GPUGuard's performance is 54% faster than that of TP, showing that it is possible to benefit from multiprogramming while maintaining protection against timing attacks. On the other hand, the performance is about 30% slower than a baseline with no protection. This overhead is typical of protections against timing channels; for example, the recent Camouflage CPU memory controller timing channel protection is over 50% slower than the baseline (but, like our design, significantly faster than TP) [174]. Moreover, this overhead is only incurred when an attack is suspected.
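The overall numbers reported above and in the next subsection are geometric means of the per-benchmark speedups normalized to TP. For reference, a one-line helper is shown below; the sample inputs are placeholder values for illustration, not the measured data.

import math

def geomean(speedups):
    """Geometric mean of a list of normalized speedups."""
    return math.exp(sum(math.log(x) for x in speedups) / len(speedups))

# Hypothetical per-benchmark speedups over the TP baseline, for illustration only.
print(f"{geomean([2.8, 1.5, 1.4, 1.0, 1.2]):.2f}x over TP")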
Figure 5.7: Performance impact of GPUGuard on the Kepler architecture (bars: SP, GPUGuard; y-axis: normalized speedup).

5.4.2 Sensitivity Analysis

We also study the sensitivity of our proposed mechanisms on a Kepler-like architecture to validate the performance difference. The Kepler-like architecture has 6 SP units per SM and the warp scheduler can issue four warps per cycle. We observe that all the attacking timing channels are successfully closed on our GPUGuard architecture. Figure 5.7 shows a performance comparison between TP, SP and GPUGuard. For constant cache attacks, GPUGuard achieves 95% better performance than SP, and a 3.64x improvement over TP. For execution unit attacks, the performance benefits of GPUGuard are 24% and 47% compared to SP and TP, respectively. Overall, GPUGuard achieves 63% better performance than TP. Therefore, we show that our mechanisms continue to perform well with architectural scaling, and in fact the gap between SP and GPUGuard increases with scaling.

5.4.3 Robustness of the Detection Scheme

When attackers are aware of the existence of the proposed security detection on the GPU, they are likely to adapt their attacks accordingly to find the weaknesses of the scheme. Therefore, it is important to increase the robustness of our proposed scheme in preparation for such enhanced, smarter attacks. One potential weakness of the decision tree based classifier is that the attackers may infer the decision rules and try to stay within the margin of the corresponding thresholds. For example, in a constant cache attack, the rate at which the benchmark accesses and evicts data in the constant cache is involved in the determination of the attack. A sly attacker could sacrifice communication bandwidth and lower the access rate to the constant cache in order to trick the classifier. In Figure 5.8, we slowed down the attacks by reducing the covert channel bandwidth by 2x, 10x and 100x, and then evaluated the effectiveness of the detector trained earlier in Section 5.2.1.1. Not surprisingly, the slower attacks increase the probability that an attack could stay under the hood: the more irrelevant instructions are added in the code to slow down the attack, the higher the prediction miss rate. When we slowed down the attacks by 10x, the probability that the attacks are not classified correctly increased to as high as 40%.

Figure 5.8: Comparison of the detection miss rate of GPUGuard and enhanced GPUGuard when the attacker tries to evade detection by lowering the attacking bandwidth by 2x, 10x and 100x, respectively (benchmarks: CCA, ATM, MUL, SIN; y-axis: prediction miss rate).

We retrain the model with a more inclusive training set, adding a 2x slower benchmark in each attack category. We can see clearly on the right of Figure 5.8 that the detection miss rates are dramatically reduced. For constant cache attacks (CCA), atomic attacks (ATM) and SFU sin function attacks (SIN), the probability of misclassification is under 5% even when the bandwidth has been throttled down by 100 times. This makes attacking through those resources extremely difficult, since the useful communication bandwidth would be too low.

5.4.4 Hardware Overhead

GPUGuard requires 120 bytes of performance counters in each SM for sampling the selected features for threat detection. Many of those counters may already be provided in modern GPUs [22]. The Tangram security unit requires 78 bytes of counters for keeping track of the security domain IDs and scheduling information. In addition, to support datapath slicing, 1.5KB of registers and 64 bytes of multiplexers are required. We extracted the power and area information for the counters and register files from the NCSU PDK 45nm library [7] and GPUWattch [99]. We estimate that the additional counters, registers and multiplexers occupy 0.13mm2 per SM. We also extract the power and area of 16 SMs from GPUWattch [99], which is 704mm2, consuming 38W of dynamic power and 34.6W of leakage power. The total area overhead of our proposed approach for 16 SMs is 1.9mm2, resulting in only 0.3% area overhead. The total dynamic power is 422mW and the leakage power is 6.6mW, accounting for 1.1% dynamic power and 0.02% leakage power overhead.
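The overhead percentages follow directly from the absolute numbers quoted above; as a quick check, the short sketch below recomputes them using only the figures from this subsection.

# Area and power of the 16-SM baseline GPU, as quoted above.
gpu_area_mm2, gpu_dyn_w, gpu_leak_w = 704.0, 38.0, 34.6
# Additions for GPUGuard and Tangram across 16 SMs, as quoted above.
added_area_mm2, added_dyn_w, added_leak_w = 1.9, 0.422, 0.0066

print(f"area overhead:          {added_area_mm2 / gpu_area_mm2:.2%}")   # ~0.3%
print(f"dynamic power overhead: {added_dyn_w / gpu_dyn_w:.2%}")         # ~1.1%
print(f"leakage power overhead: {added_leak_w / gpu_leak_w:.3%}")       # ~0.02%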
5.5 Related Work

Microarchitectural covert channel and side channel attacks have been widely studied on different resources on CPUs [160, 77, 55, 114, 126, 107, 89]. A recent work studied the possibility of contention-based covert channels on GPGPUs [119] that provide error free and very high bandwidth channels. In the context of side channels, Jiang et al. [83] present an architectural timing attack on GPUs. They observed the correlation between the memory coalescing behavior and the response time of cache line requests and used this information to extract the last round key in AES encryption. The timing is measured from the CPU side, and this channel is not based on creating contention on microarchitectural resources, so it is out of the scope of this chapter.

There are many defense proposals to close side channel attacks on CPUs, which are mostly focused on L1 cache channels. These proposals include: (1) static or dynamic partitioning of resources such as the L1 cache [53, 122, 128, 79, 80], which can introduce unacceptable performance overhead and can only support a limited number of partitions with reasonable overhead, as well as mitigation mechanisms that lock critical cache lines with OS and compiler support [92, 161], which are specific to cache based channels and are not practical on GPUs. Liu et al. [105] propose to partition the LLC into secure and non-secure partitions and to lock the lines of the secure partition to defeat side channels. (2) Randomizing the memory-to-cache mapping, including randomization in the replacement of cache lines in the entire cache [162] and in the cache fill strategy [106]. (3) Adding noise to timing by manipulating the time measurement structures of the processor [113]. Although similar approaches can be applied to close covert channels, mitigating covert channels is usually more difficult than mitigating side channels, since in covert channels two malicious applications collaborate with each other and can find alternative ways to create the channel or restore their operation after corruption.

Online detection of contention based covert communication is an alternative that is useful for closing covert channels. Chen et al. [47] present a framework to detect timing covert channels on shared hardware resources on a CPU by dynamically monitoring conflict patterns between processes. However, their framework is designed to detect an alternating pattern of cache conflicts between the spy and the Trojan and fails to detect variations of the attack, such as channels that access the shared resources concurrently. Yan et al. [169] propose a record and deterministic replay framework that relies on two observations about cache based covert channels: first, a covert channel requires highly tuned malicious accesses to the mapping of addresses to cache locations, so the attack can be disrupted by changing the mapping and consequently the conflict miss pattern; second, the attack exhibits a distinctive cache conflict pattern that repeats over time, so such a repeating pattern should still be distinguishable after changing the mapping.
They record the miss rate timeline by logging the non-deterministic events, then change the mapping and replay the log while recording the new cache miss timeline. By computing the difference of these two timelines and observing a periodic pattern, a covert channel is detected. The overhead of record and replay can be high; moreover, the gap between the recording and the replay introduces a significant delay before the attack is detected.

Our mitigation framework is the first defense proposal for contention-based attacks on GPUs. It supports online detection of contention-based attacks and provides GPU-specific isolation mechanisms to completely close these channels.

5.6 Chapter Summary

In this chapter, we propose GPUGuard, which can dynamically detect and defend against GPU covert channel attacks. The detection algorithm uses a decision tree based design that is able to accurately detect covert channel attacks (100% sensitivity in our experiments) with a small (6%) false positive rate. The detection algorithm feeds the classification results to Tangram, a GPU-specific covert channel elimination scheme. Tangram uses a combination of warp folding, pipeline slicing and cache remapping mechanisms to close the covert channels with much lower performance overhead than temporal partitioning.

Chapter 6

Conclusions

Improving GPU energy efficiency is a vital challenge for wider adoption of GPUs in gaming, graphics and data centers, as well as in emerging areas such as deep learning, autonomous driving and artificial intelligence. GPU energy efficiency can be significantly curtailed by various resource underutilization issues. The case study of graph applications shows that when general purpose applications are ported to run on GPUs, these applications can suffer from inefficient resource utilization due to branch divergence, memory divergence and load imbalance. To tackle the resource underutilization issue, prior studies have proposed power gating unused resources at a coarse grain, such as banks of register files, entire execution units, etc. In this thesis, a fine-grained execution lane level power gating technique, called PATS, was explored. First, the branch divergence behavior of various GPU workloads was studied and the results showed that branch divergence patterns exhibit strong bias and only a few patterns dominate the execution in many workloads. To exploit this divergence pattern bias, a pattern aware warp scheduler that increases the idle time of a SIMT lane was designed and evaluated. The PATS and enhanced PATS++ approaches were shown to outperform prior GPU power gating techniques.

While power gating idle resources saves static energy, the fact remains that idle resources are a wasted opportunity to improve performance. We argued that concurrent kernel execution is a more effective and desirable way to tackle the resource underutilization problems. This thesis described a novel approach for efficiently partitioning resources within an SM across multiple kernels. The proposed algorithm follows the water-filling algorithm used commonly in communication networks, but applies it to the problem of efficient resource sharing within a GPU. This thesis also proposed an effective approach for implementing this algorithm in practice using a short profile run to collect the statistics required for executing the algorithm. The proposed design was evaluated on a wide range of GPU kernels.
The results demonstrated that the proposed approach improves performance by 23% over the state-of-the-art baseline multiprogramming approach.

However, concurrent kernel execution also brings security issues to the forefront whenever there are resources shared among different parties. This thesis further addressed GPU covert channel attacks among concurrent kernels. The proposed technique, called GPUGuard, can dynamically detect and defend against GPU covert channel attacks. The detection algorithm used a decision tree based design that was able to accurately detect covert channel attacks (100% sensitivity in the experiments) with a small (6%) false positive rate. The results of the detection algorithm were then fed to a GPU-specific covert channel elimination scheme called Tangram. Tangram combines warp folding, pipeline slicing and cache remapping mechanisms to close the covert channels with much lower performance overhead than temporal partitioning.

GPUs have been more efficient than CPUs for massively data-parallel applications. Looking into the future, GPUs will face many challenges brought on by emerging application needs. GPUs are likely to face significant hurdles running large deep learning models, which are essentially large graph algorithms. Conventionally, GPUs are only efficient when a learning model completely fits inside their device memory (tens of GB). But as models grow more complex, the model parameters can vastly exceed GPU device memory. As such, communication and synchronization between GPU chips, server nodes and even racks may become inevitable. Chapter 2 demonstrated that GPUs already suffer a huge amount of efficiency loss when there are synchronization and communication costs across SMs, and the load imbalance and resource underutilization issues are likely to be more severe in the presence of communication and synchronization demands. As such,
As such exposing the problem of resource underutilization through com- pilers and GPU runtime systems to the application developer will be the next frontier in sustaining the energy efficiency advantages of GPUs. Eventually we believe that the contributions made in this thesis bring us one step closer to making GPUs much more energy efficient and secure in the future. 143 Reference List [1] Amazon EC2 Elastic GPUs. https://aws.amazon.com/ec2/Elastic-GPUs/. [2] AMD Accelerated Processing Unit. https://en.wikipedia.org/wiki/AMD Acc- elerated Processing Unit. [3] AMD Unveils its Heterogeneous Uniform Memory Access (hUMA) Technology. http://www.tomshardware.com/news/AMD-HSA-hUMA-APU,22324.html. [4] Apache Giraph. https://giraph.apache.org/. [5] Apache Hadoop. http://hadoop.apache.org/. [6] CUSP : A C++ templated sparse matrix library. http://cusplibrary.github.com. [7] The freepdk process design kit. http://www.eda.ncsu.edu/wiki/FreePDK. [8] Geekbench browser. https://browser.primatelabs.com/v4/cpu. [9] Google cloud platform-graphics processing unit. https://cloud.google.com/gpu/. [10] NVIDIA GPU computing SDK. https://developer.nvidia.com/gpu-computing- sdk. [11] The programmer’s guide to the APU galaxy. http://developer.amd.com/word- press/media/2013/06/Phil-Rogers-Keynote-FINAL.pdf. [12] Top 500 supercomputing sites. http://www.top500.org/statistics/list/. [13] Whitepaper: Network covert channels : Subversive secrecy, 2006. https://www.sans.org/reading-room/whitepapers/covert/network-covert-channels- subversive-secrecy-1660. [14] NVIDIA CUDA compute unified device architecture - programming guide, 2008. http://developer.download.nvidia.com. [15] Whitepaper: NVIDIA’s Next Generation CUDA TM Compute Architecture: Fermi TM . Technical report, NVIDIA, 2009. 144 [16] ”GPUs are only upto 14 times faster than CPUs”, says Intel, 2010. https://blogs.nvidia.com/blog/2010/06/23/gpus-are-only-up-to-14-times-faster- than-cpus-says-intel/. [17] Whitepaper: NVIDIA GF100. Technical report, NVIDIA, 2010. [18] Whitepaper: AMD graphics cores next (GCN) architecture. Technical report, AMD, 2012. [19] Whitepaper: NVIDIA’s Next Generation CUDA TM Compute Architecture: Kepler TM GK110. Technical report, NVIDIA, 2012. [20] Whitepaper: NVIDIA GeForce GTX980. Technical report, NVIDIA, 2014. [21] International Technology Roadmap for Semiconductors 2.0 2015 Edition Execu- tive Report. Technical report, 2015. [22] NVIDIA profiler user’s guide, 2015. ”http://docs.nvidia.com/cuda/profiler-users- guide/warp-state-nvvp/”. [23] Whitepaper: NVIDIA GeForce GTX1080-Gaming Perfected. Technical report, NVIDIA, 2016. [24] Whitepaper: NVIDIA Tesla V100 GPU Architecture. Technical report, NVIDIA, 2017. [25] M. Abdel-Majeed and M. Annavaram. Warped register file: A power efficient register file for GPGPUs. InProceedingsoftheInternationalSymposiumonHigh PerformanceComputerArchitecture(HPCA), Feb. 2013. [26] M. Abdel-Majeed, D. Wong, and M. Annavaram. Gating aware scheduling and power gating for GPGPUs. In Proceedings of the International Symposium on Microarchitecture(MICRO), Dec. 2013. [27] M. Abdel-Majeed, D. Wong, J. Kuang, and M. Annavaram. Origami: Folding warps for energy efficient GPUs. In Proceedings of the International Conference onSupercomputing(ICS), June 2016. [28] J. T. Adriaens, K. Compton, N. S. Kim, and M. J. Schulte. The case for GPGPU spatial multitasking. InProceedingsoftheInternationalSymposiumonHighPer- formanceComputerArchitecture(HPCA), Feb. 2012. [29] P. Aguilera, K. Morrow, and N. S. Kim. 
Fair share: Allocation of GPU resources for both performance and fairness. In Proceedings of the International Conference on Computer Design (ICCD), Oct. 2014.
[30] M. Alam and S. Sethi. Detection of information leakage in cloud. CoRR, abs/1504.03539, 2015.
[31] B. O. F. Auer. GPU acceleration of graph matching, clustering, and partitioning. Contemporary Mathematics, 588:223–240, 2013.
[32] B. O. F. Auer and R. H. Bisseling. A GPU algorithm for greedy graph matching. Facing the Multicore-Challenge II, 7174:108–119, 2012.
[33] D. A. Bader, H. Meyerhenke, P. Sanders, and D. Wagner. 10th DIMACS implementation challenge on graph partitioning and graph clustering. American Mathematical Society, 588, 2013.
[34] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), Apr. 2009.
[35] D. Bautista, J. Sahuquillo, H. Hassan, S. Petit, and J. Duato. A simple power-aware scheduling for multicore systems when running real-time applications. In Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS), May 2008.
[36] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16:87–90, 1958.
[37] B. Berger, R. Singht, and J. Xu. Graph algorithms for biological systems analysis. In Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms, pages 142–151, 2008.
[38] S. Borkar. Design challenges of technology scaling. In Proceedings of the International Symposium on Microarchitecture (MICRO), Nov. 1999.
[39] A. Braunstein, M. Mézard, and R. Zecchina. Survey propagation: An algorithm for satisfiability. Random Structures & Algorithms, 27(2):201–226, 2005.
[40] A. Buluç, J. R. Gilbert, and C. Budak. Solving path problems on the GPU. Parallel Computing, 36(5-6):241–253, June 2010.
[41] M. Burtscher, R. Nasre, and K. Pingali. A quantitative study of irregular programs on GPUs. In Proceedings of the International Symposium on Workload Characterization (IISWC), Nov. 2012.
[42] F. J. Cazorla, P. M. W. Knijnenburg, R. Sakellariou, E. Fernandez, A. Ramirez, and M. Valero. Predictable performance in SMT processors: Synergy between the OS and SMTs. IEEE Transactions on Computers (TOC), 55(7):785–799, 2006.
[43] F. J. Cazorla, A. Ramirez, M. Valero, and E. Fernandez. Dynamically controlled resource allocation in SMT processors. In Proceedings of the International Symposium on Microarchitecture (MICRO), Dec. 2004.
[44] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), Feb. 2005.
[45] S. Che, B. M. Beckmann, S. K. Reinhardt, and K. Skadron. Pannotia: Understanding irregular GPGPU graph applications. In Proceedings of the International Symposium on Workload Characterization (IISWC), Sept. 2013.
[46] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the International Symposium on Workload Characterization (IISWC), Oct. 2009.
[47] J. Chen and G. Venkataramani. CC-Hunter: Uncovering covert timing channels on shared processor hardware. In Proceedings of the International Symposium on Microarchitecture (MICRO), Dec. 2014.
[48] L. Chen and T. M. Pinkston. Nord: Node-router decoupling for effective power-gating of on-chip routers.
In Proceedings of the International Symposium on Mi- croarchitecture(MICRO), Dec. 2012. [49] S. Choi and D. Yeung. Learning-based SMT processor resource distribution via hill-climbing. In Proceedings of the International Symposium on Computer Ar- chitecture(ISCA), June 2006. [50] B. W. Coon, J. R. Nickolls, J. E. Lindholm, R. J. Stoll, N. Wang, J. H. Choquette, and K. E. Nickolls. Thread group scheduler for computing on a parallel thread processor, May 2012. US Patent 8732713. [51] T. A. Davis and Y . Hu. The University of Florida sparse matrix collection. ACM Trans.Math.Softw., 38(1):1:1–1:25, Dec. 2011. [52] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clus- ters. CommunicationsoftheACM, 51:107–113, June 2008. [53] L. Domnitser, A. Jaleel, J. Loew, N. Abu-Ghazaleh, and D. Ponomarev. Non- monopolizable caches: Low-complexity mitigation of cache side-channel attacks. ACMTransactionsonArchitectureandCodeOptimization,SpecialIssueonHigh PerformanceandEmbeddedArchitecturesandCompilers, Jan. 2012. [54] A. El-Moursy, R. Garg, D. H. Albonesi, and S. Dwarkadas. Compatible phase co-scheduling on a CMP of multi-threaded processors. In Proceedings of the International Symposium on Parallel and Distributed Processing (IPDPS), Apr. 2006. 147 [55] D. Evtyushkin, D. Ponomarev, and N. Abu-Ghazaleh. Understanding and mitigat- ing covert channels through branch predictors. ACMTransactionsonArchitecture andCodeOptimization(TACO), 13(1):10, 2016. [56] S. Eyerman and L. Eeckhout. Probabilistic job symbiosis modeling for SMT processor scheduling. In Proceedings of the International Conference on Archi- tecturalSupport forProgrammingLanguagesand OperatingSystems(ASPLOS), Mar. 2010. [57] Q. Fang and D. A. Boas. Monte carlo simulation of photon migration in 3d tur- bid media accelerated by graphics processing units. Opt.Express, 17(22):20178– 20190, Oct. 2009. [58] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: simple techniques for reducing leakage power. InProceedingsoftheInternational SymposiumonComputerArchitecture(ISCA), May 2002. [59] R. W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6), June 1962. [60] HSA Foundation. Heterogeneous system architecture (HSA): Architecture and algorithms. In Proceedings of the International Symposium on Computer Archi- tecturetutorial(ISCA), June 2014. [61] W. L. Fung and T. M. Aamodt. Thread block compaction for efficient SIMT con- trol flow. In Proceedings of the International Symposium on High Performance ComputerArchitecture(HPCA), Feb. 2011. [62] B. Gao, T. Wang, and T. Liu. Ranking on large-scale graphs with rich metadata. In Proceedings of the International Conference Companion on World Wide Web (WWW), March 2011. [63] W. Gao, N. T. T. Huyen, H. S. Loi, and Qian K. Real-time 2d parallel windowed fourier transform for fringe pattern analysis using graphics processing unit. Opti- calExpress, 17, Dec. 2009. [64] M. Gebhart, D. R. Johnson, D. Tarjan, S. W. Keckler, W. J. Dally, E. Lindholm, and K. Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceedings of the International Symposium on Com- puterArchitecture(ISCA), June 2011. [65] S. Z. Gilani, N. S. Kim, and M. J. Schulte. Power-efficient computing for compute-intensive GPGPU applications. InProceedingsoftheInternationalSym- posiumonHighPerformanceComputerArchitecture(HPCA), Feb. 2013. 148 [66] S. Grauer-Gray, W. Killian, R. Searles, and J. Cavazos. Accelerating financial applications on the GPU. 
InProceedingsofthe6thWorkshoponGeneralPurpose ProcessorUsingGraphicsProcessingUnits(GPGPU-6), March 2013. [67] C. Gregg, J. Dorn, K. Hazelwood, and K. Skadron. Fine-grained resource sharing for concurrent GPGPU kernels. In Proceedings of the USENIX Conference on HotTopicsinParallelism(HotPar), June 2012. [68] A. V . P. Grosset, P. Zhu, S. Liu, S. Venkatasubramanian, and M. Hall. Evaluating graph coloring on GPUs. In Proceedings of the Symposium on Principles and PracticeofParallelProgramming(PPoPP), Feb. 2011. [69] A. Gundu, G. Sreekumar, A. Shafiee, S. Pugsley, H. Jain, R. Balasubramonian, and M. Tiwari. Memory bandwidth reservation in the cloud to avoid information leakage in the memory controller. In Proceedings of the Workshop on Hardware andArchitecturalSupportforSecurityandPrivacy(HASP), June 2014. [70] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser. Many- core vs. many-thread machines: Stay away from the valley. Computer Architec- tureLetters(CAL), 8(1):25–28, 2009. [71] M. Hamilton. Keynote: The GPU accelerated data center. In GPU Technology Conference, Aug. 2015. [72] K. A. Hawick, A. Leist, and D. P. Playne. Parallel graph component labelling with GPUs and CUDA. ParallelComputing, 36(12):655 – 678, 2010. [73] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: A mapreduce framework on graphics processors. In Proceedings of the International Confer- enceonParallelArchitecturesandCompilationTechniques(PACT), Oct. 2008. [74] A. Herdrich, E. Verplanke, P. Autee, R. Illikkal, C. Gianos, R. Singhal, and R. Iyer. Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family. In Proceedings of the International Symposium on High Perfor- manceComputerArchitecture(HPCA), Mar. 2016. [75] Z. Hu, A. Buyuktosunoglu, V . Srinivasan, V . Zyuban, H. Jacobson, and P. Bose. Microarchitectural techniques for power gating of execution units. InProceedings oftheInternationalSymposiumonLowPowerElectronicsandDesign,(ISLPED), Aug. 2004. [76] J. Huang. Keynote: Leaps in visual computing. In GPU Technology Conference, Aug. 2015. 149 [77] C. Hunger, M. Kazdagli, A. Rawat, A. Dimakis, S. Vishwanath, and M. Tiwari. Understanding contention-based channels and using them for defense. In Pro- ceedings of the International Symposium on High Performance Computer Archi- tecture(HPCA), Feb. 2015. [78] C. Isci, G. Contreras, and M. Martonosi. Live, runtime phase monitoring and prediction on real systems with application to dynamic power management. In ProceedingsoftheInternationalSymposiumonMicroarchitecture(MICRO), Dec. 2006. [79] A. Jaleel, E. Borch, M. Bhandaru, S. Steely, and J. Emer. Achieving non-inclusive cache performance with inclusive caches - temporal locality aware (TLA) cache management policies. In Proceedings of the International Symposium on Mi- croarchitecture(MICRO), Dec. 2010. [80] A. Jaleel, K. Theobald, S. Steely, and J. Emer. High performance cache replace- ment using re-reference interval prediction (RRIP). In Proceedings of the Inter- nationalSymposiumonComputerArchitecture(ISCA), June 2010. [81] H. Jeon and M. Annavaram. Warped-DMR: Light-weight error detection for GPGPU. In Proceedings of the International Symposium on Microarchitecture (MICRO), Dec. 2012. [82] H. Jeon, Y . Xia, and V . K. Prasanna. Parallel exact inference on a CPU-GPGPU heterogenous system. InProceedingsoftheInternationalConferenceonParallel Processing(ICPP), Sept. 2010. [83] Z. H. Jiang, Y . Fei, and D. Kaeli. A complete key recovery timing attack on a GPU. 
In Proceedings of the International Symposium on High Performance ComputerArchitecture(HPCA), Feb. 2016. [84] Q. Jiao, M. Lu, H. P. Huynh, and T. Mitra. Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS. In Proceedings of the Interna- tionalSymposiumonCodeGenerationandOptimization(CGO), Feb. 2015. [85] A. Jog, O. Kayiran, T. Kesten, A. Pattnaik, E. Bolotin, N. Chatterjee, S. W. Keck- ler, M. T. Kandemir, and C. R. Das. Anatomy of GPU memory system for multi- application execution. InProceedingsoftheInternationalSymposiumonMemory Systems(MEMSYS), Oct. 2015. [86] A. Jog, O. Kayiran, A. Mishra, M. Kandemir, O. Mutlu, R Iyer, and C. Das. Orchestrated scheduling and prefetching for GPGPUs. In Proceedings of the In- ternationalSymposiumonComputerArchitecture(ISCA), June 2013. 150 [87] O. Kalentev, A. Rai, S. Kemnitz, and R. Schneider. Connected component label- ing on a 2d grid using CUDA. Journal of Parallel and Distributed Computing, 71(4):615 – 620, 2011. [88] N. Kapre and A. DeHon. Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors. In Proceed- ings of the International Conference on Field Programmable Logic and Applica- tions, Aug. 2009. [89] M. Kayaalp, N. Abu-Ghazaleh, D. Ponomarev, and A. Jaleel. A high-resolution side-channel attack on last-level cache. In Proceedings of the Annual Design Au- tomationConference(DAC), June 2016. [90] O. Kayiran, A. Jog, M. T. Kandemir, and C. R. Das. Neither more nor less: Opti- mizing thread-level parallelism for GPGPUs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), Sept. 2013. [91] S. Kim, D. Chandra, and Y . Solihin. Fair cache sharing and partitioning in a chip multiprocessor architecture. In Proceedings of the International Conference on ParallelArchitecturesandCompilationTechniques(PACT), Sept. 2004. [92] J. Kong, O. Aclicmez, J. Seifert, and H. Zhou. Hardware-software integrated ap- proaches to defend against software cache-based side channel attacks. InProceed- ingsoftheInternationalSymposiumonHighPerformanceComputerArchitecture (HPCA), Feb. 2009. [93] Fung W. L., I. Sham, G. Yuan, and T. A. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the International SymposiumonMicroarchitecture(MICRO), Dec. 2007. [94] N. B. Lakshminarayana and H. Kim. Spare register aware prefetching for graph algorithms on GPUs. In Proceedings of the International Symposium on High PerformanceComputerArchitecture(HPCA), Feb. 2014. [95] J. Lee, V . Sathisha, M. Schulte, K. Compton, and N. S. Kim. Improving through- put of power-constrained GPUs using dynamic voltage/frequency and core scal- ing. InProceedingsoftheInternationalConferenceonParallelArchitecturesand CompilationTechniques(PACT), Oct. 2011. [96] K. Lee and L. Liu. Efficient data partitioning model for heterogeneous graphs in the cloud. In Proceedings of the International Conference for High Performance Computing,Networking,StorageandAnalysis(SC), Nov. 2013. 151 [97] S. Lee, M.and Song, J. Moon, J. Kim, W. Seo, Y . Cho, and S. Ryu. Improv- ing GPGPU resource utilization through alternative thread block scheduling. In Proceedings of the International Symposium on High Performance Computer Ar- chitecture(HPCA), Feb. 2014. [98] V . W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. De- bunking the 100x GPU vs. 
CPU myth: An evaluation of throughput computing on CPU and GPU. In Proceedings of the International Symposium on Computer Architecture(ISCA), June 2010. [99] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V . J. Reddi. Gpuwattch: Enabling energy optimizations in GPGPUs. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2013. [100] C. Li, S. Song, H. Dai, A. Sidelnik, S. Hari, and H. Zhou. Locality-driven dy- namic GPU cache bypassing. In Proceedings of the International Conference on Supercomputing(ICS), June 2015. [101] H. Li, S. Bhunia, Y . Chen, T. N. Vijaykumar, and K. Roy. Deterministic clock gating for microprocessor power reduction. In Proceedings of the International SymposiumonHighPerformanceComputerArchitecture(HPCA), Feb. 2003. [102] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the International Symposium on Microarchitecture(MICRO), Dec. 2009. [103] Y . Liang, H. P. Huynh, K. Rupnow, R. S. M. Goh, and D. Chen. Efficient GPU spatial-temporal multitasking. Transactions On Parallel and Distributed Systems (TPDS), 26(3):748–760, 2014. [104] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA tesla: A unified graphics and computing architecture. IEEEMicro, 28(2):39–55, 2008. [105] F. Liu, Q. Ge, Y . Yarom, F. Mckeen, C. Rozas, G. Heiser, and R. B. Lee. Catalyst: Defeating last-level cache side channel attacks in cloud computing. In Proceed- ingsoftheInternationalSymposiumonHighPerformanceComputerArchitecture (HPCA), Feb. 2016. [106] F. Liu and R. B. Lee. Random fill cache architecture. In Proceedings of the InternationalSymposiumonMicroarchitecture(MICRO), Dec. 2014. 152 [107] F. Liu, Y . Yarom, Q. Ge, G. Heiser, and R. B. Lee. Last-level cache side-channel attacks are practical. In Proceedings of the Symposium on Security and Privacy (SP), May 2015. [108] P. J. Lu, H. Oki, C. A. Frey, G. E. Chamitoff, L. Chiao, E. M. Fincke, C. M. Foale, S. H. Magnus, W. S. McArthur, D. M. Tani, P. A. Whitson, J. N. Williams, W. V . Meyer, R. J. Sicker, B. J. Au, M. Christiansen, A. B. Schofield, and D. A. Weitz. Orders-of-magnitude performance increases in GPU-accelerated correla- tion of images from the international space station. Journal of Real-Time Image Processing, 5(3):179–193, 2010. [109] M. Luby. A simple parallel algorithm for the maximal independent set problem. InProceedingsoftheSymposiumonTheoryofComputing(STOC), May 1985. [110] A. Lumsdaine, D. Gregor, D. Hendrickson, and J. W. Berry. Challenges in parallel graph processing. ParallelProcessingLetters, 17(1):5–20, 2007. [111] A. Lungu, P. Bose, A. Buyuktosunoglu, and D. J. Sorin. Dynamic power gating with quality guarantees. In Proceedings of the international symposium on Low powerelectronicsanddesign(ISLPED), Aug. 2009. [112] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. InProceedings oftheSIGMODInternationalConferenceonManagementofData, June 2010. [113] R. Martin, J. Demme, and S. Sethumadhavan. Timewarp: Rethinking timekeep- ing and performance monitoring mechanisms to mitigate side-channel attacks. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2012. [114] C. Maurice, M. Weber, M. Schwarz, L. Giner, D. Gruss, C. A. Boano, S. Mangard, and K. Rmer. 
Hello from the other side: SSH over robust cache covert channels in the cloud. In Proceedings of the Network and Distributed System Security Symposium(NDSS), Feb. 2017. [115] J. Meng, D. Tarjan, and K. Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance. In Proceedings of the International SymposiumonComputerArchitecture(ISCA), June 2010. [116] T. M. Mitchell. Machine learning. McGraw-Hill Science/Engineering/Math, 1997. [117] A. Munshi. The OpenCL specification. In Khronos OpenCL Working Group, 2008. 153 [118] O. Mutlu and T. Moscibroda. Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems. In Proceedings of the InternationalSymposiumonComputerArchitecture(ISCA), June 2008. [119] H. Naghibijouybari and N. Abu-Ghazaleh. Covert channels on GPGPUs. IEEE ComputerArchitectureLetters, 2016. [120] V . Narasiman, M. Shebanow, C. Lee, Rustam Miftakhutdinov, Onur Mutlu, and Y . N. Patt. Improving GPU performance via large warps and two-level warp scheduling. InProceedingsoftheInternationalSymposiumonMicroarchitecture (MICRO), Dec. 2011. [121] V . M. A. Oliveira and R. A. Lotufo. A study on connected components labeling algorithms using GPUs. In Conference on Graphics, Patterns and Images (SIB- GRAPI), Aug. 2010. [122] D. Page. Partitioned cache architecture as a side-channel defense mechanism. In IACRCrypt.ePrintArch., 2005. [123] S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan. Improving GPGPU concur- rency with elastic kernels. In Proceedings of the International Conference on ArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASP- LOS), Mar. 2013. [124] D. P. Palomar and J. R. Fonollosa. Practical algorithms for a family of waterfilling solutions. In Transactions on Signal Processing (TSP), volume 53, pages 686– 695, Feb. 2005. [125] J. J. K. Park, Y . Park, and S. Mahlke. Chimera: Collaborative preemption for multitasking on a shared GPU. InProceedingsoftheInternationalConferenceon ArchitecturalSupportforProgrammingLanguagesandOperatingSystems(ASP- LOS), Mar. 2015. [126] C. Percival. Cache missing for fun and profit. In Proceedings of the Technical BSDConference(BSDCan), May 2005. [127] J. R. Quinlan. Induction of decision trees. MachineLearning, 1986. [128] M. Qureshi and Y . Patt. Utility-based partitioning: A low-overhead, high- performance, runtime mechanism to partition shared caches. In Proceedings of theInternationalSymposiumonMicroarchitecture(MICRO), Dec. 2006. [129] P. Radojkovi´ c, V . ˇ Cakarevi´ c, M. Moret´ o, J. Verd´ u, A. Pajuelo, F. J. Cazorla, M. Ne- mirovsky, and M. Valero. Optimal task assignment in multithreaded processors: A statistical approach. In Proceedings of the International Conference on Archi- tecturalSupport for ProgrammingLanguagesand OperatingSystems(ASPLOS), Mar. 2012. 154 [130] M. Rhu and M. Erez. Maximizing SIMD resource utilization in GPGPUs with SIMD lane permutation. InProceedingsoftheInternationalSymposiumonCom- puterArchitecture(ISCA), June 2013. [131] M. Rhu, M. Sullivan, J. Leng, and M. Erez. A locality-aware memory hierar- chy for energy-efficient GPU architectures. In Proceedings of the International SymposiumonMicroarchitecture(MICRO), Dec. 2013. [132] T. G. Rogers, D. R. Johnson, M. O’Connor, and S. W. Keckler. A variable warp size architecture. In Proceedings of the International Symposium on Computer Architecture(ISCA), June 2015. [133] T. G. Rogers, M. O’Connor, and T. M. Aamodt. Cache-conscious wavefront scheduling. 
InProceedingsoftheInternationalSymposiumonMicroarchitecture (MICRO), Dec. 2012. [134] M. J. Schulte, M. Ignatowski, G. H. Loh, B. M. Beckmann, W. C. Brantley, S. Gu- rumurthi, N. Jayasena, I. Paul, S. K. Reinhardt, and G. Rodgers. Achieving ex- ascale capabilities through heterogeneous computing. IEEE Micro, 35(4):26–36, July 2015. [135] C. Scordino and G. Lipari. Using resource reservation techniques for power-aware scheduling. InProceedingsoftheinternationalconferenceonEmbeddedsoftware (EMSOFT), Sept. 2004. [136] A. Sethia and S. Mahlke. Equalizer: Dynamic tuning of GPU resources for effi- cient execution. In Proceedings of the International Symposium on Microarchi- tecture(MICRO), Dec. 2014. [137] A. Settle, J. Kihm, A. Janiszewski, and D. Connors. Architectural support for enhanced SMT job scheduling. In Proceedings of the International Conference onParallelArchitectureandCompilationTechniques(PACT), Sept. 2004. [138] A. Shafiee, A. Gundu, M. Shevgoor, R. Balasubramonian, and M. Tiwari. Avoid- ing information leakage in the memory controller with fixed service policies. In ProceedingsoftheInternationalSymposiumonMicroarchitecture(MICRO), Dec. 2015. [139] K. Shirahata, H. Sato, T. Suzumura, and S. Matsuoka. A scalable implemen- tation of a mapreduce-based graph processing algorithm for large-scale heteroge- neous supercomputers. InProceedingsoftheInternationalSymposiumonCluster, CloudandGridComputing, pages 277–284, May 2013. [140] B. A. Shirazi, K. M. Kavi, and A. R. Hurson, editors. Scheduling and Load Bal- ancinginParallelandDistributedSystems. IEEE Computer Society Press, 1995. 155 [141] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous mul- tithreaded processor. In Proceedings of the International Conference on Archi- tecturalSupport forProgrammingLanguagesand OperatingSystems(ASPLOS), Nov. 2000. [142] A. Snavely, D. M. Tullsen, and G. V oelker. Symbiotic jobscheduling with pri- orities for a simultaneous multithreading processor. In Proceedings of the Inter- national Conference on Measurement and Modeling of Computer Systems (SIG- METRICS), June 2002. [143] J. A. Stratton, C. Rodrigues, I. Sung, N. Obeid, L. Chang, N. Anssari, G. D. Liu, and W. M. Hwu. Parboil: A revised benchmark suite for scientific and commercial throughput computing. Technical report, March 2012. [144] L. G. Szafaryn, K. Skadron, and J. J. Saucerman. Experiences accelerating MATLAB systems biology applications. In Proceedings of the Workshop on Biomedicine in Computing (BiC) at the International Symposium on Computer Architecture(ISCA), June 2009. [145] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero. En- abling preemptive multiprogramming on GPUs. In Proceedings of the Interna- tionalSymposiumonComputerArchitecuture(ISCA), June 2014. [146] Y . Tian, S. Puthoor, J. L. Greathouse, B. M. Beckmann, and D. A. Jim´ enez. Adap- tive GPU cache bypassing. In Proceedings of the Workshop on General Purpose ProcessingUsingGPUs(GPGPU), Feb. 2015. [147] R. Tonge, F. Benevolenski, and A. V oroshilov. Mass splitting for jitter-free paral- lel rigid body simulation. volume 31, pages 105:1–105:8, July 2012. [148] Y . Ukidave, C. Kalra, D. R. Kaeli, P. Mistry, and D. Schaa. Runtime support for adaptive spatial partitioning and inter-kernel communication on GPUs. In Proceedings of the International Symposium on Computer Architecture and High PerformanceComputing(SBAC-PAD), Oct. 2014. [149] A. S. Vaidya, A. Shayesteh, D. H. Woo, R. Saharoy, and M. Azimi. 
SIMD diver- gence optimization through intra-warp compaction. In Proceedings of the Inter- nationalSymposiumonComputerArchitecture(ISCA), June 2013. [150] L. G. Valiant. A bridging model for parallel computation. Communicationsofthe ACM, 33:103–111, Aug. 1990. [151] V . Varadarajan, Y . Zhang, T. Ristenpart, and M. Swift. A placement vulnerabil- ity study in multi-tenant public clouds. In Proceedings of the USENIX Security Symposium(USENIXSecurity), Aug. 156 [152] Vineet , V . and Narayanan, P. J. CUDA cuts: Fast graph cuts on the GPU. In Proceedings of the IEEE Computer Society Conference on Computer Vision and PatternRecognitionWorkshops(CVPRW), June 2008. [153] Vineet, V . and Narayanan, P. J. CUDA Cuts: Fast graph cuts on the GPU. In TechnicalReport,InternationalInstituteofInformationTechnology,Hyderabd. [154] J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili. Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs. InProceedingsoftheInternationalSymposiumonComputerArchitecture (ISCA), June 2015. [155] L. Wang, M. Huang, and T. El-Ghazawi. Exploiting concurrent kernel execution on graphic processing units. In Proceedings of the International Conference on HighPerformanceComputingandSimulation(HPCS), pages 24–32, 2011. [156] P. Wang, C. Yang, Y . Chen, and Y . Cheng. Power gating strategies on GPUs. ACM TransactionsonArchitectureandCodeOptimization(TACO), 2011. [157] R. Wang and L. Chen. Futility scaling: High-associativity cache partitioning. In ProceedingsoftheInternationalSymposiumonMicroarchitecture(MICRO), Dec. 2014. [158] Y . Wang, S. Roy, and N. Ranganathan. Run time power gating in caches of GPUs for leakage energy savings. InProceedingsoftheDesign,AutomationandTestin EuropeConferenceandExhibition(DATE), March 2012. [159] Z. Wang and R. B. Lee. New constructive approach to covert channel modeling and channel capacity estimation. In Proceedings of the International Conference onInformationSecurity(ISC), 2005. [160] Z. Wang and R. B. Lee. Covert and side channels due to processor architecture. InAnnualComputerSecurityApplicationsConference(ACSAC), volume 6, pages 473–482, 2006. [161] Z. Wang and R. B. Lee. New cache designs for thwarting software cache-based side channel attacks. InProceedingsoftheInternationalSymposiumonComputer Architecture(ISCA), June 2007. [162] Z. Wang and R. B. Lee. A novel cache architecture with enhanced performance and security. InProceedingsoftheInternationalSymposiumonMicroarchitecture (MICRO), Dec. 2008. [163] Z. Wang, J. Yang, R. Melhem, B. Childers, Y . Zhang, and M. Guo. Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing. In Proceedings of the International Symposium on High Performance Computer Architecture(HPCA), Mar. 2016. 157 [164] P. Xiang, Y . Yang, and H. Zhou. Warp-level divergence in GPUs: Characteriza- tion, impact, and mitigation. In Proceedings of the International Symposium on HighPerformanceComputerArchitecture(HPCA), Feb. 2014. [165] X. Xie, Y . Liang, Y . Wang, G. Sun, and T. Wang. Coordinated static and dynamic cache bypassing for GPUs. In Proceedings of the International Symposium on HighPerformanceComputerArchitecture(HPCA), Feb. 2015. [166] Q. Xu and M. Annavaram. PATS: Pattern aware scheduling and power gating for GPGPUs. InProceedingsoftheInternationalConferenceonParallelArchitecture andCompilationTechniques(PACT), Aug. 2014. [167] Q. Xu, H. Jeon, and M. Annavaram. Graph processing on GPUs: Where are the bottlenecks? 
In Proceedings of the International Symposium on Workload Characterization(IISWC), Oct. 2014. [168] Q. Xu, H. Jeon, K. Kim, W. W. Ro, and M. Annavaram. Warped-slicer: Efficient intra-sm slicing through dynamic resource partitioning for GPU multiprogram- ming. In Proceedings of the International Symposium on Computer Architecture (ISCA), June 2016. [169] M. Yan, Y . Shalabi, and J. Torrellas. Replayconfusion: Detecting cache-based covert channel attacks using record and replay. InProceedingsoftheInternational SymposiumonMicroarchitecture(MICRO), Dec. 2016. [170] M. K. Yoon, K. Kim, S. Lee, W. W. Ro, and M. Annavaram. Virtual Thread: Max- imizing thread-level parallelism beyond GPU scheduling limit. InProceedingsof theInternationalSymposiumonComputerArchitecture(ISCA), June. 2016. [171] W. S. Yu, R. Huang, S. Q. Xu, S. Wang, E. Kan, and G. E. Suh. SRAM-DRAM hybrid memory with applications to efficient register files in fine-grained multi- threading. InProceedingsoftheInternationalSymposiumonComputerArchitec- ture(ISCA), June 2011. [172] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha. Memguard: Mem- ory bandwidth reservation system for efficient performance isolation in multi-core platforms. In Proceedings of the Real-Time and Embedded Technology and Ap- plicationsSymposium(RTAS), Apr. 2013. [173] J. Zhong and B. He. Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling. TransactionsonParallelandDistributedSystem (TPDS), 25(6):1522–1532, 2014. [174] Y . Zhou, S. Wagh, P. Mittal, and D. Wentzlaff. Camouflage: Memory traffic shaping to mitigate timing attacks. InProceedingsoftheInternationalSymposium onHighPerformanceComputerArchitecture(HPCA), Feb. 2017. 158
Abstract
The graphics processing unit (GPU) is the computing platform of choice for many massively parallel applications, including high performance scientific computing, machine learning and artificial intelligence. However, GPU energy efficiency has been significantly curtailed by severe resource underutilization. This thesis first presents a solution that improves energy efficiency by power gating unused resources: pattern aware two-level scheduling (PATS) is proposed to cope with divergent execution patterns and improve power gating efficiency. Power gating alone is insufficient, however, because resources built into the GPU that sit idle still represent wasted investment. Concurrent kernel execution offers a path to resolving the resource underutilization issue by co-executing kernels with complementary resource usage demands, but it also brings to the forefront a vulnerability to covert-channel attacks. This thesis therefore addresses two facets of concurrent kernel execution: first, dynamic intra-SM slicing to enable efficient sharing of resources within an SM across a scalable number of kernels; second, a machine learning based intra-SM defense scheme that can reliably close the covert channels.
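To make the co-execution idea concrete, the following sketch shows the programmer-visible half of concurrent kernel execution: two hypothetical kernels with complementary demands, one arithmetic-bound and one bandwidth-bound, launched into separate CUDA streams so the hardware is free to overlap them. This is only an assumed illustration using the standard CUDA stream API; the kernel names are placeholders, and the intra-SM slicing and defense mechanisms proposed in this thesis operate below this interface, inside the SM schedulers.

    #include <cuda_runtime.h>

    // Hypothetical kernels with complementary resource demands,
    // used only to illustrate concurrent launch; not workloads from this thesis.
    __global__ void computeHeavy(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = (float)i;
            for (int k = 0; k < 512; ++k)      // arithmetic-bound loop
                v = v * 1.000001f + 0.5f;
            out[i] = v;
        }
    }

    __global__ void memoryHeavy(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];             // bandwidth-bound copy
    }

    int main() {
        const int n = 1 << 20, block = 256, grid = (n + block - 1) / block;
        float *a, *b, *c;
        cudaMalloc(&a, n * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&c, n * sizeof(float));
        cudaMemset(b, 0, n * sizeof(float));

        // Independent streams remove the false dependence between the two
        // kernels; whether they truly share SMs is up to the hardware scheduler.
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        computeHeavy<<<grid, block, 0, s1>>>(a, n);
        memoryHeavy<<<grid, block, 0, s2>>>(b, c, n);
        cudaDeviceSynchronize();

        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

On current hardware the co-scheduling decision is opaque to the programmer; that gap between what the stream API expresses and how SM resources are actually shared is precisely where the intra-SM slicing and covert-channel defenses studied in this thesis sit.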