DEMAND BASED TECHNIQUES TO IMPROVE THE ENERGY EFFICIENCY OF THE EXECUTION UNITS AND THE REGISTER FILE IN GENERAL PURPOSE GRAPHICS PROCESSING UNITS

by

Mohammad Abdel-Majeed

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

May 2016

Copyright 2016 Mohammad Abdel-Majeed

Dedication

This dissertation is gratefully dedicated to:
My beloved parents
My adorable sister
My great brother
My noble wife

Acknowledgements

I would like to thank God for the endless help and support that I received from many people around me during my PhD. This is a dream that has come true because of their unlimited support. It is with pleasure that I acknowledge those who contributed to my success.

A lot of credit and appreciation goes first and foremost to my PhD advisor, Prof. Murali Annavaram. His guidance, patience and the freedom he gave me in my early PhD stages to explore different research areas are the reason behind the success and the confidence that I have right now. I would also like to acknowledge my qualification and defense exam committee members, Prof. Massoud Pedram, Prof. Jeff Draper, Prof. Michel Dubois, Prof. Hao Li and Prof. Aiichiro Nakano, for joining my exams and for their insightful feedback. I would also like to acknowledge Prof. Sandeep Gupta, Prof. Alice Parker and Prof. Ali Zadeh, who introduced me to VLSI.

I would also like to acknowledge the memory architecture group members at Intel Labs: Chris Wilkerson, Alaa Alamaldeen, Seth Pugsley, Zeshan Chishti, James Greensky, Tom, Prashant Nair and Jinchun Kim. Working with such great and highly qualified teammates is a great honor for me. I would like to express my gratitude to Chris Wilkerson for being a great mentor during my internship at Intel Labs. I learned a lot from you as a person and as a mentor.

I would like to thank my lab mates who have been an important part of my PhD journey: Waleed Dweik, Daniel Wong, Hyeran Jeon, Abdulaziz Tabbakh, Krishna Giri Narra, Gunjae Koo, Kumar, Bardia Zandian and Melina Demertzi. I will always remember the ups and downs that we had during our PhD while doing research, hacking the simulators and waiting for the conference reviews. I would also like to acknowledge my friends and collaborators in other research groups: Lihang Zhao, Aditya Deshpande, Woojoo Lee, Saurabh Hukerikar, Gopi Neela, Lizhong Chen, Yanzhi Wang, Qing Xie and Sang Wook Do. Special thanks to Daniel Wong, Hyeran Jeon, Mohammad Javad Dousti and Alireza Shafaei for their help and for being a great source of motivation and encouragement during my PhD journey.

I would like to thank my friends Waleed, Anas, Laith, Daoud, Abdulla, Ayman, Wael, Wajih, Tawfig, Hassan and Yasser, who were always ready to help and support, and to thank them for being great friends. Spending time with them was always a source of relaxation from the PhD work. I would like to thank my friend Mohammad Ababneh for his continuous support, and Ramzi Saifan for being supportive and for his advice in my PhD and personal life.

I would like to thank the Castillo family (Mr. Jesse, Mrs. Maritza, Kenny and Brian) for hosting me for the first few years of my studies. I really enjoyed living with them and being part of their family for years. They made my move and stay in the US easy, peaceful and joyful. I am really gifted to have such a family in the US.
Of course, I will not hesitate to visit them each time I have a chance to come to the US.

Last but not least, I would like to acknowledge my family, to whom I owe an eternal debt of gratitude. I am deeply grateful to my father, Rajab Abdel-Majeed, and my precious mother, Wafa' Al-Amiri, for the endless help, support, prayers and sacrifices they made for me throughout each step in my life. I cannot imagine making it this far without them. I am also deeply grateful to my amazing sister, Lama Abdel-Majeed, who is always willing to listen and advise me without any complaint. I would like to thank her for the continuous support and encouragement. Many thanks to my younger brother Ahmad Abdel-Majeed, who is always there for me. We started the journey together and we did not disappoint at the end.

Finally, I would like to thank my wonderful wife, Lina Tayyem, who joined me for the last two years of my PhD. She was always there for me during the bad days and the good days. Her presence made my life peaceful and enjoyable. Her smile and excitement for every single achievement I made is a source of encouragement for me.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 The quest for power efficient GPGPUs
    1.1.1 Contributions

2 Background
  2.1 GPU Background
    2.1.1 GPU Programming
    2.1.2 Baseline GPU Architecture
  2.2 Leakage Power
  2.3 Power Gating

3 Warped-Gates: Gating Aware Scheduling and Power Gating for GPGPUs
  3.1 Introduction
  3.2 Power Gating Challenges in GPUs
    3.2.1 Need for Longer Idle Periods
  3.3 Gating-Aware Two-Level Scheduler (GATES)
    3.3.1 GATES Implementation Issues
  3.4 Blackout Power Gating
    3.4.1 Reducing Worst Case Blackout Impact with Adaptive Idle Detect
  3.5 Architectural Support
  3.6 Evaluation
    3.6.1 Evaluation Methodology
    3.6.2 Increasing Power Gating Opportunities
    3.6.3 Energy Impact
    3.6.4 Performance Impact
    3.6.5 Hardware Overhead
    3.6.6 Sensitivity to Power Gating Parameters
    3.6.7 Related Work
  3.7 Summary

4 Origami: Fine Grain Power Gating Techniques for GPGPUs
  4.1 Introduction
  4.2 Prevalence of Pipeline Bubbles
  4.3 Origami: Converting Pipeline Bubbles into Energy Savings Opportunity
    4.3.1 Warp Folding
    4.3.2 Origami Scheduler
    4.3.3 Optimizations
    4.3.4 Architectural Support
  4.4 Evaluation
    4.4.1 Evaluation Methodology
    4.4.2 Energy Impact
    4.4.3 Performance Impact
    4.4.4 Sensitivity Studies
    4.4.5 Hardware Overhead
  4.5 Related Work
  4.6 Summary

5 Warped Register File: A Power Efficient Register File for GPGPUs
  5.1 Introduction
  5.2 Opportunities for Register File Power Savings
  5.3 Reducing Register File Leakage Power
    5.3.1 Architectural Support for Tri-mode Operation
    5.3.2 Architectural Support for Reducing Drowsy Wakeup Latency
  5.4 Reducing Dynamic Power with Active Mask Aware Gating
    5.4.1 Architectural Support
  5.5 Evaluation
    5.5.1 Evaluation Methodology
    5.5.2 Leakage Power Savings with TRIC
    5.5.3 Dynamic Power Savings with COMA
    5.5.4 Warped Register File
    5.5.5 Area and Performance Overhead
  5.6 Related Work
  5.7 Summary

6 Energy Efficient Partitioned Register File for GPGPUs
  6.1 Introduction
  6.2 Register File Access Behavior in GPUs
  6.3 Partitioned Register File
    6.3.1 Architectural Support
  6.4 SRAM Cell Design in Sub-10nm
    6.4.1 Design Space Exploration
    6.4.2 Dual-Gate FinFET
  6.5 Partitioned Register File Design Using FinFET
    6.5.1 SRF Using FinFET
    6.5.2 FRF Using FinFET
  6.6 Evaluation
    6.6.1 Evaluation Methodology
    6.6.2 Proposed Register File Characteristics
    6.6.3 Energy Savings
    6.6.4 Performance Overhead
    6.6.5 Partitioned vs Hierarchical Register Files
  6.7 Related Work
  6.8 Summary
7 Conclusion

Bibliography

List of Tables

2.1 SRAM leakage current and DRV scaling trend
5.1 Workloads' register requirements
5.2 Simulation parameters
6.1 Benchmarks runtime information
6.2 Characteristics of the SRAM cells built in 7nm FinFET
6.3 Experimental setup
6.4 Size, access energy and leakage power for the baseline and proposed register file

List of Figures

2.1 GPU software model
2.2 GTX480 SM architecture
2.3 Baseline schedulers
2.4 CMOS leakage current
2.5 Power gating overview
3.1 Power breakdown for execution units
3.2 Idle period length distribution with 5 cycles idle-detect and 14 cycles break-even time for hotspot
3.3 Effect of warp scheduler on idle cycles
3.4 GPGPU workload characteristics
3.5 Critical wakeup correlation
3.6 Architectural support for GATES, blackout, and adaptive idle detect
3.7 Increasing power gating opportunity for integer units. Floating point units exhibit similar trends.
3.8 Static energy impact of proposed techniques
3.9 Performance impact
3.10 Sensitivity to BET and wakeup delay
4.1 Percentage of bubbles shorter than 10 cycles normalized to the total execution time
4.2 Workloads instruction type breakdown
4.3 Origami consists of the Origami scheduler and Warp Folding
4.4 Effect of Warp Folding on SIMT lanes activity
4.5 Threads activity breakdown
4.6 Reordering and lane-shifting effects on power gating opportunities
4.7 Warp Folding execution steps and detailed threads activity
4.8 Modified GPU pipeline
4.9 Execution units leakage energy savings
4.10 Execution units power gating overhead
4.11 Percentage of time Warp Folding is enabled
4.12 Execution time normalized to the Warped-Gates technique [15]
5.1 GPGPU core pipeline
5.2 Registers inter-access cycle count
5.3 Warp utilization breakdown
5.4 SRAM cell leakage current in drowsy mode with different safety margins normalized to Vdd leakage current
5.5 The proposed register file with the tri-modal control unit (TRIC) and the coordinated mask aware control unit (COMA) integrated
5.6 Schematic of the divided wordline
5.7 Leakage power, dynamic power and total power savings
5.8 Performance degradation with drowsy wake-up latency of 2 (Drowsy 2 cycle) and 3 (Drowsy 3 cycle) cycles
6.1 Delay of a 40-stage FO4 inverter chain vs. Vdd for 7nm FinFET technology
6.2 Percentage of accesses to the top N highly accessed registers
6.3 Baseline and proposed RFs
6.4 Kernel execution timeline
6.5 Registers distribution between the FRF and SRF
6.6 The swapping table content before and after the pilot warp finishes execution
6.7 The structure of the FinFET device model
6.8 Schematic for the modified decoder and the SRAM cell structure
6.9 Proposed register file access distribution
6.10 Energy savings
6.11 The execution time of the proposed ideas
6.12 Scalability of the RFC and partitioned register file

Abstract

As Dennard scaling has slowed down, the growth in transistor density has outpaced our ability to reduce supply voltage proportionally. As such, the power consumption of computing devices has become a first order design constraint, and processor design has shifted toward integrating more parallel execution resources that run at lower speed. To take advantage of the available hardware parallelism, software development models have also emerged that focus on exposing thread level parallelism (TLP) to the hardware. By combining these trends, new computer system designs enable the execution of a massive number of concurrent threads with limited hardware overhead. Graphics processing units (GPUs) have become the de facto chip design for exploiting massive TLP at relatively high performance per Watt, also termed energy efficiency. Due to their high throughput computing capability, coupled with support for programming languages that expose TLP, such as CUDA and OpenCL, GPUs have dominated the parallel computing space.

Given their impressive energy efficiency, GPUs are now being adopted for executing general purpose applications in addition to multimedia applications, and significant effort is expended to port general purpose applications to run on GPUs. In spite of this effort, our research, corroborated by other concurrent research studies, shows that parallelism still varies widely across general purpose applications. As a result, the large number of execution resources in GPUs also exhibits widely varying usage, and this varying resource utilization compromises the GPU's energy efficiency. The goal of this dissertation is to improve the energy efficiency of GPUs in the presence of dynamically varying resource utilization. In this dissertation we demonstrate that the GPU execution model has unique features that can be exploited to tackle the varying resource utilization concern. Accordingly, this dissertation presents three techniques to achieve its stated goal.
The first part of the dissertation focuses on reducing the leakage power of the execution units in GPUs using a technique called Warped-Gates. Leakage energy is a growing concern for chip designs due to small operating voltages. Our analysis shows that leakage energy is 50% of the total GPU energy, and that execution unit leakage energy alone is 10% of the total GPU energy. In traditional CPU designs, the leakage power of a circuit block is curtailed by simply power gating the block, where the supply voltage to the block is removed. Power gating is beneficial only when a block can be turned off for a sufficiently long period of time to surpass the gating overheads. Our analysis shows that applying conventional power gating to GPU execution units is ineffective. The primary reason is that the warp scheduler in a GPU, which schedules a collection of threads (also called a warp or wavefront, typically 32 threads), is agnostic to the current power gating state of the execution units. As a result, execution units frequently move between active and inactive states, thereby curtailing power gating opportunities. We propose Warped-Gates to improve coarse grain power gating efficiency in GPUs. Warped-Gates modifies the scheduler to give higher priority to warps that use the same type of execution resources, so as to elongate the active and idle times of each execution unit type; thus Warped-Gates increases the power gating efficiency of GPU execution units. Furthermore, Warped-Gates modifies the power gating state machine to eliminate the scenarios where a power gated unit switches to the ON state before compensating for the power gating overhead. Combining these techniques, Warped-Gates enhances power gating capabilities in GPU execution units.

While Warped-Gates is efficient in gating an entire block of execution units, it does not exploit fine grain power gating opportunities. For example, it is not able to take advantage of the short pipeline bubbles that exist in the pipeline. Hence, we propose the Origami technique to enable power gating at a fine grain level. Origami relies on warp folding to create power gating opportunities. Warp folding essentially splits a single block of 32 threads into sub-warps, each with fewer than 32 threads. When warp folding is enabled, multiple sub-warps are scheduled to run only on the lower order execution lanes, thereby leaving the higher order execution lanes idle for longer periods of time. Origami takes into consideration an application's execution phase as well as its resource demands to decide when it is most opportune to enable folding. Origami significantly reduces the leakage power of the GPU's execution units with negligible performance overhead.

The second part of the dissertation targets the power efficiency of the register file in GPUs. The register file in GPUs is very large, even larger than the cache, in order to enable fast switching between concurrent threads. However, our register file access analysis shows that two successive accesses to the same register are separated by hundreds of cycles. This large inter-access latency is a result of the large number of available registers and the large number of concurrent active threads. Furthermore, given the large register file, some general purpose applications are even unable to use all the available registers. We propose the Warped-Register File design, where each register in the GPU register file is augmented with a tri-modal switch.
The tri-modal switch enables the register to switch between the OFF, drowsy and ON states based on the register usage mode. In addition, we enhance the register file design to allow narrow-width register access, where only a subset of the 32 thread registers is activated on a single read operation. The set of active registers is determined based on the active state of each thread in the warp. Thus, whenever there is branch divergence, memory divergence or insufficient parallelism, only a subset of the registers is activated using our proposed design.

In the last part of the dissertation we look ahead to future technologies to explore new register file designs based on FinFET technology. While the Warped-Register File is effective in reducing the register file's dynamic and leakage power, the energy per access of the register file is still high. In order to solve this issue we propose a hierarchical register file design built using FinFETs. The proposed mechanism divides the register file into two partitions, a slow register file (SRF) and a fast register file (FRF). The partitioning of the register file is based on the observation that a small portion of the registers assigned to each thread is accessed the majority of the time. To exploit this observation we propose allocating the highly accessed registers in the FRF and the remaining registers in the SRF. In order to identify the slow and fast registers we propose the pilot warp technique, which is based on the feature that all the threads in the same kernel execute the same code. Hence, the pilot warp collects statistics from the early running warps, and this information is used to optimize the register allocation of future warps to improve power efficiency.

Chapter 1
Introduction

1.1 The quest for power efficient GPGPUs

Graphics processing units (GPUs) are massively parallel processors that are designed to run multimedia applications with thousands of concurrent threads. By using a SIMT (single instruction multiple threads) execution model, GPUs can execute the same instruction concurrently with hundreds of different data operands. The simplified control logic coupled with massive parallelism can achieve hundreds of GFLOPs of peak throughput at low power. Due to their high throughput and excellent performance/watt, GPUs are being re-architected to run applications beyond traditional multimedia, such as modeling of physical phenomena and large scale data analytics [23, 86, 96]. With the introduction of CUDA [32] and OpenCL [8], the vast number of parallel execution resources on a GPU chip are made easily accessible to the application programmer in a hardware-independent manner. Many application developers have thus been attracted to this execution model, and significant effort has been expended to port current applications to GPUs.

In order to achieve high performance/watt, GPUs rely on fast thread switching to tolerate long latency operations, rather than hiding long latency through complex hardware instruction reordering. GPUs use a significant amount of parallel hardware resources to support hundreds or even thousands of thread contexts. In particular, GPUs are equipped with a very large register file, a large number of execution units and high memory bandwidth.
The large register file holds the context of thousands of threads at the same time, enabling fast context switching between threads without the need to save and restore architected state from memory. The large number of execution units supports concurrent execution of hundreds of threads each cycle. A high bandwidth memory system provides the data necessary to keep the execution units busy. When an application developer exposes sufficient parallelism, these parallel resources in fact provide immense throughput with high power efficiency.

However, operating a large number of hardware resources consumes significant dynamic and leakage power. This problem will get even worse in the future: the reduction in supply voltage has slowed in recent years, limiting the dynamic power scaling ability of the transistor, and the reduction in threshold voltage is leading to a significant increase in leakage power. As a result, GPU power consumption is receiving significant attention from industry and the academic community [39, 42, 44, 45, 62, 81, 90, 101]. The large power consumption may be acceptable if all the resources are put to good use to compute many threads in parallel. But achieving 100% resource utilization in GPUs is a significant challenge, particularly when general purpose applications are ported to run on GPUs. Several recent works [53, 62] showed that there is a wide variation in resource utilization when GPUs are stretched to run general purpose computations. This variation leads to resource underutilization in GPUs. However, even when there is insufficient parallelism, current GPU hardware continues to burn significant power, thereby compromising performance/watt.

This dissertation tackles the problem of sustaining high power efficiency even in the presence of irregular parallelism and the resulting resource underutilization. In particular, the dissertation tackles the static and dynamic power consumption of the two largest hardware components within GPUs, the register file and the execution units. In [62], the authors showed a component-level power breakdown for the NVidia GTX480 GPU. Their results showed that execution units consume 20.1% of the total platform power, followed by memory (17.8%) and the register file (13.4%). In terms of area, register files and execution units occupy 73% of the chip area. Thus we focus on these two components to save GPU power. Ideally, power saving techniques should be able to effectively save GPU power without degrading the performance of the GPU. As such, in this dissertation we provide insights into the unique features of the GPU execution model and then demonstrate how these unique features can be exploited to improve power efficiency.

The first part of the dissertation focuses on reducing the leakage power of the execution units. Traditionally, leakage power is curtailed by gating off the power supply to any block that consumes significant leakage energy. Power gating typically requires long idle periods so as to compensate for the power gating overheads. Our analysis shows that traditional power gating techniques are not well suited for GPUs for the following reasons. First, the warp scheduler that schedules groups of threads (called warps or wavefronts) does not take into consideration the type of the scheduled instruction. As a result, different instruction types, such as integer and floating point instructions, are interspersed in the schedule, and this fine grain interleaving causes the idle periods of the integer and floating point execution units to be short lived.
Second, the scheduler does not take into consideration the power gating state of the execution units before issuing an instruction. For instance, an integer execution unit that has entered a power gated state is woken up whenever an integer instruction is selected for execution. What is interesting to note is that the warp scheduler has access to a pool of warps that have ready instructions. The pool is large enough to enable the scheduler to select the best candidate warp to schedule so as to improve gating efficiency.

Inspired by these observations, we propose the Warped-Gates framework, which enables coarse grain power gating. Warped-Gates modifies the scheduler so that it gives priority to instructions of the same type as those scheduled in the previous cycle. For example, INT instructions keep being scheduled as long as integer instructions are available before switching to FP instructions, and vice versa. Also, to avoid scenarios where a power gated unit has not been gated long enough to compensate for the power gating overhead, Warped-Gates uses the blackout technique, which modifies the power gating state machine to eliminate such transitions.

While Warped-Gates is efficient in gating an entire block of execution units, it does not exploit fine grain power gating opportunities. For example, it is not able to take advantage of the short pipeline bubbles that exist in the pipeline. Hence, we propose the Origami technique to enable power gating at a fine grain level. Origami relies on warp folding to create power gating opportunities. Warp folding essentially splits a single block of 32 threads into sub-warps, each with fewer than 32 threads. When warp folding is enabled, multiple sub-warps are scheduled to run only on the lower order execution lanes, thereby leaving the higher order execution lanes idle for longer periods of time. Origami takes into consideration an application's execution phase as well as its resource demands to decide when it is most opportune to enable folding. Origami significantly reduces the leakage power of the GPU's execution units with negligible performance overhead.

The second part of the dissertation deals with register file power consumption. We first motivate our work with a comprehensive analysis of the register file access behavior in GPUs. Our analysis shows that, first, not all the registers in the register file are assigned to the running threads. Second, two successive accesses to the same register are separated by hundreds of cycles; this large inter-access latency is a result of the large number of available registers and the large number of concurrent active threads. Third, due to the underutilization of thread activity, the registers of inactive threads are unnecessarily read. Finally, while each thread is assigned a set of registers, our runtime analysis of a wide range of GPU workloads shows that not all the allocated registers are equally accessed: some registers are accessed more than others during the application runtime.

Based on these observations we propose the Warped-Register File technique to reduce register file leakage power. The Warped-Register File is a unified solution that tackles various leakage power inefficiencies. It uses a tri-modal switch attached to each GPU register to switch between the ON, OFF and drowsy power states based on the demand for the register (a scheme we call TRIC).
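To make the tri-modal policy concrete, the following is a minimal behavioral sketch, in Python, of the per-register state control just described. The state names follow the text, while the one-cycle drowsy wakeup latency and the method names are illustrative assumptions, not the actual circuit behavior.

    # Behavioral sketch of TRIC's per-register tri-modal policy (illustrative).
    OFF, DROWSY, ON = "off", "drowsy", "on"

    class TricRegister:
        def __init__(self, allocated):
            # Unallocated registers stay OFF; allocated ones default to drowsy.
            self.state = DROWSY if allocated else OFF

        def access(self):
            # A read/write raises the register to full Vdd; waking from
            # drowsy costs a small latency (1 cycle is an assumed figure).
            assert self.state != OFF, "unallocated registers are never accessed"
            wakeup_cycles = 1 if self.state == DROWSY else 0
            self.state = ON
            return wakeup_cycles

        def access_done(self):
            # Drop back to the low-leakage drowsy state after the access.
            self.state = DROWSY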
We then propose the COMA (coordinated mask aware) register read design. COMA modifies the design of the register file to enable narrow-width register access, exploiting the fact that not all the registers need to be read all the time due to various branch and memory divergence related bottlenecks in GPUs.

While the Warped-Register File is effective in reducing leakage and dynamic energy, the energy per access of the register file is still high. Moreover, register file access energy is expected to increase as GPUs are built with larger register files [14]. To reduce register file energy we propose the Pilot-Register File technique, designed to work most effectively with future technologies based on FinFETs. We propose a hierarchical register file design built using FinFETs. The proposed mechanism divides the register file into two partitions, a slow register file (SRF) and a fast register file (FRF). The partitioning of the register file is based on the observation that a small portion of the registers assigned to each thread is accessed the majority of the time. To exploit this observation we propose allocating the highly accessed registers in the FRF and the remaining registers in the SRF. In order to identify the slow and fast registers we propose the pilot warp technique, which is based on the feature that all the threads in the same kernel execute the same code. Hence, the pilot warp collects statistics from the early running warps, and this information is used to optimize the register allocation of future warps to improve power efficiency.

1.1.1 Contributions

To fulfill our dissertation goals we made the following contributions to improve the power efficiency of the register file and the execution units in GPUs.

At the execution unit level the dissertation makes the following contributions:

Enable a coarse grain power gating technique called Warped-Gates: Leakage power of the execution units is a concern because of the large number of execution cores on the GPU chip and the growing importance of leakage power with technology scaling. Power gating is a common technique for reducing leakage power, but it requires long and uninterrupted idle periods to be effective. Our empirical evaluations show that idle periods are too short in current GPUs, since the warp scheduler is agnostic to execution unit usage patterns. In order to improve the power gating potential we propose Warped-Gates, a combination of microarchitectural enhancements that enable execution unit gating and new scheduling policies that exploit these hardware enhancements. At the scheduler level we propose a gating aware scheduler named GATES. GATES prioritizes issuing warps that use the same execution resources as previous warps, thereby increasing the continuous usage of a single execution resource, which creates longer idle windows for the unused execution resources (see the sketch below). We then enhance GATES with a GPU-specific power gating state machine, called Blackout, which forces any gated execution unit to stay in the gated state until the power gating overhead is compensated. Finally, we propose an adaptive idle detect technique that dynamically shrinks and stretches the length of time a unit must be idle before it is power gated. This approach can save 1.5X the leakage energy of the execution units with less than 1% performance overhead when compared with conventional power gating techniques.
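The following is a minimal sketch, in Python, of the GATES selection idea referenced above. It assumes a simplified model in which each ready warp is tagged with the execution unit type its next instruction needs; the function name and data layout are ours, not the actual scheduler implementation.

    # Sketch of gating-aware issue: keep issuing the currently "hot"
    # instruction type so the other unit types accumulate long idle windows.
    def gates_select(ready_warps, current_type):
        # ready_warps: list of (warp_id, unit_type) with unit_type in
        # {"INT", "FP", "SFU", "LDST"}; current_type is last cycle's type.
        same_type = [w for w in ready_warps if w[1] == current_type]
        if same_type:
            return same_type[0]    # stay on the same execution unit type
        if ready_warps:
            return ready_warps[0]  # forced switch: no same-type warp is ready
        return None                # nothing to issue this cycle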
Enable fine grain power gating techniques using Origami: Another opportunity to power gate a SIMT lane, which Warped-Gates cannot exploit, arises from the short pipeline bubbles within execution units. The GPU pipeline witnesses such short bubbles because SIMT lanes have different types of execution units, such as integer, floating point and special function units, and each of these execution units is pipelined. An integer instruction may be issued in one cycle, followed by a floating point instruction. When a floating point instruction is scheduled in a cycle, no instruction enters the integer pipeline; hence, there is an internal bubble within the integer pipeline. Our evaluations show that such internal bubbles are quite prevalent in GPUs. We exploit this observation with a new technique called warp folding. Our proposed warp folding mechanism splits the 32 threads in the same warp into two sub-warps of 16 threads each. The two sub-warps are scheduled back to back to the same execution unit, but each sub-warp uses only half of the lanes while the other half remains idle. When a sufficient number of warps are folded, one can create new idleness windows for execution units that previously were non-existent.

Current GPU schedulers greedily schedule instructions without considering which SIMT lanes were idle in previous issue cycles. Consider a scenario where the scheduler issues two warps in two consecutive cycles and both warps have lanes 0, 4, 6 and 7 idle. If in the third cycle the scheduler issues a single warp with all lanes active, the new warp disrupts the amount of idle time experienced by the idle lanes. To increase power gating opportunities we evaluate new scheduling algorithms that give priority to warps with similar active masks. We propose an active mask aware scheduling algorithm that reads the active masks of multiple active warps and groups warps based on active mask similarity. Once warps are grouped based on active masks, we explore lane shifting to further extend the idleness windows of SIMT lanes. Warp folding combined with the active mask aware scheduler forms our fine grain power gating framework, known as Origami (see the sketch below).
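As a rough illustration of warp folding, the Python sketch below splits a 32-bit active mask into two 16-thread sub-warps that issue back to back on the lower lanes; the mask encoding and function name are assumptions made for illustration.

    # Sketch of warp folding: a 32-thread warp becomes two 16-thread
    # sub-warps that run on lanes 0-15, leaving lanes 16-31 idle.
    def fold_warp(active_mask):
        low_half = active_mask & 0xFFFF           # threads 0-15
        high_half = (active_mask >> 16) & 0xFFFF  # threads 16-31, shifted onto lanes 0-15
        # Each sub-warp issues only if it has at least one active thread,
        # so a folded warp occupies at most two back-to-back issue slots.
        return [m for m in (low_half, high_half) if m]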
At the register file level the dissertation makes the following contributions:

Improve the power efficiency of the register file using the Warped-Register File: The dissertation presents a comprehensive register file access analysis. Our analysis shows that the register inter-access time is on the order of hundreds of cycles, which we attribute to the large number of threads and the large number of registers assigned to each thread. To exploit these observations we propose the Tri-Modal register file (TRIC). TRIC switches registers between the ON, OFF and drowsy (low power) states. Unused registers are placed in the OFF state all the time, while used registers sit in the low power drowsy state by default. When a read/write request targets a specific register, the register switches to the ON state to service the access; once the request is complete, it switches back to the drowsy state. To enable TRIC we augment each register with a tri-modal switch and propose a policy that controls the register state switching. In addition to TRIC, we also propose an active mask aware gating technique, named COMA, which exploits the underutilization of thread activity. COMA power gates the registers assigned to the inactive threads within the scheduled warp to save dynamic power. Both techniques combined, named the Warped-Register File, are able to save 69% of the register file power with negligible performance overhead.

Improve the power efficiency of the register file using the Pilot-Register File: In addition to the underutilization observations exploited by the Warped-Register File technique, our analysis shows that GPU workloads have the following characteristics. First, while warps are assigned a large number of registers, they tend to heavily access a small subset of them. Second, since GPUs apply the SIMT execution model, all the warps in a kernel execute the same code. Inspired by these two observations we propose the Pilot-Register File technique, which partitions the register file into two register files, namely the fast register file (FRF) and the slow register file (SRF). The highly accessed registers are allocated to the FRF and the remaining registers to the SRF. The highly accessed registers are identified using the pilot warp, the first warp that runs in each kernel. While the pilot warp runs, statistics about register access frequency are collected; at the end, the register allocation is changed based on the collected statistics (a sketch of this profiling step follows). Since the SRF is not accessed frequently, we propose operating it at near-threshold voltage.
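The profiling step of the pilot warp technique can be sketched as follows in Python. The FRF capacity and the counting interface are illustrative assumptions; the actual mechanism is described in Chapter 6.

    # Sketch of pilot-warp profiling: count per-register accesses while the
    # first warp of a kernel runs, then map the hottest registers to the FRF.
    from collections import Counter

    FRF_SLOTS = 8   # assumed FRF capacity per thread, for illustration only

    def place_registers(pilot_accesses):
        # pilot_accesses: register numbers touched by the pilot warp,
        # in program order.
        counts = Counter(pilot_accesses)
        hot = {r for r, _ in counts.most_common(FRF_SLOTS)}
        # Hot registers go to the fast register file; the rest stay in the
        # near-threshold slow register file.
        return {r: ("FRF" if r in hot else "SRF") for r in counts}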
Chapter 2
Background

In this chapter we present a brief overview of the GPU hardware and software execution model. We then go through some details about leakage power and how it will become a major issue in future technologies, and we provide a brief background on prior techniques to save leakage power.

2.1 GPU Background

In this section we cover the software and hardware execution models in GPUs. Since this work uses Nvidia GPUs as the baseline architecture, we use Nvidia terminology to refer to the various architectural and microarchitectural terms used in common practice. The ideas are, however, well suited even for AMD GPUs, which may use different terminology but in essence implement similar microarchitectural and architectural features.

[Figure 2.1: GPU software model]

2.1.1 GPU Programming

The introduction of the CUDA [32] and OpenCL [8] parallel computing platforms enabled programmers to write parallel applications that can run on GPUs. The written application has pieces of code that are designed to run on the CPU or the GPU. The main function in a CUDA application is called a kernel, and each application can have multiple kernels. When an application running on the CPU hits a kernel function call, the host CPU invokes the kernel execution on the GPU. The function call includes moving the data to the GPU memory, executing the kernel body on the moved data-set, and moving the data back to the CPU memory to continue executing the rest of the code.

CUDA and OpenCL give the programmer the flexibility to determine the number of threads that will be launched on the GPU during the kernel execution. Hence, in addition to the data-set, the kernel function call passes information about the number of thread blocks, or cooperative thread arrays (CTAs), and the number of threads in each CTA. Figure 2.1 shows the breakdown of an application. A CTA is the minimum amount of work that can be assigned to a streaming multiprocessor (SM) for execution, as described in the following subsection. Since each CTA has hundreds of threads, every N threads are grouped together into a warp. The warp is the minimum unit of work that can be scheduled on each SM. In the Nvidia GPUs that we use as our baseline architecture, a warp consists of 32 threads, and up to 48 and 64 warps can run concurrently on each SM in the Fermi [6] and Kepler [14] architectures, respectively. GPUs use the single instruction multiple threads (SIMT) execution model, where all the threads inside the same warp are issued together and execute the same instruction but with different operand values. Warps from the same CTA can be explicitly synchronized using synchronization primitives like syncthreads() in CUDA. CTAs, on the other hand, can be synchronized through global memory, or they can be reliably synchronized with each other at the end of the kernel execution. The details of the SM architecture and warp scheduling are described in the following subsection.
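A small Python sketch of the decomposition just described, using the Fermi warp size and occupancy limit quoted above; the function and its example launch parameters are illustrative.

    # Sketch of how a kernel launch decomposes into CTAs and warps.
    WARP_SIZE = 32
    MAX_WARPS_PER_SM = 48   # Fermi limit; Kepler raises this to 64

    def launch_shape(num_ctas, threads_per_cta):
        warps_per_cta = -(-threads_per_cta // WARP_SIZE)   # ceiling division
        total_warps = num_ctas * warps_per_cta
        print(f"{num_ctas} CTAs x {threads_per_cta} threads: "
              f"{warps_per_cta} warps/CTA, {total_warps} warps in total")

    # e.g. a kernel launched with 120 CTAs of 256 threads yields 8 warps per
    # CTA; each SM can hold up to MAX_WARPS_PER_SM of those warps at once.
    launch_shape(num_ctas=120, threads_per_cta=256)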
This entry has the list of the reserved registers for each warp which indicate that these registers will be updated by instructions executing from that warp. Before issuing any warp, the scheduler will check the scoreboard for the dependencies. If it turns out that the operands are ready and can be read from the register file, then the warp will be assigned a collector unit and the access requests will be sent to register file. GPUs have a large multi-banked register file that is used to manage the execution context of the warps scheduled on an SM by the scheduler. For instance, in Fermi(Kepler) each SM has a 128KB(256KB) register file. An instruction warp is scheduled for execution only when the input registers are ready. Even if the register inputs are ready, accessing a large 18 multi-banked register file can take multiple cycles before the instruction can be executed. In order to reduce the latency penalty associated with register reads, an instruction warp that is ready to be scheduled will be assigned a collector unit. The collector unit stores the warp id, the in- struction opcode, the source operand register number, the source operand value and a ready bit for each operand. Note that all input operands are already available in the register file when an instruction is scheduled for execution. Hence the ready bit in a collector unit is simply used to inform the scheduler when the register read operation is complete. Thus the pur- pose of the collector unit is to spread out accesses to the register file to avoid bank conflicts between different warp requests. When all operands values are read from the register file, as indicated by the ready bit, the instruction will be issued to the execution stage. Whenever the instruction is issued, the collector unit will be freed and the scheduler can assign a different instruction to that collector unit. To avoid structural hazards GPUs has several collector units. Traditionally register files in GPUs are very wide [73]; a single entry in a register file is 128 bytes wide and contains 32 32-bit operands. Hence one register entry is able to provide the input operand values for all the 32 threads within the same warp. To reduce the access latency of a large reg- ister file, it is divided into multiple single ported banks. While there are 19 multiple possible organizations of a banked register file, the most com- mon approach is to distribute the registers associated with a warp across multiple banks. For instance, two registers R0 and R1 used by the same warp may be placed in different banks. This organization allows multiple registers to be read by each instruction in a warp from across multiple banks. Thus each bank needs only be single ported, which reduces the power and design costs. Each warp has its own set of registers indexed by the warp id. For example, R1 used by the threads in warp 0 is different from R1 used by the threads executing in other warps. It is likely that registers used by different warps can be assigned to the same bank. Hence, the banked reg- ister file organization may lead to some conflicts between requests from different warps when they are mapped to the same bank. The collector unit can handle the bank conflicts by acting as a buffer for register reads. Warp Scheduler: Each SM has it own warps scheduler. The warps scheduler will fetch a warp from the instruction buffer and issue it to the execution units. In order to improve performance in GPUs, more than one scheduler can be integrated within an SM. 
Warp Scheduler: Each SM has its own warp scheduler, which fetches a warp from the instruction buffer and issues it to the execution units. In order to improve performance, more than one scheduler can be integrated within an SM. For example, in the Fermi (Kepler) architecture, two (four) schedulers are integrated in each SM, and each scheduler can issue one (two) ready instructions per cycle as long as there are no structural hazards.

The warp scheduler can use different scheduling techniques that help improve the performance of the GPU. In our evaluation of the techniques proposed in this dissertation we used different schedulers, namely the two-level warp scheduler [42], the fetch group scheduler [73] and the GTO scheduler. The two-level scheduler is shown in Figure 2.3a. In the two-level scheduler, all warps waiting on long latency events, such as memory accesses, are placed into a pending warps set. The active warps set holds all the warps that are either waiting on a short latency dependency or whose input operands are already available in the register file. The two-level scheduler only issues an instruction from the active warps set when all the input operands are ready for that instruction.

The fetch group scheduler divides warps into fetch groups, as shown in Figure 2.3b, and schedules the warps within each fetch group in a round-robin fashion. When the warps in the highest priority fetch group stall, the priority switches to the next fetch group in the priority line. It is able to achieve high performance compared to the two-level warp scheduler because of its ability to reduce contention on the memory and shared resources by restricting warp selection to one fetch group at a time.

[Figure 2.3: Baseline schedulers. (a) Two-level scheduler (b) Fetch group scheduler]

The GTO warp scheduler is a special case of the fetch group warp scheduler where the fetch group size is 1. In the GTO scheduler, instructions are issued from the highest priority warp as long as it is stall free. When the scheduler cannot issue from that warp due to structural or data hazards, it starts issuing instructions from the next warp in the queue.
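Since GTO is defined above as a fetch group scheduler with group size 1, its selection loop reduces to a few lines. The Python sketch below assumes each warp object exposes a ready() predicate that is false under structural or data hazards; both the interface and the function name are illustrative.

    # Sketch of greedy-then-oldest (GTO) warp selection.
    def gto_select(warps):
        # warps: warp objects in a fixed priority (age) order.
        for w in warps:
            if w.ready():
                return w   # greedily stay on the oldest stall-free warp
        return None        # every warp is stalled this cycle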
Drain- induced barrier lowering, gate-induced drain leakage, and gate oxide tun- neling are some sources of leakage in current CMOS technologies [84]. For example, DIBL effect occurs because of the short distance between the source and the drain and the high voltage of the drain. Under such conditions, part of the channel will be depleted and the threshold voltage will be lower. So, any small voltage on the gate will be enough for the carriers to create a conduction path between the source and the drain. 23 Figure 2.4: CMOS leakage current In order to quantify the leakage current effect as technology scales, we performed a circuit level simulation on a 6T SRAM cells built us- ing 90, 65,and 32 nm technologies. We used the technology files from the predictive models [1]. The leakage current for each SRAM cell is shown in Table 2.1. The leakage current is measured by measuring the total current drawn from the Vdd when the SRAM cell is in the standby mode(BL,BLB =vdd , WL=0, Data=0, DataB=vdd). The second column shows the supply voltage, the third column quantifies the leakage current of a single SRAM cell and the fourth column quantifies the leakage power of an 8kB register file bank calculated using the technique proposed by [64] . As shown, the Leakage current nearly doubled as the devices are scaled from 90nm to 32nm. Technology VDD Leakage Bank Leakage DRV (V) (nA) (mW) (mV) 90nm 1.2 14.6 2.9 120 65nm 1.1 22.8 3.5 145 32nm .9 26.08 5.6 220 Table 2.1: SRAM leakage current and DRV scaling trend 24 t0 t1 t2 t3 t4 time Energy Eoverhead !"#$%&'()$*& +,%+-,)&./"+0& VDD GND Sleep 1*/$2*$)$+)& 34+"56$47()$*& 8(0$2-6& 9"56$47()$*& 9:+/$&;& 9:+/$& ;<=>?& Ready_intruction_scheduled Cycles>wakeup_delay Cycles>idle_detect Cycles>BET time Busy (a) Gating circuit t0 t1 t2 t3 t4 Eoverhead Overhead to sleep and Wakeup Overhead to sleep Time Static Energy Cumulative energy savings (b) Break-even time Idle_detect Uncompensated Wakeup Compensated Cycle 1 Cycle 1+BET Ready_instruction_scheduled Cycles>wakeup_delay Cycles>idle_detect Cycles>BET time Busy (c) Power gating state machine Figure 2.5: Power gating overview In order to tackle the leakage power issues several techniques have been proposee [38, 51]. Power gating is one of the efficient techniques that can be used to reduce the leakage power of the execution units. In the next section we will go through the basics of the power gating technique. 2.3 Power Gating Power gating is a technique that is used to cut off the leakage current that flows through a circuit block. Power gating is implemented by adding a properly sized header transistor (between Vdd and the circuit block) or footer transistor (between the circuit block and Gnd) as shown in fig- ure 2.5a. When the transistor is OFF, the circuit block will be power gated and there will be no path from Vdd to Gnd, resulting in a very small leak- age current. When the power gating transistor is ON, then the circuit block will operate normally. Figure 2.5b illustrates the cumulative energy savings and energy over- heads when applying conventional power gating as described in [51]. The 25 solid green curve represents the cumulative energy savings from reducing leakage in the circuit. At timet 0 the power gating signal is enabled and the switch will turn off att 1 . The leakage energy savings begin increasing at timet 1 and will continue to accumulate with time as seen by the raising solid curve. 
Figure 2.5b illustrates the cumulative energy savings and energy overheads when applying conventional power gating as described in [51]. The solid green curve represents the cumulative energy savings from reducing leakage in the circuit. At time t0 the power gating signal is enabled, and the switch turns off at t1. The leakage energy savings begin accumulating at time t1 and continue to grow with time, as seen by the rising solid curve. The leakage savings stop at time t4, when the sleep transistor is turned back ON to bring the circuit back to the active state and the circuit block wakes up.

However, there is a dynamic energy penalty for switching the sleep transistor on and off. The red dashed curve represents the cumulative energy overhead due to power gating. The black dotted line labeled Eoverhead shows the total energy overhead for each power gating instance. The point where the dotted black line intersects the energy savings curve (at t2) is called the break-even time (BET): the minimum number of consecutive power gated cycles required to compensate for the energy overhead of the power gating switch [51]. If the block is turned ON before t2, the power gating overhead exceeds the leakage energy savings, resulting in net negative energy savings.

The time between t3 and t4 is the wakeup delay, the minimum number of cycles required to return the operating voltage to Vdd. At t4, the functional unit is fully powered up and operational. Recent studies on power gating of execution blocks have estimated the wakeup delay to be around 3 cycles and the break-even time to be between 9 and 24 cycles [51]. Since the circuit block cannot wake up instantly when a request arrives, the wakeup penalty can lead to performance degradation. In our experimental evaluation we use a wakeup delay of 3 cycles. In [51], break-even values of 9, 14, 19 and 24 were explored; in our experimental evaluation we use a value of 14 cycles as the break-even time.

Power Gating State Machine: Figure 2.5c shows the state machine for the power gating controller. As long as the block is busy, it stays in the Idle_detect state. As soon as the block has been idle for at least idle-detect cycles, it moves into the Uncompensated state and the circuit block is power gated. The Uncompensated state means that the power gating overhead has not yet been compensated: the energy overhead of operating the sleep switch exceeds the total leakage savings. The block stays in the Uncompensated state as long as the number of gated cycles is less than the break-even time; after the break-even time, the block moves to the Compensated state. In conventional power gating, the controller moves the block to the Wakeup state at any time if the block is needed for execution. In the Wakeup state, the block needs wakeup-delay cycles before switching back to the Idle_detect state. If the wakeup happens while the block is in the Uncompensated state, the total energy saved by that power gating attempt is negative.

For power gating to be effective, it is not sufficient to just have idle periods; it is critical that the idle periods be long enough for power gating to achieve net energy savings. The idle periods should be at least longer than the break-even time to translate power gating into positive savings. The power gating technique should also have minimal performance overhead, which is related to the wake-up latency paid every time a gated unit is woken up. For power gating to have low performance overhead, the wake-up latency should remain small in future technologies and should not cause any extra overhead that may offset the power savings.
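To summarize the controller described above, the following Python sketch models the conventional power gating state machine of Figure 2.5c, using the idle-detect, break-even and wakeup parameters quoted in this chapter. The model is behavioral, not RTL, and the coding style is ours.

    # Behavioral sketch of the conventional power gating controller.
    IDLE_DETECT, BET, WAKEUP_DELAY = 5, 14, 3   # cycles, as used in our evaluation

    class PowerGateFSM:
        def __init__(self):
            self.state, self.count = "idle_detect", 0

        def tick(self, busy):
            # Called once per cycle; `busy` means the unit is needed now.
            if self.state == "idle_detect":
                self.count = 0 if busy else self.count + 1
                if self.count >= IDLE_DETECT:
                    self.state, self.count = "uncompensated", 0  # gate the unit
            elif self.state in ("uncompensated", "compensated"):
                if busy:                 # conventional: wake up immediately
                    self.state, self.count = "wakeup", 0
                    return
                self.count += 1
                if self.count >= BET:
                    self.state = "compensated"  # overhead now amortized
            elif self.state == "wakeup":
                self.count += 1
                if self.count >= WAKEUP_DELAY:
                    self.state, self.count = "idle_detect", 0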
Chapter 3

Warped-Gates: Gating Aware Scheduling and Power Gating for GPGPUs

3.1 Introduction

Graphics processing units (GPUs) are massively parallel processors that are designed to run workloads with thousands of concurrent threads. By using a SIMT (single instruction multiple threads) execution model, GPUs can execute the same instruction with hundreds of different data operands concurrently. The simplified control logic coupled with massive parallelism can achieve hundreds of GFLOPs of peak throughput at low power.

Figure 3.1: Power breakdown for execution units

Due to their high throughput and excellent performance/watt, GPUs are being re-architected to run applications beyond traditional multimedia, such as modeling of physical phenomena and large scale data analytics. When GPU designs are stretched to become general purpose GPUs (GPGPUs), they suffer reduced performance/watt for several reasons. In [42, 53] the authors showed that there is wide variation in resource utilization when GPUs run applications with diverse parallelism demands. This variation leads to resource underutilization in GPUs, which reduces power efficiency.

As mentioned before, the execution units consume 20% of the total platform power, followed by memory (17%) and the register file (13%). While prior research has reduced static power consumption in the memory and register files of GPUs [42, 101, 104], to the best of our knowledge techniques for reducing static power of execution units within a GPU have not been explored.

Using the GPUWattch tool [62] we measured the static and dynamic power of the integer and floating point units of an NVidia GTX480 GPU while running a range of GPU workloads (experimental details are provided shortly). In Figure 3.1 the first two bars show the distribution of static and dynamic power. These results show that static energy accounts for about 50% of the total energy consumed in the integer execution units, and more than 90% in the floating point units. As technology scaling continues, handling component leakage power will become increasingly important.

In this research, we show that due to inherent application-level inefficiencies GPU execution units experience idle time. We propose to power gate idle execution units to reduce the static power. Power gating allows unused execution units to be turned off (or gated) when not in use, thereby mitigating static power consumption. Power gating has been widely used in microprocessors [51], caches [38], and NOCs [31]. Power gating has also been used in GPUs, but at the coarser granularity of gating a whole streaming multiprocessor (SM) [90]. However, when applying power gating at the finer granularity of execution units in a GPU, several new challenges arise. The focus of this chapter is to first describe these challenges and then propose solutions to address them.

The following are the contributions of the work presented in this chapter:

Limitations of conventional power gating for GPGPU execution units: We found that GPU execution units tend to be idle only for very short periods (the majority of the time less than 10 cycles), which conventional power gating techniques [51] are unable to exploit.
We show that the two-level warp scheduler [42] used in current GPUs greedily schedules instructions to execution units, which results in short switching intervals between the different types of execution units, such as floating point, integer, special function, and load/store units. Hence, no single execution unit stays idle for a sufficiently long period, called the break-even time, to amortize the cost of the power gating overhead.

Gating-aware two-level warp scheduler: To address the inefficiency of the GPU scheduler in extracting idle periods, we present a gating-aware two-level scheduler (GATES). GATES prioritizes issuing clusters of instructions that require the same type of execution unit for longer intervals before switching to a new instruction type. Thus GATES stretches the length of idle periods for each execution unit type. GATES can be built through low overhead extensions to the current two-level scheduler.

Blackout power gating: While GATES extends the idle periods, there are still many idle windows that are shorter than the break-even time. To address this concern we propose a new power gating controller called blackout. Blackout places new limitations on power gating state transitions. In particular, blackout forces an execution unit to be gated for at least as many cycles as it takes to recoup the power gating overhead. This policy is applied even when there are instructions waiting to use the execution unit during the gated time interval. We show that forcing an execution unit to stay gated even when there is a ready instruction does not hurt performance in GPUs, primarily because of the abundant heterogeneity of available instruction types.

Figure 3.2: Idle period length distribution for hotspot with a 5 cycle idle-detect and a 14 cycle break-even time. (a) Conventional power gating; (b) Gating aware scheduler; (c) Blackout power gating.

Adaptive idle detect: Finally, we present a runtime approach to limit the performance loss of blackout for certain workloads by adaptively adjusting the amount of time a unit must be idle before the unit is gated. We call this approach adaptive idle detect. Adaptive idle detect relies on easy-to-obtain runtime performance metrics to determine the amount of time a unit must be idle before gating is enabled. We combine GATES and blackout to create a coordinated power gating scheme, called Warped-Gates, with virtually no performance loss.

3.2 Power Gating Challenges in GPUs

In this chapter we focus on leakage energy savings for CUDA cores, comprising INT and FP units. The techniques presented can also be applied to SFUs. However, SFU instructions are relatively rare, and hence a conventional power gating scheme is sufficient to recover most of the wasted leakage energy in SFUs. Furthermore, SFUs account for only 2.5% of total execution unit static power consumption. The relatively large number of INT and FP units makes them the primary target for leakage energy savings, compared to SFUs.

As described earlier, Figure 3.1 shows the average energy breakdown for the INT and FP units. The first two bars show the baseline energy consumption breakdown when no power gating is applied. Static energy accounts for nearly 50% of the total INT energy and 90% of the FP energy.
The large proportion of static energy in the FP units is due to their relatively low usage compared to the INT units. Thus, there is large potential for reducing static energy by power gating these two unit types.

The last two bars show the energy breakdown after applying conventional power gating [51] with an idle-detect time of five cycles and a break-even time of 14 cycles. Conventional power gating reduces INT and FP unit energy by 11% and 29%, respectively. However, even after applying conventional power gating, static power still accounts for 31% of total INT unit energy consumption and 61% of total FP unit energy consumption. It is important to draw attention to the power gating overhead component in the last two bars. This component of energy consumption comes from the extra power burned to turn the sleep transistor on and off. Power gating overhead accounts for 9% and 20% of the INT and FP units' overall energy, respectively. We highlight the reasons for the high power gating overhead and the missed power gating opportunities in this section, and present solutions to alleviate these issues in the next section.

3.2.1 Need for Longer Idle Periods

In order for power gating to be effective, it is not sufficient to just have idle periods; it is critical to have idle periods long enough that power gating achieves net energy savings. In traditional microprocessor functional units, the majority of idle period lengths are many tens of cycles [36]. Since the idle duration is longer than the typical break-even time, conventional power gating is an excellent option for reducing static power in traditional microprocessors.

GPUs typically have many ready warps with a diverse instruction mix ready for execution. The two-level warp scheduler schedules ready warps from the active warps set without taking into consideration what other instruction types have been issued prior to the current issue cycle. As a result, different instruction types get issued within a short scheduling window. Interspersing different instruction types results in idle period lengths on the order of a few cycles for any given execution unit type in GPUs.

Figure 3.2a shows the idle period length distribution, in cycles, of a representative GPU benchmark, hotspot [68]. The data in the figure is partitioned into three regions. The left-most region (in blue) represents the idle period lengths that fall within the idle-detect time. The middle region (in red) represents the idle period lengths that fall between idle-detect and idle-detect + BET. The right-most region (in green) represents idle period lengths that are longer than idle-detect + BET cycles. For this specific benchmark, 83.4% of the idle periods are shorter than the idle-detect time, and only 6.5% of the idle periods are longer than idle-detect + BET cycles. In conventional power gating, only the idle windows in the last category lead to positive energy savings. The first category represents wasted idle periods that cannot be power gated due to their short duration. The middle range represents the set of idle periods that result in a net energy loss (or at best are energy neutral) if conventional power gating is used. While the results presented in this figure correspond to the hotspot benchmark, similar patterns can be found in all the other benchmarks in our experiments. What is important to note here is that, unlike conventional microprocessor functional units, the majority of idle periods are only a few cycles long.
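The three regions of Figure 3.2 can be expressed directly as a classification over idle period lengths. The short sketch below is a simplified analysis model rather than simulator code; it tallies an idle-period histogram into the wasted, net-loss, and net-gain categories, using the idle-detect and break-even values from our evaluation.

# Classify idle periods into the three regions of Figure 3.2
# (a simplified analysis model, not simulator code).
IDLE_DETECT = 5   # cycles of idleness before gating is triggered
BET = 14          # break-even time in cycles

def classify(idle_lengths):
    regions = {"wasted": 0, "net_loss": 0, "net_gain": 0}
    for n in idle_lengths:
        if n < IDLE_DETECT:
            regions["wasted"] += 1     # too short to even gate
        elif n < IDLE_DETECT + BET:
            regions["net_loss"] += 1   # gated, but woken before break-even
        else:
            regions["net_gain"] += 1   # gated long enough to save energy
    total = sum(regions.values())
    return {k: v / total for k, v in regions.items()}

# Example: mostly very short bubbles, as observed for hotspot.
print(classify([1, 2, 2, 3, 8, 12, 30, 2, 1, 25]))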
Figure 3.3 illustrates the shortcomings of current warp scheduling and its implications for power gating techniques. In this simplified illustration, the active warps set contains 10 warps with a mix of integer and floating point instructions. The order of instructions in the set is shown at the top of the figure. We assume each instruction is a simple add instruction, each instruction has a latency of four cycles, and the initiation interval is one cycle. These are the default parameters in GPGPU-Sim's configuration file for Fermi [21]. A two-level scheduler issues warps from the front of the active warps set without regard to instruction type, as shown in the center of the figure. For instance, in cycle three an FP instruction is available at the top of the active warps set and is issued to the FP unit. As a result, the FP unit, which was idle during the first two cycles, starts executing an instruction in cycle three. Similarly, the floating point unit is assigned another instruction to execute in cycle five. After the first two floating point instructions are completed, the FP unit has one idle cycle followed by the execution of two additional instructions. As a result, during the entire 15 cycle window the FP pipeline experienced three idle periods of two, one, and one cycle(s), which are too short for conventional power gating to take advantage of.

Figure 3.3: Effect of warp scheduler on idle cycles

3.3 Gating-Aware Two-Level Scheduler (GATES)

The data presented in the previous section points to the need for a technique that can coalesce short idle periods (the first two idle period ranges in Figure 3.2a) into fewer but longer idle periods. Such a technique shifts the distribution of idle periods into the right-most range in the figure, where power gating is beneficial. To accomplish this goal we propose a gating-aware two-level scheduler (GATES) which takes into account previously issued instruction types in determining which ready warp to issue next. GATES prioritizes issuing the same instruction type as was issued in the prior issue cycle, to coalesce the utilization and idle periods of the integer and floating point units. GATES keeps issuing instructions of the same type as long as there are ready warps of that type in the active warps set. GATES switches to a warp with a different instruction type when there are no more ready warps in the active warps set with the same instruction type as the one issued in the previous issue cycle. Note that GATES does not lead to starvation as long as there is some dependence between INT and FP instructions. Eventually all independent instructions of a given type will be exhausted, leaving room for the other instruction type to start issuing. The designer can also set a large maximum switching time threshold to force the scheduler to switch priorities at the end of the threshold, as in the sketch that follows.
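The switching policy just described can be captured in a few lines. The sketch below is an illustrative model of the GATES decision, not the scheduler implementation; the per-type ready warp counts and the optional maximum switching threshold are our notation, not an exact hardware interface.

# Illustrative model of GATES priority selection between INT and FP.
def gates_next_priority(ready_counts, current, cycles_at_current,
                        max_switch_threshold=None):
    """ready_counts: dict like {'INT': n, 'FP': m} of ready warps of
    each type; current: the type issued in the prior issue cycle."""
    other = 'FP' if current == 'INT' else 'INT'
    if ready_counts[current] > 0:
        # Optional designer-set cap forces a switch to avoid starvation.
        if max_switch_threshold and cycles_at_current >= max_switch_threshold:
            return other
        return current                 # keep issuing the same type
    # Switch only when the current type is exhausted and the other is not.
    return other if ready_counts[other] > 0 else current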
When GATES is applied to our previous illustrative example in Figure 3.3, all the INT instructions are issued first, and when there are no additional INT instructions, the FP instructions are issued. As shown in the bottom of the figure, the INT pipeline now has four consecutive idle cycles, while the FP pipeline has eight consecutive idle cycles. By coalescing instructions by type, we remove isolated bubbles from the execution unit pipeline and create longer idle periods, increasing the opportunities for power gating.

GATES is effective if there exists a sufficient number of active warps with a good mix of integer and floating point instructions to allow the scheduler ample opportunities to rearrange warps. Figure 3.4a shows the instruction mix for a large number of GPU workloads. Except for a couple of pure integer workloads (such as lavaMD), most benchmarks have a sufficient mix of integer and floating point instruction types. Figure 3.4b shows the maximum and average number of active warps available during runtime. The majority of benchmarks have a large number of active warps during runtime, allowing ample opportunities for rearranging ready warps. Only five out of the 18 benchmarks have fewer than ten active warps on average.

Figure 3.4: GPGPU workload characteristics. (a) Instruction mix; (b) Runtime active warps set size.

3.3.1 GATES Implementation Issues

In this section we describe the microarchitectural support needed for implementing GATES. We extend the default two-level scheduler with two enhancements: (1) per instruction type active warps subsets, and (2) a dynamic priority-based instruction issue scheme.

Per instruction type active warps subset: Since GATES prioritizes issuing instructions of a specific type, we propose to logically split the active warp set into four active warp subsets, namely integer (INT), floating point (FP), special function unit (SFU) and load/store (LDST) subsets. Each subset is associated with the corresponding execution resource. This partitioning of the active warp set can be done logically, rather than by physically separating the set, by adding two bits per entry in the active warps set. The two bits in each entry specify the execution unit needed for executing the corresponding warp instruction. Since instructions entering the active warps set are already decoded, the decoder can simply set the two-bit execution unit type as part of the decoded instruction information.

Instruction issue priority: The instruction issue arbiter inside the warp issue logic is modified with a simple priority-based issuing algorithm which assigns each instruction type an issue priority. We ordered the instructions in our implementation as: INT/FP, LDST, SFU, FP/INT. The ordering implies that either INT or FP is given the highest priority first. If INT is given the highest priority, then FP is given the lowest priority, and vice-versa. This ordering always separates the integer and floating point instructions to the two ends of the priority. The ordering priority between LDST and SFU is not relevant to this work, but we gave LDST a higher priority over SFU, assuming memory operations have longer access latency. Fermi's instruction scheduler is capable of issuing two instructions per cycle.
The only time an integer and a floating point instruction are issued in the same cycle is when there is just one ready instruction of the highest priority type (INT or FP), no LDST or SFU instructions, and one or more ready instructions of the INT/FP type that is not in the highest priority. By pushing INT and FP instructions to the two ends of the scheduling priority, units with the lowest priority enjoy longer idle periods. Furthermore, warps that need the lowest priority unit accumulate until a priority switch, at which time there will be many ready warps that need the same execution unit type.

Dynamic priority switching: Instead of using a static instruction priority, the priority ordering is dynamically switched during workload execution. We initialize INT as the highest priority and FP as the lowest priority. During execution, if the INT active warp subset is empty while the FP active warp subset is not empty, then the priority is switched between INT and FP. Similarly, if FP has the highest priority and the scheduler sees that the FP active warp subset is empty and the INT active warp subset is not empty, then INT is given the highest priority and FP is given the lowest priority.

GATES creates new idle periods and also lengthens existing idle periods by coalescing the bubbles in the functional unit pipelines. Figure 3.2b shows the effect of using GATES on the idle period length distribution. With GATES, 59.0% of idle periods are wasted due to the idle-detect window (down from 83.4% with the basic two-level scheduler). A larger portion of idle periods were moved into the power gating safe region: 18.9% (up from 6.5%) of idle periods are now longer than idle-detect + BET. While GATES is successful in creating positive power gating opportunities, unfortunately some of the increased idle periods moved into the negative energy savings region (the center region), which grew to 22.1% (from 10.1%). Recall that in this region the functional unit is power gated after the idle-detect window but is woken up before the break-even time has passed. One potential solution to this issue is to naively increase the idle-detect window, thus lowering the possibility of the net energy loss scenario, but this would also result in more wasted idle periods. Clearly, there is a need to better address the negative energy savings region, which is the focus of the next section.
3.4 Blackout Power Gating

In this section we propose a modified power gating scheme called blackout. When a unit is power gated, it is placed into a blackout state, where the unit cannot be woken up until it has been power gated for at least the break-even time, even if there are ready instructions. Blackout completely eliminates net energy loss occurrences. Figure 3.2c shows the combined effect of GATES with blackout power gating on the idle cycle distribution. Blackout power gating essentially pushes all idle cycles within the middle region into the rightmost region by forcing idleness on the execution units. In the case illustrated in the figure, 45.7% of idle cycles now result in net energy savings, a 7x increase compared to conventional power gating.

By forcing execution units to be power gated for the break-even time, even when there are ready instructions, conventional wisdom tells us that blackout will most likely lead to performance penalties. But due to the unique execution environment of GPUs, blackout does not suffer from performance penalties as feared. The primary reason is that GPUs have a variety of heterogeneous execution resources (INT, LDST, FP, and SFUs) coupled with a good mix of available instructions that are ready to be issued. When one execution resource type is forced idle, work can still be completed by the other execution resource types. As a result, the performance penalty due to forcing idleness on execution resources is hidden, leading to minimal performance impact.

Furthermore, trends in GPU design show that even a single execution resource type is going to be split into multiple clusters. For instance, in the Fermi architecture there are two clusters of INT and FP units organized into two SPs, as shown in Figure 2.2. The more recent Kepler architecture uses six clusters of INT and FP organized as six SPs [14]. Similarly, AMD's GCN architecture currently has four clusters of SP-like SIMD pipelines in each SM-like core [13]. Considering these developments, we propose an enhanced blackout mechanism that specifically takes advantage of clustered GPU architectures to further reduce the performance losses due to blackout.

In this section we explore two policies, naive blackout and coordinated blackout. Both policies are implemented on top of GATES, which was discussed in the previous section.

Naive blackout: In this policy, once the unit is idle for at least idle-detect cycles, the unit is placed in blackout mode. The wakeup mechanism differs from the conventional gating scheme. Compared to the conventional power gating state machine shown in Figure 2.5c, naive blackout does not have a state transition from the uncompensated state to the wakeup state. The only transition to the wakeup state takes place from the compensated state. Once a unit enters the blackout state, the scheduler simply avoids issuing instructions to the execution unit until after the break-even time is over. Once the break-even time is over, the scheduler is allowed to issue a ready instruction, if any, to the gated unit and trigger a wakeup for the gated unit.

Coordinated blackout: As mentioned earlier, clustered integer and floating point units are now common in GPUs. Coordinated blackout takes advantage of the clustered architecture. In the description below we assume the baseline architecture has 32 integer and floating point units that are clustered into two groups of 16 units each. When both clusters of a given type (integer or floating point) are in an active state, coordinated blackout simply uses the idle-detect window to detect idle cycles. If the idle cycles of a given cluster exceed the idle-detect window, then that cluster is placed in power gating mode and enters the blackout state. Once a single cluster enters blackout mode, the second cluster no longer uses idle-detect cycles in determining its power gating state. The second cluster instead checks the number of active warps waiting in the active warps subset associated with that execution resource. If no warp is waiting in the active warp subset, then the second cluster enters the blackout state immediately, even if its idle cycle window is less than the idle-detect window. On the other hand, if a single warp is waiting in the active warp subset, then the second cluster will not power gate even if the idle period length exceeds the idle-detect window.
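The gating rule for the second cluster can be summarized compactly. The following is a behavioral sketch under the assumptions above (two clusters per execution unit type, one shared active warp subset per type); the function decides whether an idle cluster should enter blackout.

# Behavioral sketch of the coordinated blackout gating decision
# for one cluster (assumes two clusters per execution unit type).
def should_enter_blackout(idle_cycles, other_cluster_gated,
                          warps_in_active_subset, idle_detect=5):
    if not other_cluster_gated:
        # Both clusters active: the plain idle-detect rule applies.
        return idle_cycles >= idle_detect
    if warps_in_active_subset == 0:
        # Peer already in blackout and nothing is waiting:
        # gate immediately, without waiting for idle-detect.
        return True
    # Peer in blackout but a warp is waiting: keep this cluster on,
    # even if the idle period exceeds the idle-detect window.
    return False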
Since one of the clusters of that execu- tion resource type has already entered the blackout state, it is best to avoid putting the second cluster in blackout state to avoid the perfor- mance penalty associated with waking up the unit. Hence, this approach improves performance by avoiding excessive wakeups associated with power gating, and saves power by avoiding the power gating overhead. Therefore, at least one of the two clusters will be always on whenever there is a warp in the associated active warp subset. With coordinated blackout, GATES instruction priority switching policy is also extended to switch instruction priority type if both execution units of the highest priority type are in blackout. 3.4.1 Reducing Worst Case Blackout Impact with Adaptive Idle De- tect Till now, all of our proposed approaches use statically fixed idle-detect window size. Hence, once an idle-detect window is selected it is not changed at runtime. Further improvements to blackout can be achieved by allowing the idle-detect window to be dynamically changed based on perceived performance loss at runtime due to blackout. Hence, blackout 47 is augmented with an adaptive idle detect mechanism that dynamically adjusts the idle-detect window to match an application’s runtime behav- ior. This mechanism will rely on simple metrics to infer performance loss. Inferring performance loss: It is not possible to precisely track the performance loss due to blackout. However, it is possible to use sec- ondary metrics as a proxy for inferring potential performance loss due to blackout. While we explored a number of metrics, for simplicity, we describe one simple metric that we used in our evaluation. We use a met- ric called critical wakeups to measure performance loss. Critical wakeup is defined as a wakeup that occurs the moment the blackout period ends. This metric implies that there was at least one instruction that was blocked in the active warps subset waiting for its corresponding unit to finish its break-even time before wakeup. The reason why this metric is only a proxy for performance loss is that we do not know how long the instruc- tion has been waiting (it could be have just entered the ready state, or a few cycles ago in the middle of the blackout period). Furthermore, not every blocked instruction leads to a performance loss because instruction execution delay does not always fall in the critical path latency. 48 0.95 1 1.05 1.1 1.15 0 5 10 15 20 25 30 Normalized Run6me Cri6cal Wakeups per 1000 cycles heartwall (0.99) NN (0.99) backprop (0.99) hotspot (0.99) nw (0.99) btree (0.99) gaussian (0.99) bfs (0.98) srad (0.97) lbm (0.96) cutcp (0.90) LIB (0.60) kmeans (-‐0.30) MUM (-‐0.28) lavaMD (-‐0.24) mri (0.21) WP (0.24) sgemm (0.06) Figure 3.5: Critical wakeup correlation Figure 3.5 shows the correlation of critical wakeups per 1000 cycles and performance loss for each benchmark across a range of static idle- detect values (0-10). The Pearson correlation coefficient (r) is displayed next to each benchmarks name. As can be seen, 11 benchmarks have strong correlation (r > 0:9) between critical wakeups and performance loss, showing great confidence that by regulating critical wakeups, we can limit performance loss. Some benchmarks (kmeans,MUM,lavaMD, mri,WP, andsgemm) have relatively low correlation. The reason for this low correlation is that these benchmarks do not suffer from any perfor- mance loss due to blackout to begin with and hence changing idle-detect window is neither beneficial nor harmful. 
3.5 Architectural Support

Figure 3.6: Architectural support for GATES, blackout, and adaptive idle detect

Figure 3.6 shows the required architectural support for the GATES, coordinated blackout, and adaptive idle detect mechanisms. The additional hardware support needed is shaded in a different color in the figure for each of the three proposed enhancements.

GATES: The base machine architecture is shown in Figure 2.2. Each entry in the active warps set has a ready bit which is set whenever all the input operands are ready for that warp. The ready bit is used by the baseline two-level scheduler to issue an instruction to the execution units. GATES requires each active warp entry to have two additional bits indicating the instruction type of the decoded instruction. The two-bit instruction type is set by the instruction decoder. Existing scheduling mechanisms already need the resource requirement information of each instruction for resource reservation purposes. Hence this two-bit instruction type may already be present in the decoded instruction bits, in which case GATES can make use of the existing information.

GATES modifies the instruction issue arbiter to determine which instruction type has the highest priority for scheduling. In order to dynamically determine the instruction priority, the active warp set is enhanced with two additional counters: INT_ACTV and FP_ACTV.
These counters are incremented every time the corresponding instruction type enters the active warp subset and decremented whenever an instruction leaves the active warp subset. The instruction priority logic makes use of these counters to dynamically determine the highest instruction priority to pass to the instruction arbiter. For instance, if INT_ACTV is zero and FP_ACTV is non-zero, then the scheduler switches the highest priority to FP, and vice-versa.

The current scheduling priority is stored as a two-bit value indicating the highest priority instruction type (either INT or FP). Note that LDST and SFU have fixed priority, and hence when the highest priority instruction type between INT and FP is known, the other instruction type becomes the lowest priority. Hence the instruction arbiter uses this two-bit value to determine the total priority ordering. Based on the current priority, GATES identifies N instructions to fit the issue width of N. In the Fermi architecture the value of N is two.

To quickly find the N ready instructions, we added four counters to the issue logic. These counters count the number of ready instructions of each instruction type that are present in the active warp set. As mentioned earlier, when an instruction enters the active warp set, it is not necessarily ready for execution. It may be waiting for short latency input dependencies to be satisfied from a previously issued instruction. It is the job of the scoreboard to identify when the input operands of an active warp entry are ready, at which point it sets the ready bit. Whenever the ready bit is set for a given warp, the corresponding instruction type counter is also incremented. For the Fermi architecture used in our experiments, each counter is five bits wide, since at most 32 active warps are present in the active warp set. Thus, there are four 5-bit counters (shown as INT_RDY, FP_RDY, LDST_RDY, and SFU_RDY in Figure 3.6).

The priority scheduler looks at the current priority and these counters to see what instruction types must be scheduled next. For instance, if the highest priority is INT, then it looks at INT_RDY to see if there are at least two INT instructions ready in the active warp set. If so, it scans the active warp set to identify the two INT instructions for scheduling. Note that even in the base machine the scheduler has to scan the active warp set to identify ready instructions for scheduling. Hence, our enhanced scheduler only needs to scan and match the two-bit instruction type information with the current instruction priority. Thus, in a single scan, the two instructions that will be issued can be identified, just as in the base machine, without adding any additional scans of the active warp set. Once the instructions are issued, the corresponding ready counters are decremented appropriately. Similarly, if the highest priority is INT but INT_RDY shows only one ready warp, then the second issue slot will be filled with either an LDST, SFU or FP instruction, in that order.
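Under these assumptions, the slot-filling decision made by the enhanced arbiter each cycle can be sketched as below. This is an illustrative model of the counter-based selection, not the arbiter RTL; rdy stands for the four ready-instruction counters, and the priority order is the one produced by the GATES priority logic.

# Illustrative model of GATES dual-issue slot filling using the four
# ready counters (INT_RDY, FP_RDY, LDST_RDY, SFU_RDY).
def fill_issue_slots(rdy, order, issue_width=2):
    """rdy: dict type -> ready count; order: priority ordering, e.g.
    ['INT', 'LDST', 'SFU', 'FP']. Returns the list of types to issue."""
    remaining = dict(rdy)
    slots = []
    for _ in range(issue_width):
        for t in order:
            if remaining[t] > 0:
                slots.append(t)
                remaining[t] -= 1   # counter decremented on issue
                break
    return slots

# Example: INT has the highest priority but only one ready warp, so the
# second slot falls through the priority order to FP.
print(fill_issue_slots({'INT': 1, 'FP': 3, 'LDST': 0, 'SFU': 0},
                       ['INT', 'LDST', 'SFU', 'FP']))  # ['INT', 'FP']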
Conventional power gating: We assume that the conventional power gating technique is implemented as shown in [51]. This approach uses idle-detect logic and ready-instruction detect logic. The idle-detect logic can be implemented as a counter that is incremented every time an idle cycle is detected and cleared whenever a ready instruction is detected. Whenever the counter hits the idle-detect threshold, the power gating logic triggers the power gating signal for that specific unit.

Blackout power gating: To support blackout power gating, each of the power gated units is associated with an N-bit countdown counter, called the blackout counter. Recall that all 16 integer units within a cluster are operated by a single power gating switch. Hence, for the design shown in Figure 3.6 we need four N-bit counters per SM, one counter per cluster (two integer and two floating point clusters). The size of the counter must accommodate the break-even time for a given power gating design. The counter is loaded with the break-even time as soon as the unit is power gated. Since most execution units need less than 24 cycles of break-even time in our current implementation, a 5-bit counter suffices to store the break-even time whenever a cluster is power gated. As long as the value of the counter is not zero, the unit remains power gated and GATES will not assign an instruction to that unit.

Coordinated blackout: Coordinated blackout requires knowledge of whether one of the two clusters is already in the blackout state, in which case the second cluster will not enter blackout as long as there is at least one instruction waiting in the active warp subset. To achieve this, the INT_ACTV and FP_ACTV counters available in the priority logic are used. Note that the RDY instruction counters cannot be used, since an instruction may be in the active warp set but not yet ready for execution. When one of the two clusters of a given type is power gated, the coordinated blackout logic checks the INT_ACTV or FP_ACTV counter to see if at least one instruction of the given type is in the active warp subset. If so, then the idle-detect mechanism is disabled to prevent the second cluster from entering the power gating state. On the same note, if there are zero instructions waiting in the active warp subset, then the second cluster is immediately put into the blackout state, even if its idle period is less than the idle-detect threshold.

Adaptive idle detect: The adaptive idle detect technique keeps track of the number of critical wakeups during each epoch using a critical wakeup counter. The counter is incremented every time an execution unit gets a signal to wake up in the same cycle that the break-even time counter hits zero. At the end of the epoch, the value of the critical wakeup counter is compared to a pre-defined threshold, which was empirically set to five as described earlier. If the counter value is greater than the threshold, then the idle-detect value is incremented and loaded into the idle-detect register used in the baseline power gating technique. Note that in the baseline the idle-detect value was fixed and could be hard-coded into the logic. However, for adaptive idle detect we need a register that can be incremented or decremented each epoch.

3.6 Evaluation

3.6.1 Evaluation Methodology

We evaluated our proposed techniques for performance and energy savings using GPGPU-Sim v3.02 [21]. We used the default Nvidia GTX480-like configuration provided with GPGPU-Sim. The baseline architecture, with a core clock of 700MHz, contains 15 SMs, with two SP units, four SFUs, and 16 LDST units per SM. Each SP unit contains 16 double-frequency CUDA cores, each with individual integer and floating point pipelines (a total of 32 CUDA cores per SM). The default warp scheduler is the two-level scheduler with 48 warps per SM, capable of issuing two warps per cycle per SM. GPUWattch [62] and McPAT [63] are used for power estimation.
We selected eighteen benchmarks to cover a wide range of scientific and computation domains from several benchmark suites, including Rodinia [68], Parboil [9], and ISPASS [21]. For all the power gating results presented in this section, unless specified otherwise, we assume a default idle-detect window of five cycles and a break-even time of 14 cycles.

3.6.2 Increasing Power Gating Opportunities

The following naming convention describes the techniques evaluated and applies to all figures: ConvPG refers to conventional power gating with the two-level scheduler (Section 2.3); GATES refers to the GATES scheduler + conventional power gating (Section 3.3); Naive Blackout and Coordinated Blackout (Section 3.4) refer to GATES + naive blackout and GATES + coordinated blackout, respectively. Finally, we collectively refer to the combination of all of our proposed techniques as Warped-Gates: GATES + coordinated blackout + adaptive idle detect (Section 3.4.1).

Extracting idle cycles through GATES and blackout: Figure 3.7a shows the effectiveness of GATES and coordinated blackout at extracting idle cycles. The y-axis shows the fraction of idle cycles (idle cycles/execution cycles) normalized to the fraction of idle cycles extracted by the two-level scheduler. The numbers reported here are for the integer unit, but the improvements are similar for the floating point unit. GATES alone was able to extract 3% more idle cycles than the baseline two-level scheduler. These extra idle cycles represent the idle cycles that were extracted by coalescing pipeline bubbles. Coordinated blackout increases the normalized fraction of idle cycles by 10%. Since this is a normalized fraction, the increase is not just due to an increase in idle cycles; it is also due to reduced execution time, which leads to an increased fraction of idle cycles.

Figure 3.7: Increasing power gating opportunity for integer units. Floating point units exhibit similar trends. (a) Idle cycles; (b) Compensated cycles; (c) Normalized wakeups (PG overhead).

In most cases Warped-Gates achieves a slightly reduced fraction of idle cycles compared to coordinated blackout. This result is due to the fact that Warped-Gates reduces the execution time penalty of power gating by activating fewer power gating events, but the reduction in execution time is outpaced by the reduction in idle cycles. In other words, with Warped-Gates the idle cycles were reduced faster than the execution time. This result is not a surprise, since in many applications reducing the idle cycles does not correspondingly reduce the execution cycles: not all idle cycle reductions are on the critical path of a program's execution. For the same reason, in heartwall the fraction of idle cycles decreased with coordinated blackout compared to GATES.
In this benchmark, with coordinated blackout the execution time decrease was outpaced by the idle cycle decrease.

It is also interesting to note that GATES and blackout provide savings under different operating regimes. When there are plenty of instructions in the active warp set, GATES allows instruction reordering to effectively improve the idle cycle windows. However, when there are only a few active warp entries, GATES is unable to find opportunities for instruction reordering. In this case, blackout allows instructions to build up whenever a functional unit is power gated. When the power gated unit comes out of blackout, there are again more opportunities for GATES to reorder instructions. Hence, the two techniques complement each other in effectively increasing the idle cycle windows.

Increasing time in compensated state: Figure 3.7b shows the percentage of cycles that the execution unit stays in the compensated power gating state for the integer unit for the various techniques. Bars with negative values represent workloads where the execution unit was in an uncompensated state more than a compensated state. This figure demonstrates the effectiveness of our technique at extending the idle period length. Some benchmarks (backprop, lavaMD) tend to have highly utilized functional units and thus have very few idle cycles; these benchmarks do not need any static energy saving techniques. There are a few benchmarks, such as cutcp and mri, that spend a significant amount of time in an uncompensated state with either conventional power gating or GATES. In these benchmarks many wakeups occur before the break-even time. However, the coordinated blackout mechanism was able to significantly increase the time spent in a compensated state, in a way that GATES alone cannot achieve. The geometric mean of cycles in the compensated state is 20.9% for conventional power gating, 22.6% for GATES, and 33.5% for coordinated blackout.

Power gating wakeups and overhead: Figure 3.7c shows the number of wakeups generated by each power gating technique, normalized to the conventional power gating scheme. The number of wakeups can also be interpreted as the number of idle windows that are power gated. Power gating overhead is directly correlated with the number of wakeups: in general, if we reduce the number of wakeups, we reduce the power gating overhead. As expected, GATES alone increases the number of wakeups in some cases. This result shows that GATES sometimes stretches an idle cycle window to just beyond the idle-detect window, and hence triggers gating more often, which leads to more wakeups. Coordinated blackout, by design, decreases the number of wakeups, by 26% compared to conventional power gating. Finally, Warped-Gates further brings down the number of wakeups, by 46% compared to conventional power gating, by dynamically changing the idle-detect window to avoid excessive power gating overheads. Thus, our proposed techniques can essentially cut the power gating overhead in half.

3.6.3 Energy Impact

Figure 3.8a and Figure 3.8b show the static energy savings, taking into account the power gating overhead, for the integer and floating point units. The results are normalized to a baseline with no power gating. All floating point results reported in this section exclude integer-only benchmarks, which have no floating point activity.
Conventional power gating with the two-level scheduler saves 20.1% and 31.4% of static power for the integer and floating point units, respectively. Benchmarks such as backprop, cutcp, lavaMD, and NN experience negative or no energy savings with conventional power gating, since the gating overhead exceeds the static energy savings.

Figure 3.8: Static energy impact of proposed techniques. (a) Int unit; (b) FP unit.

GATES alone with conventional power gating saves 21.5% and 35.2% of static power for the integer and floating point units, respectively. Hence, GATES alone creates some additional opportunities to save static energy. In some benchmarks GATES pushes some idle cycle windows past the idle-detect window but not sufficiently far to pass the break-even time. In these cases, GATES triggers more power gating events, which results in uncompensated power gating. In fact, we already showed this result in Figure 3.7b.

Naive blackout further increases the static energy savings to 27.8% and 41.1% for integer and floating point, respectively. There are just three cases where naive blackout leads to lower energy savings (backprop, heartwall and NN). Naive blackout can potentially power gate too aggressively, leading to higher power gating overhead in these three benchmarks. Coordinated blackout power gates the second cluster more conservatively than naive blackout. Hence, coordinated blackout increases the static energy savings to 31.5% and 45.6% for the integer and floating point units, respectively. Coordinated blackout saves 1.5x more static power than conventional power gating across both integer and floating point units.

Note that our approaches do not increase the dynamic energy of the functional units. The amount of work done (the total number of accesses for each functional unit type) is constant per workload, irrespective of power gating. We also accurately modeled the few microarchitectural counters that were added to our design using RTL-level design and synthesis, and measured their dynamic power due to counting activity. Our results show that these counters add less than 0.1% dynamic energy.

To estimate the total on-chip energy savings, we first estimate the fraction of leakage power that is consumed by the execution units. From GPUWattch, the total on-chip leakage power of the GTX480 accounts for 26.87W. The integer units and floating point units account for 0.00557W and 4.40W, respectively. Using these values we estimate that the execution units account for 16.38% of on-chip leakage power. Assuming leakage power accounts for 33% of total on-chip power and our technique can save 30% - 45% of static power, we estimate that our technique can save 1.62% - 2.43% of total on-chip power. As technology scaling continues, it is expected that static power will account for an increasing fraction of total on-chip power. If we assume leakage power accounts for 50% of total on-chip power, then our techniques can save 2.46% - 3.69% of total on-chip power.
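The arithmetic behind this estimate is straightforward; the short snippet below simply reproduces the numbers quoted above from the GPUWattch figures.

# Reproducing the on-chip savings estimate from the GPUWattch numbers.
chip_leakage_w = 26.87            # total GTX480 on-chip leakage power
exu_leakage_w = 0.00557 + 4.40    # INT + FP unit leakage power
exu_fraction = exu_leakage_w / chip_leakage_w   # ~0.1638 (16.38%)

for leak_share in (0.33, 0.50):   # leakage as a share of total chip power
    lo = exu_fraction * leak_share * 0.30 * 100   # 30% static savings
    hi = exu_fraction * leak_share * 0.45 * 100   # 45% static savings
    print(f"leakage share {leak_share:.0%}: {lo:.2f}% - {hi:.2f}% of total")
# Prints 1.62% - 2.43% and 2.46% - 3.69%, matching the text.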
3.6.4 Performance Impact

Figure 3.9 shows the performance impact due to power gating. Conventional power gating and GATES result in similar performance overheads of 1%. Naive blackout suffers the worst performance overhead, 5%, due to its aggressive shutting down of units for the break-even time without considering active warps that may be ready soon.

Figure 3.9: Performance impact

Coordinated blackout alleviates this concern and has a 2% performance overhead. By taking into consideration soon-to-be-ready active warps, it can avoid aggressive power gating events. In spite of this effort, coordinated blackout suffers performance losses in certain benchmarks, such as cutcp, heartwall and NN. The primary reason is that coordinated blackout places both the SP0 and SP1 clusters of INT or FP in the blackout state after confirming that no instructions are present in the active warps set. Unfortunately, as soon as a unit is placed in the blackout state in both clusters, a ready instruction immediately enters the active warps set. While these are rare cases, Warped-Gates changes the idle-detect window length to avoid even these corner case performance losses. Hence, Warped-Gates achieves virtually the same performance overhead as conventional power gating, but with significantly more energy savings.

Figure 3.10: Sensitivity to BET and wakeup delay. (a) BET; (b) Wakeup.

3.6.5 Hardware Overhead

We implemented the various counters added to the base machine to enable the proposed techniques in Verilog. We synthesized them using the NCSU PDK 45nm library [3]. We also extracted the area of an SM from GPUWattch [62]. An SM occupies 48.1 mm^2. The set of counters occupies 1,210.8 um^2, resulting in a 0.003% area overhead. An SM uses 1.92 W of dynamic power and 1.61 W of leakage power. The counters use 1.55e-3 W of dynamic power and 1.21e-5 W of leakage power in total, accounting for 0.08% dynamic and 0.0007% leakage power overhead.

3.6.6 Sensitivity to Power Gating Parameters

We conducted a sensitivity analysis across various wakeup delay and break-even time values. These results are shown in Figure 3.10.

Regardless of break-even times, Warped-Gates always outperforms conventional power gating with the two-level scheduler. With smaller break-even times, the energy savings gap between Warped-Gates and conventional power gating narrows, since the occurrence of negative power gating events decreases. As the break-even time increases, the energy savings gap between Warped-Gates and conventional power gating widens. For example, at a break-even time of 19 cycles, conventional power gating saves only 17% of integer static power, while Warped-Gates saves 33%, a nearly 2x increase. Performance remains relatively constant across break-even times.

For higher wakeup delay values, the performance and energy savings of conventional power gating degrade significantly. Recall that conventional power gating results in a high number of wakeups due to its aggressive power gating of execution units.
This results in a high performance penalty, from paying the wakeup delay on every wakeup, and lower power savings, as the unit consumes power while waking up but does no useful work. With a wakeup delay of nine cycles, conventional power gating saves only 6% and 10% of integer and floating point static energy, while Warped-Gates is able to sustain 33% and 48% of integer and floating point static energy savings. Performance-wise, conventional power gating has a nearly 10% performance impact, while Warped-Gates suffers a 3% overhead with a nine cycle delay.

3.6.7 Related Work

GPU schedulers: Since the scheduling decision can have a great impact on GPU performance and power, the GPU scheduler has been the target of different types of optimizations. The two-level scheduler [42] is an optimization over prior schedulers that placed all pending and active warps in a single queue. The proposed scheduler has a positive impact on performance, in addition to its role in reducing the proposed register file cache size. Narasiman [73] tackled the problem that traditional GPU schedulers, like round robin, cause long GPU stalls due to the large number of concurrent memory requests generated by the running workloads. To solve this issue, they proposed another two-level scheduler that improves performance and reduces stalls due to memory requests by dividing warps into fetch groups. Their two-level scheduler gives priority to each fetch group and rotates fetch groups whenever a long latency event occurs. Rogers [83] proposed a warp scheduler to improve cache locality and to reduce cache conflicts. Jog [54] proposed a warp scheduler to enable efficient pre-fetching policies. In this chapter we proposed a new power gating aware scheduling scheme that improves the power gating opportunities with negligible performance overhead. We used the two-level scheduler proposed by [42] as our baseline scheduler in all our evaluations.

CPU power aware schedulers: Power aware schedulers for CPUs and multi-core systems have been studied extensively [22, 88]. Previous work focused on dynamic power aware scheduling and DVFS decisions based on available task and service time. Our proposed scheduler is based on the GPGPU execution model and targets improving the power gating potential of the GPGPU execution units.

Power gating: Power gating techniques have been widely applied in microprocessors [51, 67], caches [38], and NOCs [31]. In the work presented in this chapter we show that applying power gating at the SM level is conservative and that there are plenty of power gating opportunities when the technique is applied at finer granularities. We also proposed the blackout power gating state machine to reduce the power gating overhead.

GPU power saving: The power efficiency of GPU micro-architectural blocks has been extensively studied. At the register file level, several works [42, 62] proposed techniques to save dynamic and static power of the GPU register file using circuit level and micro-architectural techniques. At the execution unit level, Leng [62] explored clock gating and DVFS to save dynamic power of the execution units based on mask activity and execution phases. Gilani [46] proposed several techniques to save execution unit and register file dynamic power. The proposed technique takes advantage of the similarity of data values in GPU workloads to save power.
They also proposed combining two simple instructions into one composite instruction that can be executed by an enhanced fused-multiply-add unit. Static power was not considered in Leng's or Gilani's work, and it is the focus of the ideas proposed in this chapter.

3.7 Summary

In this chapter we first analyzed the effectiveness of conventional power gating techniques when applied to GPU execution units. We showed that the basic two-level scheduler, which frequently intersperses different instruction types, leads to short idle periods for a given execution unit type. These short idle periods limit the effectiveness of conventional power gating. We proposed GATES to aggregate instruction issue such that clusters of instructions of the same type are given priority. GATES is effective in extracting and coalescing idle periods. We then evaluated a new power gating scheme called blackout to avoid the negative effects of power gating a unit that does not have sufficiently long idle periods, even after applying GATES. We then proposed an adaptive idle detection approach that dynamically varies the size of the idle-detect window before a power gating event is triggered. We call the combined approach of using GATES, blackout and adaptive idle detect Warped-Gates. With negligible area and performance overhead, Warped-Gates saves ~1.5x more static power than conventional power gating, achieving 31.6% and 46.5% of integer and floating point static energy savings overall.

Chapter 4

Origami: Fine Grain Power Gating Techniques for GPGPUs

4.1 Introduction

In the previous chapter we explored opportunities to power gate an entire cluster of INT or FP units whenever the entire cluster is unused. We exploited the heterogeneity in the executed instruction types to improve power gating opportunities, and prioritized scheduling either INT or FP instructions for extended periods of time to stretch the idle periods of the unused execution unit type. In addition, we modified the power gating state machine by forcing the power gated unit to stay in the gated state for at least the break-even time, which is the minimum time necessary to overcome the power gating overheads. These improvements collectively, titled Warped-Gates, are able to tackle inefficiencies due to temporal idleness by coalescing short idle periods into longer idle periods that can be used to efficiently power gate clusters of execution units. However, even after applying Warped-Gates, there is still a significant percentage of idle periods that are too short to be exploited, as shown in Figure 3.2c.
Despite these enhancements to improve GPU hardware efficiency on various fronts, there still exist significant fine-grain pipeline bubbles. In this chapter, we first show that these fine-grain pipeline bubbles exist regardless of workloads or scheduling policies. These fine-grain pipeline bubbles are a major source of wasted execution lane energy.
In order to improve the efficiency of execution units in the face of fine-grain pipeline bubbles, we propose Warp Folding. Warp Folding splits a warp into two half-warps, which are scheduled in succession, in order to fill these fine-grain pipeline bubbles. Warp Folding in turn creates idleness in half of the execution unit lanes, which can be leveraged for energy efficiency gains through power gating, without the need to force additional idleness through techniques such as blackout, as described in the previous chapter. As we show in our results section, the benefit of not forcing idleness is that we can virtually eliminate all performance losses compared to Warped-Gates.
Warp Folding is then enhanced with a folding-aware scheduler to minimize performance overhead. We present the Origami scheduler, which schedules warps based on the instruction type and active mask pattern in order to issue warps with similar types and active masks, coalescing idleness across the time domain. Warp Folding and the Origami scheduler together serve to maximize the fine-grain idleness opportunities through spatial-temporal manipulation of threads. We refer to these two techniques collectively as Origami.
4.2 Prevalence of Pipeline Bubbles
We now demonstrate the pervasiveness of the fine-grain pipeline bubbles that exist in GPGPUs regardless of workloads or warp schedulers. The pipeline bubbles exist because the scheduler was unable to issue two instructions back to back to the same pipeline. Note that in the designs evaluated in this work many of the execution units have an initiation interval of one cycle, and hence there is no design-imposed limit on using the execution units in consecutive cycles. But due to data hazards (such as read-after-write) and/or insufficient parallelism the pipeline bubbles are pervasive. These pipeline bubbles lead to leakage energy losses.
Figure 4.1 shows the percentage of idle cycles where the execution units (INT and FP units) have bubbles that last less than 10 cycles, normalized to the total execution time. Short idle periods (less than 10 cycles) present a difficult challenge to traditional power gating techniques, which require a minimum break-even time before the leakage energy savings can overcome the power gating overhead. The figure shows the results for three different warp schedulers, namely GTO, the two-level warp scheduler [42], and the gating aware two-level (GATES) scheduler [15]. Such schedulers are agnostic to resource usage, and therefore suffer many short pipeline bubbles universally across all benchmarks. On average, with the two-level scheduler, the execution unit experiences short pipeline bubbles 25% of the time for the integer and floating point units.
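To make the measurement concrete, the following minimal sketch (ours, not taken from the simulator) shows how such short bubbles can be counted from a per-cycle issue trace; issue_trace is a hypothetical list in which entry c is True if the pipeline was issued an instruction in cycle c.

    def short_bubble_cycles(issue_trace, threshold=10):
        # Count idle cycles that belong to bubbles shorter than threshold,
        # normalized to the total number of cycles in the trace.
        short_idle, run = 0, 0
        for issued in issue_trace:
            if issued:
                if 0 < run < threshold:  # a bubble just ended; was it short?
                    short_idle += run
                run = 0
            else:
                run += 1                 # extend the current bubble
        if 0 < run < threshold:          # account for a trailing bubble
            short_idle += run
        return short_idle / len(issue_trace)

    # One-cycle gaps between instructions: half of all cycles are short bubbles.
    print(short_bubble_cycles([True, False] * 8))  # -> 0.5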
Figure 4.1: Percentage of bubbles shorter than 10 cycles, normalized to the total execution time ((a) INT pipeline, (b) FP pipeline).
As shown in figure 4.2, these workloads have a diverse mix of instructions. When this diverse mix of instructions is interspersed, pipeline bubbles occur in the execution resources. In this context, the GATES scheduler [15] was proposed to schedule instructions based on execution unit resource type in order to coalesce resource usage, leading to longer idle periods for any given unused resource. These longer idle periods are then utilized for power gating purposes at the SP level. But even with the GATES scheduler, we see that short pipeline bubbles occur 23% and 19% of the time for the integer and floating point units, respectively. GATES targets eliminating intermediate length bubbles, which are amenable to power gating, leaving very short bubbles intact.
Figure 4.2: Workloads instruction type breakdown (LD/ST, SFU, INT, FP).
This prevalence of fine-grain pipeline bubbles, regardless of the warp scheduler and workloads, presents a challenge (and an opportunity) to improve GPU power efficiency. Our goal in this chapter is to tap into the energy savings potential of these short pipeline bubbles.
4.3 Origami: Converting Pipeline Bubbles into Energy Savings Opportunity
In order to leverage these fine-grain pipeline bubbles and convert them into energy savings opportunities, we present Origami. Origami consists of two components: Warp Folding and the Origami scheduler.
Figure 4.3: Origami consists of the Origami scheduler and Warp Folding.
A simplified overview of Origami is presented in figure 4.3 to illustrate the various components. Figure 4.3a shows warps (in blue) that are executed in the integer and floating point pipelines, along with the active mask of each warp (the active mask is represented by the width of the blue line for illustrative purposes). The gaps seen between two warps are the pipeline bubbles where there are no available warps for execution in that warp scheduling window. This scenario, which is a typical state of the execution units, shows that there are many fine-grain pipeline bubbles interspersed throughout the execution pipeline. The goal of Origami is to convert these wasted idle cycles into long stretches of idleness that can be harnessed for energy savings. The Origami scheduler uses a two step scheduling policy. First it schedules warps based on the instruction type. As shown in figure 4.3b, by scheduling based on instruction type, we can coalesce all the INT or FP instructions to appear close in time.
Thus we can squeeze out coarse-grain idleness with type based coalescing, similar to the GATES scheduler proposed in the previous chapter, which traditional power gating techniques can leverage for energy savings (shown in green). Once the instruction type based scheduling is complete, it still leaves plenty of short pipeline bubbles that are beyond the reach of traditional power gating. Furthermore, in the presence of divergence there is also dispersed lane level idleness. Origami therefore uses a second scheduling step that issues warps with similar active masks. As shown in figure 4.3c, scheduling by active mask can stretch lane-level idleness in order to extract more lane-level power gating opportunity. Finally, with Warp Folding, we can convert fine-grain pipeline bubbles into contiguous idle lanes to maximize energy savings, as shown in figure 4.3d. In the rest of this section, we discuss each component of Origami.
4.3.1 Warp Folding
We will first discuss in detail how Warp Folding can convert fine-grain pipeline bubbles into energy savings opportunities. Then in the next section, we will discuss in detail how the Origami scheduler can extract coarse-grain idleness, leaving behind only the fine-grain idleness that Warp Folding can utilize. We will use figure 4.4 to illustrate how Warp Folding can be used to convert wasteful pipeline bubbles into useful idleness.
Figure 4.4: Effect of Warp Folding on SIMT lane activity ((a) execution pattern with fine-grain bubbles, (b) thread activity after folding, (c) thread activity after cluster based folding, (d) execution pattern for a heavily loaded unit, (e) folding overhead for a heavily loaded unit).
Figure 4.4a illustrates the scenario where two instructions with a full active mask are issued to an INT pipeline, but these instructions are issued two cycles apart (similar to the bottom of figure 4.3c). The intervening cycle between the two instructions is idle. This scenario occurs when the scheduler does not have back-to-back instructions to issue to the same SP unit. Such a scenario can occur due to instruction mix, resource availability (structural hazards) elsewhere in the microarchitecture, and stalls due to data hazards.
Warp Folding splits up the 32 threads in a warp into two half-warps, which are then reshuffled to use the same active lanes and issued in succession. The two half-warps are scheduled back-to-back to the same SP unit. Each half-warp uses half of the SIMT lanes in the designated SP. The other half will be idle and will not be activated while the half-warps are scheduled.
Figure 4.4b shows an example of Warp Folding. Warp Folding transfers the bubbles in the lower order lanes to the higher order lanes of the designated execution pipeline. The transferred bubbles coalesce with the already existing bubbles in the higher order half to generate a longer sequence of bubbles that creates a higher potential for power gating (similar to the bottom of figure 4.3d). Because we are scheduling two half-warps back to back, the two half-warps must be issued with a delay equal to the initiation interval of the pipeline for that particular instruction. In order to take advantage of Warp Folding, the threads within the second half-warp should be scheduled on the same SIMT lanes as the first half-warp used. As a result, half of the SIMT lanes are inactive when the half-warps are scheduled. Under unconstrained Warp Folding, the higher order half of the SIMT lanes' data would be directed to the lower order half of the SIMT lanes. This requires a large multiplexer with significant delay and complexity. Hence, rather than using unconstrained Warp Folding across all 32 lanes, we propose to fold within a SIMT cluster of 4 lanes to minimize hardware overhead. This is demonstrated in figure 4.4c, where warps are folded within each 4-lane cluster. As shown, the threads within the same cluster are split into two sub-warps where each sub-warp has half of the threads in each cluster. This design is simpler because the threads assigned to a certain cluster move only between the lanes within that cluster. Hence the wiring overhead and the additional multiplexer delay are mitigated.
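The cluster-level mask-splitting step can be summarized with the following behavioral sketch (ours, assuming 4-lane clusters; it is not the RTL implementation). Each cluster's active threads are split into two sub-warps, and each sub-warp is compacted onto the low-order lanes of its cluster so that the high-order lanes stay idle across both sub-warps.

    CLUSTER = 4  # lanes per SIMT cluster, as in the baseline design

    def fold_warp(active_mask):
        # active_mask: list of 0/1 per lane. Returns (sub0, sub1), where each
        # sub-warp maps lane -> original thread id, or None for an idle lane.
        sub0, sub1 = [], []
        for base in range(0, len(active_mask), CLUSTER):
            threads = [base + i for i in range(CLUSTER) if active_mask[base + i]]
            half = (len(threads) + 1) // 2
            for sub, chunk in ((sub0, threads[:half]), (sub1, threads[half:])):
                # compact this sub-warp's threads onto the cluster's low lanes
                sub.extend(chunk + [None] * (CLUSTER - len(chunk)))
        return sub0, sub1

    # A fully active 8-lane warp folds into two sub-warps that both leave
    # lanes 2-3 of each cluster idle (compare figure 4.4c):
    print(fold_warp([1] * 8))
    # ([0, 1, None, None, 4, 5, None, None], [2, 3, None, None, 6, 7, None, None])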
4.3.1.1 Warp Folding Policies
Applying Warp Folding too naively may translate into a performance overhead. While the ideal situation for Warp Folding is to have a bubble between each scheduled warp, this may not always be the case. For example, figures 4.4d and 4.4e show a case where Warp Folding has a negative impact on performance. This scenario occurs when multiple instructions are issued back-to-back in consecutive cycles to the same pipeline. Hence, in order to avoid impacting performance, we have to take into account runtime execution resource utilization when deciding when to fold warps. In this work we explore different Warp Folding policies: Warp Folding can fold aggressively for the minority instruction type and conservatively for the majority instruction type.
The Warp Folding process starts by counting the number of instructions of each instruction type (INT or FP). For this purpose we augment the scheduler to include instruction type (INT, FP, LDST, SFU) count information within each group. If the INT instruction count is higher than the FP instruction count, then warps with INT instructions (the majority type) are conservatively folded and warps with FP instructions (the minority type) are aggressively folded into two half-warps. The fundamental intuition is that when there are more instructions of a given type, that instruction type should be given more resources for execution, even at the expense of leaving some pipeline bubbles as is. On the other hand, an instruction type with fewer instructions can be curtailed more aggressively to use fewer execution resources, and thus more pipeline bubbles are created for energy savings. Since we are applying aggressive and conservative folding techniques based on the number of instructions of a given type, the warp scheduler should also be made folding-aware, as will be discussed later.
Aggressive folding targets the instruction type (INT or FP) that has the lower count. For that type there are typically more pipeline bubbles, and more opportunities to fold warps.
We empirically observed that folding more than 70% of the time can lead to performance loss, as some of the dependent instructions have to wait for the folded warps to finish. Hence, we selected a 70% threshold. Therefore, in aggressive folding, we fold the selected instruction type for 70% of the cycles of each phase. In the remaining 30%, the warps are not folded.
Conservative folding targets the instruction type that already has the higher instruction count. Since there are more instructions of this type in the pipeline already, it is important to allow these instructions more resources to finish their execution, even at the expense of some missed power gating opportunities. Using empirical measurement we selected a 40% threshold. Thus warps are folded for 40% of the cycles of each phase. In the remaining 60%, the warps are not folded.
Adaptive folding: In addition to the instruction count, we take the resource utilization into account to make sure that folding is not enabled when the workload is highly utilizing the resources. Figure 4.4e shows an example where it is not recommended to fold. During the current phase we keep track of the total number of issued instructions. At the beginning of the next phase, this count is used to decide whether the aggressive and conservative folding policies should be enabled or disabled during the next phase. Hence, the policy adapts to the application behavior and makes sure that Warp Folding does not result in performance loss. For example, we disable Warp Folding when the total number of issued instructions in the previous phase is 90% of the maximum possible issued instructions. In the results section we will discuss the folding ratios for each workload and how adaptive folding can be used to avoid harmful scenarios.
Phase length: Folding one warp creates only a two-cycle bubble in the higher order lanes. In order to create longer idle windows in the higher order lanes, Warp Folding must be continuously activated for at least N_Fold cycles to exploit power gating opportunities. Thus the decision to fold or not is made at the beginning of each phase. If the value of the folding period N_Fold is small, it may force the power gated units to wake up before the break-even time has elapsed. On the other hand, if N_Fold is large, we cannot issue warps to the power gated lanes for an extended period. Hence, the value of N_Fold is selected based on the power gating parameters. The minimum folding length that can translate the thread inactivity into savings is shown in equation 4.1:

N_Fold = N_pipeflush + N_idledetect + N_breakeventime    (4.1)

where N_pipeflush is the number of pipeline stages in an execution unit. Recall that each execution unit, such as INT, FP and SFU, is pipelined and instructions have different latencies based on the opcode. Hence, once an instruction is issued to the execution unit, the unit cannot be power gated for N_pipeflush cycles. Based on the default instruction latency numbers in the GPGPU-Sim [21] configuration for the Fermi GTX480 architecture, N_pipeflush is typically about five cycles for most instructions, and for a few instructions it can reach up to 19 cycles. Only the divide instruction has a latency longer than 19 cycles, but this instruction occurs rarely in most workloads we analyzed. Hence, we conservatively chose 19 cycles for N_pipeflush. N_idledetect is the number of idle cycles before power gating can be activated; its value is set to 5 in our experiments based on prior research that empirically showed that 5 cycles is a good trade-off, reducing unnecessary power gating while still capturing most opportunities [51]. N_breakeventime is the minimum number of power gating cycles needed to compensate for the power gating overhead, and it has been shown to be in the range of 9-19 cycles in [51]. Hence, the value of N_Fold should be at least 50 cycles. As discussed earlier, since we enable the conservative folding policy only 40% of the time, the phase length N_phase should be at least 125 cycles (i.e., N_phase * 40% > 50). We did a sensitivity analysis on the phase length by varying N_phase between 150 and 500 cycles. Our results show that a phase length of 250-400 cycles provides the optimal trade-off between the energy savings and the performance overhead. Longer phases reduce the proposed technique's adaptation to application behavior and result in a small degradation in performance. In our simulations we used an N_phase of 300 cycles.
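As a worked instance of equation 4.1, the following snippet plugs in the parameter values quoted above; the variable names are ours, not the simulator's.

    N_pipeflush = 19      # conservative pipeline-drain latency (cycles)
    N_idledetect = 5      # idle-detect window before gating is triggered [51]
    N_breakeventime = 19  # worst-case break-even time, 9-19 cycles in [51]

    # Equation 4.1 gives 43 cycles with these values; the chapter
    # conservatively rounds this up to 50 as a safety margin.
    N_fold = N_pipeflush + N_idledetect + N_breakeventime

    # Conservative folding is active for only 40% of a phase, so the phase
    # length must satisfy N_phase * 0.40 > 50:
    N_phase_min = 50 / 0.40
    print(N_fold, N_phase_min)  # -> 43 125.0; the evaluation uses N_phase = 300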
4.3.2 Origami Scheduler
While Warp Folding can convert fine-grain pipeline bubbles into energy saving opportunities, the Origami scheduler can extract coarse-grain idleness. The Origami scheduler squeezes together idleness at the coarse grain level across execution units (through type-aware scheduling, as shown in figure 4.3b) and across lanes (through lane-activity-aware scheduling and lane shifting, as shown in figure 4.3c). This way the scheduler can coalesce resource usage and avoid scheduling warps that use higher order lanes. This approach leaves fine-grain idleness that Warp Folding can efficiently extract, as discussed in the previous subsection.
As a first step, the Origami scheduler classifies warps based on instruction type, and prioritizes issuing instructions of a given type. As such it enables INT and FP execution resources to be either utilized or idle for longer stretches of time, as has been demonstrated in the previous chapter. In the second step the Origami scheduler uses lane aware scheduling. The need for lane aware scheduling exists because of inactive SIMT lanes due to branch divergence or insufficient parallelism. Figure 4.5 breaks down the number of active threads in each issued warp. As shown, the workloads on the left side of the figure do not exhibit lane level idleness. However, as shown on the right side of the figure, other workloads exhibit lane level idleness due to divergence. For example, 40% of the issued warps in the heartwall workload do not use all 32 SIMT lanes.
Figure 4.5: Threads activity breakdown.
Figure 4.6a shows an illustration of the thread activity of different SIMT lanes when underutilized warps are issued. When underutilized warps are issued arbitrarily, the fine grain bubbles in each SIMT lane can be unaligned. To exploit lane level idleness more efficiently the Origami scheduler uses lane-aware scheduling to coalesce lane-level idleness. The Origami scheduler divides the warps that are scheduled on the SP units into two groups. The first group has the warps with less than 32 active threads, and we refer to this group as the less than 32 group. The second group has the warps with 32 active threads, and we refer to this group as the equal to 32 group.
Figure 4.6: Reordering and lane-shifting effects on power gating opportunities ((a) execution pattern, (b) reordering opportunities, (c) lane shifting opportunities).
At the beginning of each phase the scheduler checks the number of warps in each group (=32 or <32) and selects the group with the higher number of warps. This approach ensures that at the beginning of each phase the scheduler is likely to find instructions from the same group for the upcoming N consecutive cycles. Figure 4.6b shows the effect of the Origami scheduler's grouping on the power gating opportunities. As shown, the Origami scheduler avoids mixing fully active warps with underutilized warps.
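The phase-boundary group selection can be expressed with a small sketch (ours; the tuple representation of pending warps is a hypothetical stand-in for the instruction buffer state):

    def pick_group(warps):
        # warps: list of (warp_id, active_thread_count). Returns the warps of
        # whichever group (=32 or <32 active threads) currently holds more warps.
        full = [w for w, n in warps if n == 32]
        partial = [w for w, n in warps if n < 32]
        return full if len(full) >= len(partial) else partial

    # With three fully active warps and one divergent warp, the scheduler
    # prioritizes the "equal to 32" group for the upcoming phase:
    print(pick_group([(0, 32), (1, 32), (2, 32), (3, 12)]))  # -> [0, 1, 2]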
Lane Shifting: The last step in the Origami scheduler is lane shifting. The previous step simply avoids interspersing fully active warps with underutilized warps. Thus, for a contiguous stretch, some of the lanes in a warp are active, but they are not necessarily the same lanes. Lane shifting is the technique that aligns the inactive lanes by moving computation across SIMT lanes.
However, unrestricted lane shifting across all SIMT lanes incurs penalties in transferring register data across lanes. GPUs currently cluster the SIMT lanes into four lanes per cluster for design simplicity [42, 53]. Hence, we exploit this clustering and restrict lane shifting to stay within the cluster boundaries. All the shift operations move active threads to the leftmost available lane. Moving threads from their designated lanes to other lanes has been done before: the authors in [53] used this feature to replicate the thread execution on an empty lane to enable the intra-warp DMR technique proposed in that paper. In lane shifting, however, we move threads from their designated lanes to an empty lane within the same cluster to improve the power gating opportunities. Limiting the shifting to the cluster boundaries may reduce some opportunities for power gating, but it still provides significant power savings with minimal hardware cost.
Figure 4.6c shows the effect of applying lane shifting on each cluster of 4 SIMT lanes. In the first cycle the active thread running on lane 3 is shifted to lane 2 in the first cluster. Also, the active thread running on lane 5 is shifted to lane 4 in the second cluster. By shifting all active threads to the leftmost inactive lanes within each cluster, we created a new four cycle idle window for lane 3 in the first cluster. In the second cluster we extended the three cycle idle window of lane 6 to four cycles. As shown in the illustration, lane shifting improves the power gating potential by making a best effort to align the threads within each cluster. However, the restriction of shifting lanes within a cluster leads to some missed opportunities. For instance, if lane shifting were allowed across all 32 lanes, without any cluster restriction, then the two isolated inactive threads, marked as two red ovals in the figure, could have been aligned to create a three cycle idle window.
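A behavioral sketch of cluster-restricted lane shifting follows (ours, not the RTL): each cluster's active threads are compacted onto its leftmost lanes, which aligns the idle lanes across consecutive warps.

    CLUSTER = 4

    def lane_shift(active_mask):
        # Returns lane -> original thread id (None for an idle lane) after
        # shifting each cluster's active threads to its low-order lanes.
        out = []
        for base in range(0, len(active_mask), CLUSTER):
            threads = [base + i for i in range(CLUSTER) if active_mask[base + i]]
            out.extend(threads + [None] * (CLUSTER - len(threads)))
        return out

    # First cycle of figure 4.6c: mask 1101 0100 becomes 1110 1000 after
    # shifting, so lane 3 of cluster 0 and lanes 5-7 of cluster 1 stay idle.
    print(lane_shift([1, 1, 0, 1, 0, 1, 0, 0]))
    # -> [0, 1, 3, None, 5, None, None, None]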
4.3.3 Optimizations
Avoiding starvation: Separating warps into groups may lead to starvation if warps continue to enter the currently prioritized group for scheduling. In reality such scenarios do not persist in applications, since eventually warps from the active group stall or move into the inactive groups (i.e., due to a change in the active mask and/or instruction type). Hence, the default scheduler eventually cannot issue from the current active group due to stalls and data dependencies, and it is forced to switch to a different group of warps. However, to bound the starvation length, the scheduler is forced to switch between groups every M phases, where each phase runs for N_phase cycles. Thus after M*N_phase cycles, a group switch is automatically initiated. This forced switch gives the warps in the other groups a chance to be scheduled. As a trade-off between performance impact and power savings opportunity we selected M to be 2. As a result, the scheduler is allowed to issue from the same group for at most two phases.
Avoiding unnecessary warp folding: Every time Warp Folding is enabled, the warp is divided into two half-warps scheduled back to back on the same lanes. However, benchmarks like gaussian, nw, heartwall and MUM have many scheduled warps with fewer than 16 active threads due to insufficient thread level parallelism. In this case the original warp need not be folded, which eliminates the need to schedule the two sub-warps over two different issue cycles. The check of whether the entire upper or lower half of the active mask vector is idle can be done at the scheduler stage: the active mask is visible to the scheduler, and the scheduler can decide whether a half-warp is all zeros when a warp is folded.
Eager power gating: The conventional power gating scheme [51] forces each unit to wait for an idle-detect period before switching to the uncompensated state. The reason to wait for the idle-detect period is to observe sufficient idleness history before activating power gating. However, the proposed Warp Folding approach guarantees that once a warp is folded, the instruction type associated with that warp is continuously folded for the entire phase duration. Hence, rather than waiting for the idle-detect window, the power gating logic can immediately initiate gating of an unused lane. We call this approach eager power gating. In eager power gating, whenever Warp Folding is enabled this information is conveyed from the scheduler to the execution unit's power gating logic. As soon as the first idle cycle is detected in the upper half of the SIMT lanes and the Warp Folding enable signal is turned ON, the power gating logic immediately gates the upper half of the SIMT lanes within each cluster.
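The gating decision can be condensed into the following simplified sketch (ours; the full compensated/uncompensated state machine of [51] is elided, and the function and parameter names are hypothetical):

    def should_gate(lane_idle_cycles, fold_enabled, idle_detect=5):
        # Upper-half lanes are gated immediately while folding is active;
        # otherwise the conventional idle-detect window must elapse first.
        if fold_enabled:
            return lane_idle_cycles >= 1         # eager: gate on first idle cycle
        return lane_idle_cycles >= idle_detect   # conventional idle-detect

    print(should_gate(1, fold_enabled=True))   # -> True  (eager gating)
    print(should_gate(3, fold_enabled=False))  # -> False (waits for 5 idle cycles)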
4.3.4 Architectural Support
In order to enable Warp Folding, the warp scheduler and folding logic should be modified to decide when to fold and how to fold. INT-Count and FP-Count counters are used to hold the count of the warps with INT and FP instructions, respectively. The values of the INT-Count and FP-Count are used to decide, at the beginning of each phase, which instruction type will be aggressively folded and which will be conservatively folded. A two-bit instruction type field is added to the decoded instruction in the instruction buffer to annotate the instruction type.
In case of Warp Folding, half of the threads are scheduled in the current cycle and the other half are scheduled in the next cycle. For the first half, the threads execute on their designated SIMT lanes without any change. For the second half, the threads are shifted to the same SIMT lanes used by the previous threads. In order to make the hardware implementation simpler and avoid using a very large crossbar, each 4 SIMT lanes are clustered together [42] and a 4-to-1 shifting logic is assigned to each SIMT lane. To honor this clustering limitation, instead of folding the threads at the middle of the active mask as shown in figure 4.4b, the fold is done at the cluster level as shown in figure 4.4c.
In order to make sure that the operands that belong to the threads in the second half-warp are not overwritten, the SP pipe register is marked as busy until the second half-warp is scheduled. At the end of the execution of a folded warp, a re-shifting process takes place to return the output data to its original threads. Both the shifting and re-shifting operations are controlled by the active mask of the scheduled threads. The only difference is that the re-shifting logic shifts the output to the right to return the threads to their original place.
Figure 4.7 shows the Warp Folding steps in the context of two clusters for simplicity. Figure 4.7a shows the basic architecture when a fully active warp (i.e., one with an active mask of 11111111) has been scheduled.
Figure 4.7: Warp Folding execution steps and detailed thread activity ((a) SP design without folding support, (b) the first half-warp's thread activity and execution flow, (c) the second half-warp's thread activity and execution flow).
Figure 4.7b shows the modified pipe with the shifting and re-shifting logic added per 4-lane SIMT cluster. The figure shows the active mask of the threads in the first half-warp before and after the shifting logic. As shown, the thread activity of the first half-warp is 1100 1100. Since these threads are already mapped to run on the lower order SIMT lanes, their mapping is not affected by the shifting logic. On the other hand, the second half-warp's thread activity is 0011 0011, as shown in figure 4.7c. During the second half-warp's execution the shifting logic shifts the active threads to the lower order SIMT lanes. As a result, after the shifting logic the active mask is 1100 1100 instead of 0011 0011. After the threads finish their execution they are remapped to their original location by the re-shifting logic. It is important to note that the proposed Warp Folding approach does not impact LD/ST instructions.
Since warps are folded at the beginning of the EX stage and unfolded at the end of the EX stage, before the write-back stage, Warp Folding does not have any impact on memory coalescing.
Figure 4.8: Modified GPU pipeline.
We implemented the lane shifting logic and the required multiplexers in RTL to estimate the overall delay of the additional logic. The results show that the lane shifting logic delay is significantly smaller than the typical execution stage delay. However, we pessimistically assume that the lane-shifting and re-shifting logic need an extra two pipeline stages, one for each operation. Figure 4.8 shows the updated pipeline after adding the two additional pipeline stages, along with the additional hardware required by each stage. For simplicity, we show a block level design rather than the full circuit level detail used in our RTL implementation. In the shifting stage we have the shifting logic. In the normal case the active mask is used to decide which lanes are active and which are inactive (i.e., when the mask bit is 0 the lane is inactive, and when the mask bit is 1 the lane is active). However, when the shifting and folding techniques are applied, the lane shifting logic shifts the input values to different lanes. As a result, the original active mask should be shifted accordingly to reflect the change in lane activity. The logic at the bottom of the shifting stage is responsible for generating the new active mask. The new logic uses the sub-warp# (sub-warp0 or sub-warp1) and the Split signal to generate the new active mask. The new active mask is ANDed with the original active mask (labeled Mask in the figure) to generate the correct thread activity, labeled temp_Mask in the figure. temp_Mask is carried with the threads through the pipeline and is used by the re-shifting stage.
The re-shifting stage has the re-shifting logic and an additional result collection buffer. The result collection buffer collects the output data from each sub-warp before sending the output data of all 32 threads on the result bus at the same time. If Warp Folding is not enabled then the output data bypasses the collection buffer. All the results in the evaluation section are based on the execution model where two additional pipeline stages are added to our implementation, compared to a baseline without the two additional stages.
The Origami scheduler requires the warps to be classified into the less than 32 group and the equal to 32 group. To enable this classification, as soon as a warp is fetched into the instruction buffer we use an all-ones detector on the active mask that is already stored in the SIMT stack. We set a one-bit grouping field in the instruction buffer to record this classification. The detector logic can be shared across all warps within an SM and only needs access to the SIMT stack. The detection logic is also off the critical path, since instructions spend multiple cycles in the instruction buffer after fetch before they are scheduled for execution. In this work we assume the base machine can schedule two instructions every cycle. Hence, at most, the all-ones detection logic must be able to set the one-bit field for two instructions every cycle. Once the classification bit is set by the all-ones detector, we use two five-bit counters to count the number of warps pending in each group. Also, to keep track of the idle-detect time of each SIMT lane, a 3-bit counter is added per lane. The dynamic power of the 3-bit counter is 0.1uW per SIMT lane, about 0.2% power overhead per SIMT lane.
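The classification step amounts to an all-ones check on the 32-bit active mask plus two small group counters, as in this behavioral sketch (ours; the names and the example masks are hypothetical):

    FULL_MASK = (1 << 32) - 1

    def classify(active_mask):
        # Returns 1 for the "equal to 32" group, 0 for "less than 32".
        return 1 if active_mask == FULL_MASK else 0

    group_count = [0, 0]                   # models the two 5-bit counters
    for mask in (FULL_MASK, 0x0000FFFF):   # e.g., one full warp, one half warp
        group_count[classify(mask)] += 1
    print(group_count)                     # -> [1, 1]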
4.4 Evaluation
4.4.1 Evaluation Methodology
We evaluated our proposed techniques for performance and energy savings using GPGPU-Sim v3.02 [21]. We used the default Nvidia GTX480-like configuration provided with GPGPU-Sim. The baseline architecture, with a core clock of 700MHz, contains 15 SMs with two SP units, four SFUs, and 16 LDST units per SM. Each SP unit contains 16 double-frequency SIMT lanes, each with individual INT and FP pipelines (for a total of 32 SIMT lanes per SM). GPUWattch [62] is used for power estimation. We selected benchmarks to cover a wide range of scientific and computational domains from several benchmark suites, including Rodinia [68], Parboil [9], and ISPASS [21]. For all the power gating results presented in this section, unless specified otherwise, we assume a default idle-detect window of five cycles and a break-even time of fourteen cycles, and N_phase is set to 300 cycles. The base machine uses a two-level scheduler, implemented as described in [42].
4.4.2 Energy Impact
Figure 4.9 shows the static energy savings, after accounting for the power gating overhead, for the integer and floating point pipelines.
Figure 4.9: Execution units leakage energy savings ((a) INT pipeline, (b) FP pipeline).
The results are normalized to the baseline machine, which uses the two-level scheduler and does not apply power gating at the execution units level. All floating point results reported in this section exclude integer-only benchmarks, which have no floating point activity. Conventional power gating with the two-level scheduler (the first bar in figure 4.9) saves 29% and 34% of the leakage energy for the integer and floating point units, respectively. When applying Warp Folding on top of the two-level scheduler, we are able to save 41% and 43% of the integer and floating point pipeline leakage energy, respectively. This demonstrates the effectiveness of Warp Folding in converting wasteful fine-grain pipeline bubbles into useful power gating opportunities. In addition, Warp Folding is able to eliminate the negative energy savings for the backprop, CP and MUM workloads by folding the warps long enough to guarantee a positive net energy saving every time we power gate. When using Warp Folding with Origami scheduling, the energy savings increase to 49% and 46% for the integer and floating point pipelines, respectively. The extra savings are due to the ability of the Origami scheduler to coalesce fine-grain bubbles in each pipeline by scheduling the instructions based on instruction type and thread activity.
Overall, the proposed Origami technique is able to save 22% and 15% more leakage energy when compared to the baseline machine. Figure 4.9 also shows the static energy savings of Warped-Gates [15], presented in the previous chapter. The Origami technique saves slightly more leakage energy than the Warped-Gates technique. However, as stated before, the main goal of the Origami solution is to achieve better savings than Warped-Gates without degrading performance as Warped-Gates does, as will be shown in the next subsection.
To estimate the overall savings at the execution units level, we extracted the leakage power and the per-access energy for the FP and INT pipelines from GPUWattch [62]. Then we multiplied the energy numbers by the number of times the FP and INT pipelines are activated for each application. Our estimates show that the total static energy, normalized to the total energy of the execution units (i.e., static + dynamic), accounts for 50% of the total execution units energy. Hence applying our proposed techniques saves 25% of the execution units' total energy. In addition, since the execution units in the GTX480 account for almost 20% of the total GPU power [62], our proposed techniques are able to save 5% of the total GPU power.
We also analyzed the reasons for the improvement in the energy savings by looking at the number of power gating events that were activated. As mentioned before, each switch into the power gating state adds a power overhead equal to the break-even time. Also, each switch out of the power gating state adds delay cycles because of the wakeup latency. Using Origami we are able to coalesce many short idle periods into fewer but longer idle periods. In addition, the proposed Warp Folding policies keep folding the warps continuously for at least the N_Fold cycles that are required to amortize the power gating overhead. As a result, the number of power gating events is reduced, while at the same time the amount of time spent in each power gating phase is longer.
Figure 4.10: Execution units power gating overhead ((a) INT pipeline, (b) FP pipeline).
Figure 4.10 shows the power gating overhead for the integer and floating point pipelines. The power gating overhead is simply the number of power gating events multiplied by the break-even time. The first column shows the overhead of the traditional power gating scheme, and the second column shows the power gating overhead for the Warped-Gates [15] technique. Compared to the traditional power gating scheme, Warped-Gates is able to reduce the power gating overhead for the integer and floating point pipelines by 6% and 9%, respectively. Warp Folding alone is able to reduce the power gating overhead for the integer and floating point pipelines by 7% and 9%, respectively. When combined with the Origami scheduler, the power gating overhead is reduced by 9% and 14%, respectively.
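The overall-savings arithmetic above can be restated as a small calculation; all inputs are the percentages quoted in this subsection.

    static_share = 0.50     # static / (static + dynamic) execution-unit energy
    static_savings = 0.50   # ~ the 49%/46% leakage savings Origami achieves
    exec_unit_share = 0.20  # execution units' share of total GTX480 power [62]

    unit_energy_saved = static_share * static_savings      # 0.25 of unit energy
    gpu_power_saved = unit_energy_saved * exec_unit_share  # 0.05 of GPU power
    print(unit_energy_saved, gpu_power_saved)              # -> 0.25 0.05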
4.4.3 Performance Impact
Our proposed Warp Folding techniques apply different folding policies based on the demand for the resources, so the folding distribution is different for each workload. Figure 4.11 shows the percentage of time the warps are folded. The first column shows the percentage for the integer pipeline and the second column shows the percentage for the floating point pipeline. On average, 35% of the warps are folded. As mentioned before, the folding decision depends on the instruction mix and the resource utilization within each workload. For example, the mri-q workload has more FP instructions than INT instructions. Hence, the INT instructions have been folded 60% of the time while the FP instructions have been folded only 35% of the time. In addition, due to the adaptive policy, some workloads, like backprop and heartwall, have not been folded more than 20% of the time, because the adaptive policy disabled Warp Folding during high utilization phases.
Figure 4.11: Percentage of time Warp Folding is enabled.
Figure 4.12: Execution time normalized to the Warped-Gates technique [15].
Figure 4.12 shows the execution time due to power gating, normalized to the execution time of the Warped-Gates technique [15]. The first bar (labeled Warp Folding) shows the performance overhead of Warp Folding, and the second bar (labeled Origami) shows our overall Origami solution. When Warp Folding is applied naively we see a performance degradation of 1.6%. After integrating the Origami scheduler with Warp Folding, the performance of the Origami solution is better than the Warped-Gates performance by 6.8%. Hence, Origami is able to sustain the energy savings without impacting average performance across our workloads. The performance improvements are due to the adaptive folding policies that take the application utilization into account. Surprisingly, some workloads like backprop and btree show a performance improvement over the base machine. The reason for this anomalous behavior is that dynamically dividing the warps into scheduling groups based on the instruction type and active mask results in a different warp scheduling order, which in some cases reduces the contention on the memory bus, which in turn improves performance.
4.4.4 Sensitivity Studies
We have also done an extensive set of sensitivity studies with varying power gating parameters. When the wakeup delay is increased to 9 cycles the performance overhead increases by 1.4%. When the break-even time is increased to 19 cycles the degradation in performance is less than 0.2%. On the other hand, the total energy savings for the INT and FP pipelines degrade by only 4% and 1% when the break-even time is 19 cycles and the wakeup delay is 9 cycles, respectively.
4.4.5 Hardware Overhead
To evaluate the power and area overhead we implemented the extra blocks needed by the proposed idea in RTL. The blocks include the all-ones detection logic, the additional blocks in the lane-shifting and re-shifting stages, and the extra counters used by the scheduler. We used the Synopsys Design Compiler and the NCSU PDK 45nm [3] to synthesize the additional logic. The dynamic power of the additional logic is 0.64 mW per SIMT lane, and this dynamic power is counted every time the logic is activated by issuing a warp to one of the SP units.
Since there may be a deviation between the power numbers generated by our synthesis and GPUWattch, we normalized our power overhead to the execution units by synthesizing our own execution unit built from one adder (INT/FP), one multiplier (INT/FP) and one logic unit; the resulting power overhead is 1.6% of the power of the execution units. We also measured the power overhead for each benchmark based on the power numbers reported by GPUWattch [62] for that benchmark. The results show that the power overhead is less than 1% of the power of the execution units. The area of the additional logic is 0.055 mm^2 while the total area of the execution units is 32.7 mm^2. As a result, our total area overhead is 0.15%.
4.5 Related Work
GPU schedulers: Since scheduling decisions can have a great impact on GPU performance and power, the GPU scheduler has been the target of different types of optimizations. The two-level scheduler [42] is an optimization over prior schedulers that placed all pending and active warps in a single queue. The proposed scheduler has a positive impact on performance in addition to its role in reducing the register file cache size proposed in that work. Narasiman et al. [73] proposed another two-level scheduler that improves performance and reduces stalls due to memory requests by dividing warps into fetch groups. The two-level scheduler gives priority to each fetch group and rotates fetch groups whenever a long latency event occurs. By limiting the scheduling decisions to warps within just one fetch group, the number of concurrent memory requests generated is reduced, thereby reducing memory bottlenecks. Rogers [83] proposed a warp scheduler to improve cache locality and to reduce cache conflicts. Jog [54] proposed a warp scheduler to enable efficient pre-fetching policies. In this chapter we propose a new power gating aware scheduling scheme that improves the power gating opportunities with a negligible performance overhead.
CPU power aware schedulers: Power aware schedulers for CPUs and multicore systems have been studied extensively [22, 70, 71, 88, 95]. Previous work focused on dynamic power aware scheduling and DVFS decisions based on available task and service time. Our proposed schedulers are based on the GPU execution model and target energy efficiency through improving the power gating potential of the GPU execution units.
Power gating: Power gating techniques have been widely applied in microprocessors [51, 67, 69], caches [38, 93], and NOCs [31, 102]. In this chapter we showed that applying power gating at the SM level is conservative and that there are plenty of power gating opportunities when the technique is applied at the SIMT lane level. Specifically, we showed that instead of relying on existing idleness for power gating opportunities, as in prior work, we create opportunities by folding warps to create contiguous idleness in the high order lanes.
GPU power saving: Power efficiency of GPU micro-architectural blocks has been extensively studied. At the register file level, several works [38, 42] proposed techniques to save dynamic and static power of the GPU register file using circuit level and micro-architectural techniques. At the execution units level, Leng [62] explored clock gating and DVFS to save dynamic power of the execution units based on the mask activity and execution phases. Gilani [46] proposed several techniques to save dynamic power in the execution units and the register file.
The proposed technique takes advantage of the similarity in the data values in GPU workloads to save power. They also proposed combining two simple instructions into one composite instruction that can be executed by an enhanced fused-multiply-add unit. Static power was not considered in Leng's and Gilani's work, and it is the focus of this chapter. The authors in [98] proposed detecting the divergence patterns of the running warps and running the warps with the same patterns together to create gating opportunities. Their approach is only applicable when underutilized active masks appear repeatedly, while our proposed approach works even when applications have a fully utilized active mask. In this chapter we applied lane shifting, which is able to create the matching opportunities without any need to search for exact matches.
Warp size: The authors in [82] proposed the variable warp sizing (VWS) technique, which targets improving the performance of divergent applications. In VWS the warp size and the SIMT lane cluster size change at run time in order to improve the performance of divergent applications. VWS uses a smaller warp size as the base warp unit and then gangs warps together to form a larger warp. In contrast, our work assumes that a wider warp (32 lanes in our case) is in fact a good base unit and enables folding when there is divergence. As shown in figure 4.5 and figure 4.1, we evaluated our techniques using divergent and non-divergent applications, and the applications with no divergence still have significant pipeline bubbles. Our evaluation shows that folding warps has no performance impact and can provide additional opportunities to power gate. Hence our technique is effective both for divergent and non-divergent codes. Finally, in our approach warp folding is a very localized operation which is done in the execution stage, and the folded warps are combined back at the end of the execution stage. As such, Warp Folding does not require as extensive pipeline modifications as the VWS approach.
4.6 Summary
In this chapter we first showed the pervasiveness of fine-grain pipeline bubbles regardless of warp schedulers or workloads. In order to convert these wasteful pipeline bubbles into useful opportunities for energy savings, we proposed Origami. Origami consists of Warp Folding, a technique where warps are split into half-warps, which are scheduled in succession. This creates contiguous chunks of idleness in half of the SIMT lanes, which can be leveraged for energy saving opportunities through power gating. Origami also consists of the Origami scheduler, a new warp scheduler that is cognizant of the Warp Folding process and tries to further extend the sleep times of idle execution lanes. By combining the two techniques we show that Origami can preserve the energy savings that were achieved by prior approaches, while virtually eliminating all the performance overheads associated with prior GPU power gating solutions.
Chapter 5
Warped Register File: A Power Efficient Register File for GPGPUs
5.1 Introduction
As mentioned before, graphics processing units (GPUs) use an execution model called SIMT (Single Instruction Multiple Threads) [6] that allows many of the processing elements to share a single program counter to execute the same instruction but on different data elements concurrently. Concurrent thread execution with fast thread switching is supported by a large register file, even larger than a cache, that holds much of the execution state of each thread.
For example, in the GTX480 GPU there are a total of 16 streaming multiprocessors (SMs) and each SM has 32 cores. To enable 32 concurrent threads, each SM has a 128KB register file. However, each SM has only a 16KB L1 cache and 48KB of shared memory. Thus the total size of the register file across all SMs is 2MB, while the shared L1 cache size is only 512KB. The inversion in sizing between the cache and the register file, compared to the traditional memory hierarchy in a CPU, is a critical microarchitectural feature that is needed for supporting massively parallel execution.
Operating a large register file consumes significant dynamic and leakage power. This problem will get even worse in the future: the reduction in supply voltage has slowed in recent years, thereby limiting the dynamic power scaling ability of a transistor, and the reduction in threshold voltage is leading to a significant increase in the leakage power. As a result, GPU register file power consumption is receiving significant attention from industry and the academic community [42] [101].
There has been prior research in reducing the register file power consumption of traditional CPUs [50] [61] [77]. The SIMT execution model, however, provides GPU-specific opportunities to further reduce register file power. For instance, our analysis (details presented later) shows that once a register is accessed (read/write) by a thread, that register is not accessed again for several hundreds of cycles. The long inter-access delay can be exploited to save leakage power by placing registers in a drowsy state immediately after each access. Second, the utilization of SIMT cores within a GPU varies dynamically due to the varying amount of parallelism and the branch divergence problems in the applications. As GPUs are increasingly deployed in a wide range of application domains, the parallelism variance in application activity will only grow. Hence, dynamically disabling access to inactive registers can save a significant amount of dynamic power.
In this chapter we show GPU-specific microarchitectural features that we take advantage of in our proposed techniques to reduce the leakage and dynamic power of the GPU register file. The contributions of this chapter are:
Tri-modal register file: We exploit the property that the inter-access cycle count of registers is on the order of hundreds of cycles across a wide range of GPU workloads. We propose and evaluate a tri-modal register file that can switch between ON, OFF and drowsy states to reduce the leakage power consumption.
Active mask aware gating: We exploit the dynamic variance in the available parallelism within a warp to disable the bitline and wordline activity of unused registers. We rely on the GPU's built-in active mask feature to identify inactive threads within a warp well ahead of scheduling an instruction. Thus by using the active mask we disable unnecessary register activity to reduce the dynamic power.
By combining the above two techniques we show that the power consumption of the register file can be reduced by 69% on average.
Figure 5.1: GPGPU core pipeline.
5.2 Opportunities for Register File Power Savings
In order to show GPU-specific power saving opportunities in register files, in this section we present results from our experiments characterizing several GPU workloads. We used benchmarks from the NVIDIA Computing SDK [5], the Rodinia benchmark suite [68], and the Parboil benchmark suite [9]. The benchmarks used in this study are listed in the first column of Table 5.1.
The workload characterization results were obtained by running the benchmark suites using GPGPU-Sim v3.02 [21].
Register Allocation at Compile Time: We extracted the number of registers used by the compiler for the different benchmark kernels by passing "--ptxas-options=-v" in the nvcc compilation flags. In Table 5.1 the last column shows the percentage of the total available registers that are allocated by the compiler for benchmark execution. On average, 46% of the register file is never even allocated for executing a program. Given the vast number of registers available in a GPU, compilers simply cannot find enough demand to allocate all available registers for most applications. An unallocated register can be power gated at the beginning of the program execution without worrying about waking up that register. This is just one example of a GPU-specific opportunity to reduce register file leakage power.

Benchmark      | Concurrent CTAs | Allocated register %
---------------|-----------------|---------------------
Cutcp          | 8               | 62.5%
blackscholes   | 8               | 50%
mri-q          | 6               | 57%
sgemm          | 5               | 53%
Pathfinder     | 6               | 43%
streamcluster  | 2               | 31.3%
Backprop       | 5               | 52%
dct8*8         | 8               | 37.5%
nn             | 8               | 3.9%
hotspot        | 3               | 62.7%
heartwall      | 2               | 78.3%
nw             | 8               | 9.4%
bfs            | 3               | 33.3%
lbm            | 7               | 82%
sad            | 8               | 43%

Table 5.1: Workloads' register requirements

Register Inter-access Distance: In the next characterization experiment we focus only on the registers allocated by the compiler for a program's execution. We measured the number of cycles elapsed between two accesses to the same register. Figure 5.2 shows the average inter-access cycle count for the allocated registers for several workloads.
Figure 5.2: Registers inter-access cycle count.
To eliminate the skew generated by a few very lightly utilized registers, the results shown in Figure 5.2 exclude registers with an inter-access cycle count of more than 3000 cycles. Most workloads have an inter-access cycle count on the order of hundreds of cycles. On average, once a register is accessed in a given cycle, its next access occurs 789 cycles later. We attribute this to the GPU execution model: when a warp instruction is executed, it is unlikely that the same warp is scheduled for execution in the next cycle by the two-level warp scheduler. The only time a warp is scheduled in two consecutive cycles is when no other warp is ready in the active warp queue and the current warp's next instruction is ready for execution. In all other cases, there is a delay between two consecutive scheduling cycles for any warp, which results in a large inter-access delay for a given register. This large inter-access delay provides an additional leakage power saving opportunity: the drowsy-cache approach can be used to put a register into a drowsy state immediately following the current access. This behavior is not scheduler dependent. For example, even when other schedulers like the GTO scheduler are used, when the scheduler switches to a different warp it spends a large number of cycles scheduling the other warps before switching back to the same warp.
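The measurement itself is straightforward; the sketch below (ours) shows one way to compute the average inter-access distance from a hypothetical access trace of (cycle, register_id) events, such as one collected from GPGPU-Sim.

    def avg_inter_access(trace, cutoff=3000):
        # trace: list of (cycle, register_id) access events in cycle order.
        last_access = {}
        gaps = []
        for cycle, reg in trace:
            if reg in last_access:
                gap = cycle - last_access[reg]
                if gap <= cutoff:          # exclude rarely touched registers
                    gaps.append(gap)
            last_access[reg] = cycle
        return sum(gaps) / len(gaps) if gaps else 0.0

    # Two registers, each touched twice, 700 and 900 cycles apart:
    print(avg_inter_access([(0, 5), (700, 5), (900, 7), (1800, 7)]))  # -> 800.0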
Figure 5.3 shows the breakdown of the number of active threads for different workloads. Each bar in the graph is divided based on the number of active threads. For example, the topmost component of a bar, labeled 25-32, corresponds to the fraction of time a warp has at least 25 and at most 32 active threads. We grouped the workloads into three categories based on active thread count: Category 1 contains the workloads where all warps have 32 active threads throughout the entire workload execution. Category 2 contains the workloads with utilization levels between 90% and 99%. Category 3 contains the workloads with utilization levels below 90%. The data shows that only four workloads are in Category 1, indicating that many workloads rarely utilize all 32 threads. Throughout the rest of this chapter we organize our workloads as follows: in all figures (tables), the leftmost (upper) group presents Category 1 workloads, the middle group presents Category 2 workloads, and the rightmost (lower) group presents Category 3 workloads.

Figure 5.3: Warp utilization breakdown (each bar is divided into components for 1-4, 5-8, 9-12, 13-16, 17-24 and 25-32 active threads)

The number of active threads within a warp can be less than 32 for two reasons. First, the workload itself may not have enough threads to fill all warps with 32 threads. The inherent limitation in the amount of available parallelism in a workload has not been a significant concern for purely graphics-oriented workloads, but as GPUs are used for more general purpose computing this parallelism limitation is becoming a concern. For example, the nn workload has only 16 active threads in all scheduled warps. Second, if a branch instruction is encountered during execution, some of the threads diverge to the taken path and the rest diverge to the not-taken path. As a result, the warp is scheduled in two phases. In the first phase the threads on the taken path execute while all the threads on the not-taken path are idle. In the following phase all the threads on the not-taken path execute while the threads on the taken path idle. The GPU scheduler uses the active mask, a 32-bit vector, to track which threads are active within the scheduled warp in a given cycle. If the active mask bit for a thread is zero, that thread will not be active during that cycle.

Even though many warps have fewer than 32 active threads, each warp reads all 32 register operands from the register file, which wastes dynamic power. Hence, using the active mask to reduce register file activity can reduce dynamic power.

Impact of Register Usage on Power: The data from the characterization experiments shows that in many applications the compiler cannot even use all the available registers, so many registers are left unallocated. Even when a register is allocated, the distance between two consecutive accesses to the same register is around 789 cycles on average. Thus the vast majority of GPU registers sit idle for long periods. Even though these registers are not being accessed, they still burn leakage power. As shown in Chapter 2, the leakage current doubles when moving from 90nm to 32nm. Moreover, leakage power is expected to dominate as we move to smaller feature sizes.

Static Noise Margin (SNM): As technology scales, SNM degrades.
As a result, the minimum data retention voltage (DRV) must increase. We measured the DRV for 6T SRAM cells scaled from 90nm to 32nm. The fifth column of Table 2.1 shows the ideal DRV values for 90nm, 65nm, and 32nm [1]. The retention voltage necessary to keep the data alive in an SRAM cell has increased from 120mV to 220mV.

5.3 Reducing Register File Leakage Power

Given the opportunities presented in Section 5.2, we now present our leakage power reduction technique and the required circuit and microarchitecture level support.

Turning OFF Unallocated Registers: Our analysis showed that many registers in GPUs are not even allocated for program execution. The first step is to identify all the unallocated registers by analyzing the compiled code. This is a simple static code analysis that can be done on the application binary. We then turn off the unallocated registers completely at the start of application execution. The microarchitectural support necessary to turn off GPU registers is described later.

Drowsing Allocated Registers: The register inter-access cycle counts shown in Figure 5.2 indicate that there are opportunities to further save leakage power by exploiting the long idle times between two consecutive accesses to the same register. While turning off a register means zero leakage current, the content of the register is lost. Since threads use registers to maintain their context, it is not possible to simply turn off allocated registers, even if the inter-access delay is long. Hence, we use drowsy mode operation [38] to put registers into the drowsy state. The drowsy state yields a smaller leakage saving, but the register content is preserved and remains accessible when needed. While an ideal drowsy mode could operate at the DRV, in practice it is necessary to add safety margins to account for non-idealities and mismatch between transistors. Figure 5.4 shows the leakage current savings with different safety margins. The Y-axis plots the percentage of leakage current consumed by an SRAM cell operating at DRV+margin, compared to an SRAM cell operating at Vdd. The DRV and Vdd values for each technology node are taken from Table 2.1. Even if we conservatively add 250mV to the DRV, an SRAM cell in drowsy mode consumes less than 10% of the leakage current of an SRAM cell operating at full Vdd.

Figure 5.4: SRAM cell leakage current in drowsy mode with different safety margins, normalized to the leakage current at Vdd

Switching Policy: The switching policy decides when to turn a register ON and when to put it into the drowsy state. In this chapter we describe one switching policy, called read-aggressive and write-conservative. Whenever a specific warp is assigned to a collector unit, the associated input and output registers for that warp are switched from the drowsy state to the ON state. The two-level scheduler allocates a collector unit to the scheduled warp, and at the same time the switch-to-ON signal is sent for the input and output registers. In the most optimistic scenario the input registers are read into the collector unit in the next cycle. Due to bank conflicts, the collector unit is sometimes unable to read the second input register operand concurrently with the first. Hence, there is a delay between reading the first and second input operands.
Also, even when all of a warp's operands are ready, the instruction cannot always issue to the execution stage directly, because more than one warp (collector unit) can be ready at the same time and only up to two warp instructions can issue each cycle.

In our proposed policy all input registers of an instruction are woken up as soon as the collector unit is assigned. However, these registers switch back to the drowsy state independently, each as soon as it provides its data to the collector unit. For the output register we use a slightly conservative approach: the output register is kept in the ON state from the time the collector unit is assigned until the instruction completes execution and writes back its result. The scoreboard already tracks the output registers of every issued warp. Once the result is ready and written back to the register file, the scoreboard can instruct the output register to switch back to the drowsy state.

5.3.1 Architectural Support for Tri-mode Operation

Based on the description above, each register entry in the register file needs to be in one of three states: ON, OFF or drowsy. In [38], a drowsy control signal is used to switch a cache line between the drowsy state and the ON state. The microarchitectural support suggested in [38] is inadequate for our approach because we must switch between three states. In [75] the authors proposed a tri-modal switch that can put logic into one of three states: ON, OFF and drowsy. The switch uses MTCMOS transistors to control the voltage levels and switching speed. We use the tri-modal switch to place the unallocated registers into the OFF state, while all the remaining registers are placed in either the drowsy or the ON state. As described in our switching policy, registers are in the drowsy state by default. Whenever a register is accessed we move it to the ON state temporarily to enable read/write accesses; the register switches back to the drowsy state after the operation completes.

Figure 5.5: The proposed register file with the tri-modal control unit (TRIC) and the coordinated mask aware control unit (COMA) integrated

Figure 5.5 shows the block diagram of the register file tri-modal control unit (called TRIC). The figure shows two rows of registers, where each row has 32 32-bit registers, named R0T0 (register R0 of thread T0) and so on. Each register has its own tri-modal switch that receives its control signals from TRIC. TRIC first receives the application's register allocation information. This information is extracted by the compiler analysis and is provided as part of the application binary metadata. TRIC uses the allocation information to turn OFF all the unallocated registers at the start of application execution, and then puts all the allocated registers into drowsy mode by default.

At run time, the two-level scheduler sends to TRIC the input and output registers of the warp that is being assigned a collector unit. TRIC uses this information to move those input and output registers from the drowsy state to the ON state. Once a register value is read, the collector unit sets the corresponding input ready bit to "1" and also informs TRIC that it received the register value. TRIC then switches that input register back into the drowsy state. Finally, when a warp enters the writeback stage, the scoreboard notifies TRIC that the output register has been written. TRIC then switches the output register to the drowsy state.
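To summarize the interaction between the scheduler, the collector units, the scoreboard and TRIC, the following is a minimal behavioral sketch of the read-aggressive, write-conservative policy in Python. It is an illustration under simplifying assumptions, not the hardware design; all of the names (RegState, TricController and the three event handlers) are ours.

```python
from enum import Enum

class RegState(Enum):
    OFF = 0      # unallocated register: power gated for the whole kernel
    DROWSY = 1   # allocated but idle: retained at DRV plus a safety margin
    ON = 2       # full Vdd: readable and writable

class TricController:
    def __init__(self, num_regs, allocated):
        # Unallocated registers are turned OFF once, at kernel launch;
        # every allocated register starts out drowsy by default.
        self.state = [RegState.DROWSY if r in allocated else RegState.OFF
                      for r in range(num_regs)]

    def on_collector_assign(self, src_regs, dst_reg):
        # Read-aggressive: wake every source operand and the destination
        # as soon as the warp is assigned a collector unit.
        for r in list(src_regs) + [dst_reg]:
            self.state[r] = RegState.ON

    def on_operand_read(self, r):
        # Each source register returns to drowsy independently, right
        # after its value is latched by the collector unit.
        self.state[r] = RegState.DROWSY

    def on_writeback(self, dst_reg):
        # Write-conservative: the destination stays ON until the
        # scoreboard signals that the result has been written back.
        self.state[dst_reg] = RegState.DROWSY
```

The asymmetry mirrors the policy above: source registers drop back to the drowsy state one by one as soon as they deliver their data, while the destination register is held ON across the entire execution of the instruction.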
5.3.2 Architectural Support for Reducing Drowsy Wakeup Latency

One potential drawback of the drowsy technique is performance loss due to the wakeup latency. As explained before, the scheduler only looks at the ready warp queue to issue a warp for execution. In our baseline implementation the scheduler sends wakeup notifications to TRIC for the registers associated with the warp that is currently being scheduled, which gives only one cycle of lead time. Thus there is a one cycle delay between when the scheduler sends a wakeup signal to the registers and when the collector unit starts reading the register values. Based on our evaluation, a one cycle wakeup delay does not cause any performance loss in our base machine. If the wakeup latency is multiple cycles, however, there may be some performance penalty.

In this section we present one potential approach to entirely hide a multi-cycle wakeup latency of a drowsy register, albeit with a small modification to our baseline. For instance, if the drowsy wakeup latency is two cycles, the scheduler needs to send the register information to TRIC at least one cycle before issuing the associated instruction to the collector unit. To handle this case the scheduler can issue one instruction and concurrently look at the ready warp queue to find the warp that will be issued in the next cycle. If the scheduler knows which warp will be issued next cycle, it can proactively send the register read information to TRIC one cycle before that warp is scheduled, thereby eliminating the two cycle wakeup delay of a drowsy register. In fact, the scheduler can look ahead into the ready warp queue to identify the warps that will be issued "n" cycles ahead, and may wake up their drowsy registers to hide up to "n" cycles of wakeup delay. In the worst case the scheduler only learns which warp will be issued at the start of the cycle, and cannot hide the register wakeup latency at all. In our results section we explore the performance impact of a range of drowsy wakeup latencies, assuming there is no way to hide wakeup delays beyond one cycle.

5.4 Reducing Dynamic Power with Active Mask Aware Gating

In the previous section we described our approach for reducing the leakage power of the register file. But every time a register entry is placed in the ON state, all 32 register operands associated with a single warp instruction are woken up. Thus, every access to the register file reads a 128 byte register entry to feed the 32 threads in a warp with their source operands. Reading such a wide register incurs significant dynamic power for activating the bitlines, the wordline and the sense amplifiers.

As shown in Figure 5.3, the number of active threads within a scheduled warp is often fewer than the warp width of 32 threads. Some workloads, such as nw and nn, never have more than 16 active threads throughout the execution time, and many workloads have a varying number of active threads within a warp at runtime. Scheduling a partially utilized warp still activates the full-width register file entry. For example, scheduling a warp with 31 active threads out of 32 means that we have to charge the wordline (WL) segments of all 32 cells, pre-charge 32 bitlines (BL) and 32 bitline bars (BLB), and activate 32 sense amplifiers, although only 31 of them are useful for warp execution.
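The information needed to avoid this waste already exists in hardware, as the next section explains. As a purely illustrative sketch (the function name and the one-word-per-lane segmentation are our assumptions, not the circuit), the per-lane enable signals follow directly from the warp's 32-bit active mask:

```python
def gating_signals(active_mask: int, lanes: int = 32):
    """Per-lane enables for one 128-byte register entry: a lane's WL
    segment is charged, its BL/BLB pair pre-charged, and its sense
    amplifier fired only when the corresponding mask bit is set."""
    return [bool((active_mask >> lane) & 1) for lane in range(lanes)]

# A warp with only threads 0-15 active (e.g., every warp in nn or nw)
# touches half of the entry; the other 16 segments stay idle.
assert sum(gating_signals(0x0000FFFF)) == 16
```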
According to [52], the active power of a read operation from an SRAM array is proportional to the number of accessed bits and the access time. The power-optimal solution for such an access pattern is to access only the registers associated with the active threads within a warp. GPUs already use an active mask to determine which threads are active and which are inactive. Hence, we exploit this information to disable the register activity associated with inactive threads. Specifically, we use the active mask of each warp to disable the BL, BLB, sense amplifiers and output circuitry of the inactive portion of the accessed register.

5.4.1 Architectural Support

Recall that in our baseline design the register file is banked and each register entry is 128 bytes wide. To support active mask aware access to the register file, we use the Divided Word Line (DWL) approach [99]. DWL was originally proposed to save dynamic power in large caches where only a single word needs to be accessed at any time.

Divided Word-Line (DWL): In the DWL technique the WL is divided, or segmented, into multiple wordlines. Figure 5.6 shows the schematic of the DWL. Each WL segment has its own local decoder, a simple AND gate, that enables or disables access to the SRAM cells attached to that WL segment. For our work we modified the original DWL approach so that each access to a register entry can provide data to only a subset of the threads within a warp. GPU designs are particularly suited for easy integration of the DWL approach into a register file: a warp's active mask provides all the decoding information needed to identify active and inactive threads within a warp. Whenever the register file is accessed, the active mask of the scheduled warp can be used by DWL to activate or deactivate the BL and BLB pre-charge, the WL segments, the sense amplifiers and the output circuitry.

Figure 5.6: Schematic of the divided wordline

To manage the gating signals we added the coordinated mask aware (COMA) control unit to the register file. The bottom part of Figure 5.5 shows the block diagram of the register file with COMA integrated. When a read or write operation is issued, COMA loads the warp-specific active mask. It then generates the appropriate control signals to gate the inactive registers. COMA consists of a 16-entry mask table (one entry per collector unit), indexed by the collector unit entry id. Whenever the two-level scheduler assigns a warp to a collector unit, the associated active mask is written into the designated entry of the mask table. Every register access request is routed to COMA along with the collector unit entry id. The active mask table is then accessed to read the active mask value, which is fed to the gating logic to generate the appropriate gating signals. The total size of the active-mask table is 16*32 = 512 bits. For the output register, the scoreboard sends the active mask of the output register to COMA when the scheduled instruction reaches the writeback stage.

Power Efficiency of Modified DWL: To quantify the benefits of the DWL technique, we built a 128 byte register entry in Cadence. The register is implemented using the technology files for 90nm, 65nm and 32nm [1]. We extracted the WL resistance and capacitance per unit length for the different technologies [2] [30] [49]. We also estimated the SRAM cell area as 146F^2 [2], where F is the feature size.
The developed RC model is inserted between every two cells in the 128 byte register. Even after accounting for the additional delay of the AND gate, the reduced RC load on the long wire in the DWL approach results in a 55%, 31% and 23% reduction in the WL charging delay for the 90nm, 65nm and 32nm technologies, respectively.

Different register file organizations: There are other possible register file organizations besides a single wide register entry. One possible implementation is the one used by [42]. In that organization the SIMT lanes are clustered into groups of four, and each cluster has its own register file that is only 16 bytes wide. The narrower register file provides four 32-bit values to the four associated SIMT lanes. Another design option, used in [101], splits the registers into 32 banks where each bank is just one word wide (4 bytes) and provides data to only one thread within the warp; in effect, every bank is statically assigned to a single processing element. Irrespective of the register file organization, the fundamental problem of activating all 32 registers associated with a single warp remains. With 32 register banks there is no need for DWL, since the wide register is already split into multiple banks. However, even when DWL is unnecessary for narrow registers, the COMA unit can still be re-purposed to gate the bank accesses based on the active mask.

5.5 Evaluation

5.5.1 Evaluation Methodology

We evaluated our proposed leakage and dynamic power saving techniques using GPGPU-Sim v3.02 [21]. We performed our experiments using a Fermi-like GPU configuration; the simulator configuration parameters are shown in Table 5.2. For workload selection we covered different programming styles by selecting workloads from different benchmark suites. The workloads cover different scientific and computational domains that aim to benefit from parallel architectures. We used benchmarks from the NVIDIA CUDA SDK [5], the Rodinia benchmark suite [68] and the Parboil benchmark suite [9]. The workloads used are listed in Table 5.1.

Hardware model          Fermi
Execution model         In-order
No. of SM cores         16
No. of PEs per SM core  32
Register file size      128 kB
Register width          128 bytes
No. of banks            16
Warps/SM                48
Warp scheduler          2-level scheduler
PTXPLUS                 Enabled

Table 5.2: Simulation parameters

5.5.2 Leakage Power Savings with TRIC

In this section we evaluate the leakage power savings of TRIC, as discussed in Section 5.3. Under our read-aggressive, write-conservative switching policy, registers spend a significant amount of time in drowsy mode, and the unallocated registers are entirely turned OFF by TRIC. For the results presented in this section we assume a wakeup latency of three cycles, and we assume that during the wakeup period the register operates at full Vdd. The first bar in Figure 5.7 shows the leakage power saving as a percentage of the total leakage power consumed without TRIC, which is the baseline. As shown, TRIC saves 91% of the leakage power of the register file.

Figure 5.7: Leakage power, dynamic power and total power savings

5.5.3 Dynamic Power Savings with COMA

The second column in Figure 5.7 shows the power saving as a percentage of the total dynamic power consumed without COMA, which is the baseline.
The dynamic power savings from the COMA unit depend on the activity of the running benchmark. Using the categories presented in the motivation section, it is clear that Category 1 workloads cannot benefit from this technique, since 100% of their warps have 32 active threads. The savings for Category 2 and Category 3 workloads depend on the number of active threads. Some workloads in Category 2 have limited power savings because most of their scheduled warps have only a few (0, 1 or 2) inactive threads; as a result, their power savings range from 1% to 6%. Benchmarks such as nw and nn see large dynamic power savings because they have only 16 active threads out of 32 in all scheduled warps. Benchmarks such as heartwall have varying amounts of thread level parallelism, but COMA dynamically adjusts the number of registers that are turned ON, reducing dynamic power by 46% compared to the baseline. The average dynamic power saving across all Category 2 and Category 3 workloads is 19%.

5.5.4 Warped Register File

We refer to the register file that combines the TRIC and COMA techniques as the warped register file. The warped register file first uses TRIC to decide which register entry to bring from the drowsy state to the active state. Once the register entry is active, COMA decides which registers to activate for the given warp based on its active mask.

To compute the relative importance of dynamic and static power, we measured the ratio of the leakage and dynamic power of the register file. We used the method proposed by [64] to measure the leakage and dynamic power of the register file organization under study. We extracted the SRAM cell dimensions and the bitline and WL capacitances from [2] and [49]. The simulation results show that, in 32nm technology, the dynamic power of reading or writing a 128 byte register is twice the leakage power of a register file bank.

The third column in Figure 5.7 shows the total power savings as a percentage of the total power, where total power is computed assuming the dynamic power is twice the leakage power. Category 3 workloads have the highest power savings because they take advantage of both the leakage and the dynamic power saving techniques. Category 1 workloads, on the other hand, take advantage of only the leakage power saving technique and gain nothing from the dynamic technique; as a result their power saving is less than that of Category 2 and 3 workloads. We also computed total power savings assuming higher dynamic-to-leakage power ratios. The results show that the average total power savings are 69%, 59% and 51.5% when the ratio of dynamic to leakage power is 2 to 1, 5 to 1 and 8 to 1, respectively.

5.5.5 Area and Performance Overhead

The area overhead of our proposed techniques comes from the AND gates added to the register file to implement DWL, the COMA unit and its associated mask table, and the TRIC unit. The area overhead of all the added components is around 4% of the total register file area.

Figure 5.8: Performance degradation with drowsy wake-up latencies of two (Drowsy_2_cycles) and three (Drowsy_3_cycles) cycles; the average degradation is 0.7% and 1.0%, respectively
As mentioned earlier, a one cycle wakeup latency for a drowsy register does not impact performance in our baseline, since there is a one cycle slack between instruction scheduling and the register read operation. When the register wakeup latency is more than one cycle, we make the worst-case assumption that the wakeup latency delays the warp's execution. To quantify the effect of the wakeup latency on performance, we ran our workloads with two and three cycle wakeup latencies. Figure 5.8 shows the resulting performance degradation. The results show an average performance degradation of 1.02% for a three cycle wakeup latency. Since GPGPUs have a large number of ready warps, even if one warp is delayed other warps can continue to execute as long as there is no serious contention on the collector units; our results in fact show that collector unit contention is very limited. However, the opportunities to hide the wakeup latency diminish when there are not enough ready warps. For example, the number of warps running in the nn, nw and sgemm benchmarks is less than 20; as a result, these benchmarks suffer the most performance degradation. On the other hand, the benchmarks with many active warps see no performance degradation.

Interestingly, two benchmarks (heartwall and lbm) experienced a performance improvement when the wakeup latency increased from one to two cycles. We analyzed the performance statistics for these benchmarks, and it turns out that the additional wakeup delay leads to a different warp scheduling sequence. The new schedule encountered fewer cache misses due to improved locality, which reduced the stalls caused by memory contention. As a result these benchmarks saw a slight performance improvement from the scheduling perturbation.

5.6 Related Work

Our proposed ideas build on three important prior works. Drowsy caches have been proposed as an efficient technique to reduce leakage power consumption [38]. [75] proposed a tri-modal switch that can put combinational or sequential logic into one of three states: ON, OFF and drowsy. [52] and [99] discussed the DWL technique for avoiding charging the wordline of SRAM cells that are not part of the accessed word. In our study we modified these prior approaches and applied them in the context of GPGPU register file accesses.

CPU Register File and Cache Power Consumption: Dynamic and leakage power reduction techniques for CPU register files have been extensively studied. Techniques have been proposed to reduce register file power consumption at the software level [20] [76], the microarchitecture level [33] and the circuit level [50]. [47] used compile-time register file partitioning and code recompilation to reduce the number of active registers: the register file is divided into active and inactive partitions, and the drowsy technique puts the unused partitions into drowsy mode. Their approach needs code recompilation and explicitly forces applications to use a reduced register set to save power. Our approach neither needs recompilation nor places any restrictions on register usage.

Cache leakage power reduction has also received significant attention. [38] and [60] proposed applying the drowsy leakage reduction technique to data and instruction caches. [16] and [72] studied leakage power reduction in caches.
They used prior knowledge of cache access patterns and cache line inter-access times to apply drowsy mode or cache line turn-off and reduce leakage power. When a cache line is turned off, they rely on the lower levels of the memory hierarchy to fetch the data when needed. These techniques are applicable to cache designs in traditional CPUs, but not to register files in GPGPUs: turning off a register after an extended idle period is not a viable option, since there is no additional memory hierarchy level from which to fetch the lost register data. In our proposed ideas we studied the policies and the architecture and circuit level modifications needed to reduce GPGPU register file power consumption with negligible performance overhead.

GPU Power Consumption: Techniques for reducing GPU leakage power consumption have been proposed. In [90] the authors save GPU leakage power by gating the unused processing elements. In [26] the operating frequency and the number of available SMs are dynamically adjusted during the application run to minimize the overall power consumption.

GPGPU Register File Power Consumption: In [42] the authors studied register file dynamic power reduction using a register file cache (RFC). The RFC reduces the number of accesses to the main register file by caching the registers that have recently been accessed by the active warps. This approach targets dynamic power but not leakage power. In [101] the authors implemented the register file using embedded DRAM (eDRAM). They divided the register file into a set of contexts, each holding the data for a set of warps. On every context switch, the registers for the incoming context are moved into the SRAM portion of the memory, and the unused context is moved out into the DRAM portion of the eDRAM. These two prior GPGPU works focused on reducing the dynamic power of the register file. But in a Fermi-like configuration there are even greater opportunities to save static power, which our work exploits. Given the size of the register file in GPGPUs, it is clear that static power reduction techniques will become critical going forward. Our work thus focuses on reducing both the dynamic and the static power consumption, with negligible performance penalty and small hardware overhead.

5.7 Summary

GPGPU register file power consumption is a significant concern. Through a detailed workload characterization we identified GPGPU-specific opportunities that can be exploited to reduce the register file power consumption. In this chapter we proposed two GPGPU-centric techniques to reduce the static and dynamic power consumption of the GPGPU register file. The first technique relies on the TRIC unit. TRIC uses a tri-modal switch to turn OFF registers that are not allocated for program execution. It then places all allocated registers into the drowsy state by default and brings them to the active state only when they are being accessed. Given the large inter-access distance to registers, this aggressive drowsy state management incurs only a 1% performance overhead on average. The second technique relies on the COMA unit. COMA uses the active mask of a warp to eliminate the activation of the unused register segments in a wide register file organization. The warped register file framework that combines the proposed TRIC and COMA techniques is able to reduce the total power consumption of the register file by 55% to 90%.
Chapter 6

Energy Efficient Partitioned Register File for GPGPUs

6.1 Introduction

In the previous chapter we proposed the warped register file technique to reduce the leakage and dynamic power of the GPU register file. In this chapter we look at more aggressive techniques that can further reduce the dynamic and leakage power of the register file at sub-10nm technology nodes.

One way to reduce the register file power is to design the register file to operate at a near-threshold voltage (NTV). Many research studies [34, 37, 57] have shown that while NTV operation can dramatically reduce power consumption, it has two drawbacks. First, designing the register file to operate at NTV using planar CMOS technology, particularly in the sub-16nm regime, is not feasible because of inadequate control of short channel effects in transistors and high levels of variability (e.g., random dopant fluctuations and line edge roughness) in transistor I-V characteristics, which in turn result in high-leakage logic cells, SRAM cells with low static noise margins, and other reliability concerns [74, 91, 92]. Second, the performance overhead due to the slower access latency of an NTV register file is significant. For example, as we show later in our results section, if the NTV register file delay increases to just 3 cycles (from a 1 cycle access latency at super-threshold operation), the overall performance degrades by 7.1%. Hence, the register file performance is critical to the overall performance of GPUs.

The first concern, regarding stability and reliability at NTV operation, has been shown to be addressed effectively by FinFET technology [48, 91]. This arises mainly from the fact that FinFET devices offer superior scalability [94], lower gate leakage current [29], excellent control of short-channel effects [100], and relative immunity to gate line-edge roughness [24]. As a result, industry has adopted FinFET technology for chip production since 2012 [4, 10]. Thus it is feasible to build a large but stable register file for GPUs using FinFET devices that operate at NTV. But the second concern, namely the performance impact of the slow device access time at NTV operation, cannot be addressed simply by changing the technology node. Figure 6.1 shows the impact of voltage scaling on the overall speed of an inverter built using 7nm FinFET technology (simulation model and details presented in Section 6.6).
Figure 6.1: Delay of a 40-stage FO4 inverter chain vs. Vdd for 7nm FinFET technology (device: STD (LVT); the delay grows from about 140ps at 0.45V to about 342ps at 0.30V, and increases by several orders of magnitude in the sub-threshold region below Vth)

In this simulation we used devices with an actual gate length of 7nm and 1.5nm of underlap on each side, resulting in an effective channel length of 10nm [85]. As shown, the delay of the inverter increases as the voltage is lowered. In this work we consider a supply voltage of 0.3V as NTV and 0.45V as the super-threshold voltage (STV). The delay at NTV is significantly smaller than the delay of operating the inverter in the sub-threshold regime (Vdd < Vth), yet it is still significant compared to the delay at STV. Such a slowdown can have a negative impact on the overall performance of any computing device. For instance, recent FinFET based NTV designs have shown that the access delay of a 16-bit adder triples from 0.051ns at STV to 0.153ns at NTV [97]. Thus it is imperative to deploy microarchitectural solutions to tackle the slow access time of register files built with FinFETs operating at NTV.

Our analysis of the register file access pattern in GPUs, presented in Section 6.2, shows that warps disproportionately access a small subset of their total assigned registers. In other words, the working set of actively used registers is much smaller than the total number of registers assigned to the application. We exploit this observation to tackle the negative impact of the long access latency of a FinFET register file operating at NTV. We divide the main register file into two partitions: the highly utilized registers are placed in a fast register file (FRF) partition operating at super-threshold voltage (STV), where they can be accessed quickly, while the remaining registers are placed in a large slow register file (SRF) partition that always operates at near-threshold voltage (NTV), where an application may suffer a latency penalty for accessing them. Since we are partitioning the main register file (MRF), the total size of the SRF and the FRF equals the size of the MRF; we add no additional storage overhead. Furthermore, the two partitions are designed with different FinFET operating parameters so as to achieve the optimal design goals (switching speed and access energy) for each partition. An alternative to this proposed design is to design a single register file that always operates at NTV.
One may then dynamically switch the voltage of the highly accessed registers to operate at super-threshold voltage (STV). However, it is well known that with standard FinFET technology (without independent back-gate control) it is not possible to design combinational logic cells, and for that matter SRAM cells and register files, that exhibit optimal performance in both the NTV and STV regimes. In other words, cells designed to yield optimal performance at STV will be sub-optimal at NTV and vice versa [66].

Another alternative is to cache a subset of the registers. The register file cache (RFC) approach was explored in [33, 42]. In the RFC approach, recently accessed registers are cached in an SRAM structure called the register file cache. The RFC reduces the number of accesses to the main register file and, as a result, the register file access energy. A detailed comparison of our proposed design with register file caching is presented in the results section, but the main issue is that GPU trends indicate the number of warps issued concurrently is increasing with each generation. For example, in the Kepler architecture [14] each SM has four schedulers and each scheduler can issue up to 2 instructions. For the schedulers to issue 8 instructions each cycle, they must select from a larger pool of active warps. Accordingly, the RFC size must also grow with the issue width so as to cache the recently accessed registers of a larger active warp pool; otherwise the RFC hit rate suffers. In addition, the RFC needs multiple read and write ports to service the tag checks and register file accesses generated by up to 8 instructions each cycle. Hence, scaling the RFC (i.e., using a larger cache and/or more read/write ports) compromises its energy efficiency. Our results show that even with the ability to cache 6 registers for each of up to 16 active warps, the RFC hit rate does not exceed 40%. Hence 60% of the accesses still need to be serviced by the MRF, leaving a large part of the energy consumption concern intact.

In this chapter we make the following contributions:

We present a detailed analysis of the GPU register file access behavior across a wide range of workloads. Our analysis shows that all the warps within the same cooperative thread array (CTA) and across different CTAs have the same register file access behavior.

We introduce the concept of the pilot warp, which is used to collect runtime statistics about the register file access behavior of one warp. We then show that the coarse-grain statistics collected from one pilot warp in one CTA can be used to identify the frequently accessed register subset for all other CTAs in the same kernel.

We propose the partitioned register file design to reduce the dynamic energy and the leakage power of the GPU register file. The partitioned register file puts the highly accessed registers in the fast
We show that the FinFET based register file has better read and write stability. We also show how FinFET’s binary back gate control enables fast and power efficient switching between two different speeds without the need for having multiple different voltage rails. 6.2 Register File Access Behavior in GPUs As we mentioned before in Chapter 2 each kernel in a GPU application is compiled into thousands of threads. These threads are then combined into hundreds of thread blocks, called cooperative thread arrays (CTAs) which are further sub-divided into warps of 32 threads. The GPU runtime then 150 assigns multiple CTAs to each SM. Only when all the warps in a CTA complete a new CTA is assigned by the runtime CTA scheduler. Benchmark Registers/ Threads/ Pilot Thread CTA CTA% bkprp 13 256 2.6% BFS 7 256 0.12% bfs 7 512 5.2% btree 15 508 0.7% CP 12 128 47% cutcp 20 128 3.2% gauss 7 512 0.3% hspot 27 256 3.6% kmns 9 256 7.5% lavaMD 2 128 0.2% lbm 26 120 0.14% lud 9 16 8.8% mri-q 12 512 14.3% MUM 15 256 37% NN 10 169 8.2% nw 21 16 0.48% pf 9 256 0.37% RAY 31 128 25% sgm 27 128 16.2% srad 12 256 0.6% stencil 15 1024 0.2% A VG 2.5% Table 6.1: Benchmarks runtime information. Table 6.1 shows the CTA information for a wide range of workloads. The first column shows the number of registers assigned for each thread. The second column shows the number of threads per each CTA. Each 32 threads will be grouped together as a warp and will be running in a look-step manner. In kepler architecture [14] up to 16 CTAs can run 151 concurrently inside each SM. Hence, up to hundreds of threads per SM and thousands of threads per GPU chip will be running concurrently. Since all the threads are generated from a single kernel, then essen- tially all the warps will execute the same code but with different input operands. As a result, each thread that belongs to the same kernel will be allocated the same number of registers. While it is difficult to predict at compile time the access count of each allocated register, our runtime analysis of a wide range of GPU workloads shows that not all the allo- cated registers are equally accessed. Some registers are accessed more than other registers during the application runtime. For example, register R 0 in the bkprp workload is accessed 6X more times than registerR 6 . 0 0.2 0.4 0.6 0.8 1 bkprp BFS bfs btree CP cutcp gauss htspot kmns lavaMD lbm lud Mri-q MUM NN nw pf RA Y sgm srad stencil AVG Registers Accesses Percentage Top 3 Top 4 Top 5 Figure 6.2: Percentage of accesses to the top N highly accessed registers In order to quantify such access variations across different workloads we calculated the percentage of accesses to the highly accessed registers in each workload. Figure 6.2 shows the number of accesses to the top 3 ,4 and 5 highly accessed registers as a fraction of the total access count of 152 all the registers. An access is defined as either a read or a write operation. As shown, the top 3 registers in each kernel are accessed 60% of the time on average across all the workloads. Moreover, the top 4 and 5 registers are accessed 69% and 76% of the time, respectively. The highly accessed registers for each workload are different across the kernels of the same workload and across different workloads. For ex- ample, registersR 0 , R 8 , andR 9 are the highly accessed registers in the first kernel of bkprp, whileR 4 ,R 5 , andR 6 are the highly accessed regis- ters in the second kernel. 
On the other hand, the highly accessed registers in CP workload areR 1 , R 9 , and R 10 . Hence there is no correlation be- tween the highly accessed registers between different kernels from the same workload and between kernels form different workloads. On the other hand, the highly accessed registers for all the warps for a given kernel are the same. This is expected because all the warps in the same kernel will be running the same code. Due to the control flow and the divergence in the GPU workloads, the execution flow for the threads may be different during divergence and may result in different access frequency between the warps. Our results show that while the absolute access count may vary the highly accessed registers are the same across all the warps that belong to the same kernel. 153 What is surprising is the fact that irrespective of the total number of allocated registers the top three accessed registers account for at least 40% of the total register accesses. For example, while bkprp has 13 registers per thread, and lbm has 26 registers per thread, both benchmarks have nearly 40% of the accesses going to the three most accessed registers. 6.3 Partitioned Register File The access bias towards the top three to five registers is exploited through partitioning the register file. Figure 6.3a shows the baseline register file where all the accesses will be forwarded to the main register file (MRF). On the other hand, figure 6.3b shows our proposed design where the reg- ister file will be physically partitioned into two partitions: One partition of the register file will operate at STV and will have a small subset of the registers and we will refer to it as the fast register file (FRF). The second partition will have the remaining registers and we will refer to it as the slow register file (SRF). The SRF will be designed to operate in the near threshold to save leakage and dynamic energy. Even though FRF and SRF have significantly different access laten- cies, the two register files still appear as a one register file to the compiler and they will also be transparent to the micro-architecture blocks sur- rounding the register file like the operand collectors. The architecture 154 arbiter MRF Warp Scheduler SIMT Lanes Operand Buffering (a) Baseline register file arbiter Warp Scheduler SIMT Lanes FRF SRF Operand Buffering (b) Proposed register file Figure 6.3: Baseline and proposed RFs register file is physically divided between the two register files. Our pro- posed microarchitecture automatically decides which subset of registers are moved to FRF and which registers are swapped back to SRF. The register file access mechanism accesses either the FRF or SRF, but never accesses both register files. Notice that this design is different than the hierarchal register file [33, 42] where the first level of the hierarchy will be accessed first and if the register does not exist then the register file in the second hierarchy will be accessed. Using the SRF will degrade the performance due to its longer access latency. In order to reduce the performance impact most of the accesses must be forwarded to the FRF. Simple allocation policies that statically allocates the first few architected registers in each thread,R 0 ;R 1 :::R n (as- suming FRF hasn registers), to the FRF, while allocating the remaining 155 registers to SRF is not beneficial. 
For example, if we use this assign- ment and assume that registers R 0 , R 1 , R 2 and R 3 will be allocated in the FRF then only 30% of the accesses in the sgm workload will be for- warded to the FRF. However, if we allocate the highly accessed registers to the FRF then 69% of the accesses will be forwarded to the FRF. Hence, we propose to allocate the highly accessed registers in the FRF and the remaining registers in the SRF. In addition, as discussed before, there is no correlation between the highly accessed registers between the kernels of the same workload and across different workloads. Hence we cannot assign the registers stati- cally to the FRF and the SRF based on register numbers either. Hence, we have to identify the highly accessed registers for each kernel inside each workload and find a low overhead mechanism to move these regis- ters to the FRF. In order to identify the highly accessed registers we propose to learn the register access counts by tracking the usage of the registers from a single warp. In the GPU execution model each kernel is executed by thousands of warps. We propose to use one of the workload’s running warps as the pilot warp. The pilot warp is one of the first running warps in each kernel. The pilot warp will be used to collect statistics about the registers access count. 156 Time Pilot_warp W1 W2 Wn Kernel Finishes execution 1. Pilot_warp Finishes execution 2. Apply changes based on collected statistics 1. Pilot_warp Finishes execution 2. Apply changes based on collected statistics Figure 6.4: Kernel execution timeline Figure 6.4 shows how the pilot warp technique is working. First, the pilot warp (marked in red) is selected from the first running CTA to col- lect the register access frequency statistics. Second, when the pilot warp finishes execution it will generate the access frequency for each accessed register. Third, the collected statistics will be sorted to identify the highly accessed registers. Finally, the registers’ mapping to the FRF and SRF will be changed accordingly as will be described shortly. While the pilot warp will be running the same code as the other re- maining warps in the kernel, the dynamic flow of the instructions may be different because of the branch divergence. In order to verify that the pilot warp is a good representation of the other warps and reflects the behavior of the entire CTA, we calculated the number of accesses to the registers in the register file across different warps from the same CTA and from different CTAs. Our results show that on average the number of accesses 157 to various registers differ by no more than 5% irrespective of which warp within any CTA is selected as a pilot warp. Even more encouraging is the fact that even when the absolute register access count is not identi- cal, the sorted list of registers based on access count is the same across the warps within the same CTAs and the warps across different CTAs. Static profiling techniques can be used to collect such statistics but in our proposed technique we eschew static profiling and instead we focus on implementing this techniques using a pilot warp to enable profiling at run time. Note that several hundreds of warps are assigned to run on each SM in a typical GPU kernel. For simplicity, we assume that initially archi- tected registersR 0 ;R 1 :::R n1 of a warp are allocated to FRF that hasn registers, and the remaining registers are allocated SRF. 
When the pilot warp is in the process of collecting register usage statistics there may be at most 63 concurrently executing warps in that SM. These concurrent warps may not benefit from the register access-driven allocation in FRF. In the worst case, the first few architected registers which are allocated to FRF, do not get accessed much. Instead, most of the register accesses are directed to the SRF, leading to some performance loss. But once the pilot warp completes execution, future warps will not suffer this performance loss. The last column in table 6.1 shows the percentage of time the pilot 158 warp runs normalized to the kernel execution time. As shown, in most of the benchmarks the pilot warp runs for less than 5% of the kernel exe- cution time and on average across all the workloads the average run time of the pilot warp is 2.5% of the kernel time . As a result, the overhead of the register file access profiling is very small and it will have small impact on the overall performance as we will show in the evaluation sec- tion. However, in some benchmarks like MUM that have very few warps the pilot warp runs for 37% of the kernel time. As a result, if the most frequently accessed registers in these benchmarks are allocated to SRF by default then these benchmarks suffer SRF access delays. However, if the frequently accessed registers are already allocated to FRF by default then the applications do not suffer any degradation. In order to allocate the highly accessed registers in the FRF and the remaining registers in the SRF, register swapping will be used. For ex- ample ifR n+2 (which is located in the SRF by default) is one of the highly accessed registers for that specific workload andR 0 (which is located in the FRF by default) is not one of the topn highly accessed registers, then R n+2 will swapped into FRF in place ofR 0 , whileR 0 will be placed at R n+2 . 159 R0 R1 R2 R3 R4 R5 R6 R7 R8 R9 R10 R11 R12 R13 FRF SRF (a) Default registers distribution R7 R12 R4 R10 R2 R5 R6 R0 R8 R9 R3 R11 R1 R13 FRF SRF (b) Registers distribution based on pilot warp information Figure 6.5: Registers distribution between the FRF and SRF Figure 6.5 shows an example of the distribution of the registers be- tween the FRF and the SRF while the pilot warp is running(top) and af- ter the pilot warp(bottom) finishes execution. As shown, when the pilot warp is running by default registersR 0 ,R 1 ,R 2 andR 3 will be allocated in the FRF and the remaining registers will be allocated in the SRF. In this case whenever a request toR 0 ,R 1 ,R 2 andR 3 is issued then the request will be forwarded to the FRF, all other requests are sent to the SRF. After the pilot warp finishes execution the registers in the FRF will be swapped with the highly accessed registers in the SRF as shown in Figure 6.5b. In this example registers R 7 , R 12 , R 4 and R 10 are the highly accessed registers. Hence, each register got swapped with one of the registers in the FRF. In order to keep track of the swapped register locations we inte- grated a swapping table that is shown in Figure 6.6. The swapping table will have entries for the old and new location of the moved registers. For 160 example if we want to access registerR 0 then we check the swapping ta- ble and find that registerR 0 will be stored in the location associated with registerR 7 . On the other hand, if the register does not have an entry in the swapping table this means that the old and the new locations are the same. 
To support GPUs running concurrent kernels we will make the pilot warp aware of the concurrently running kernels. So for each running kernel inside the SM we will define a pilot warp. The pilot warp will collect statistics about the registers accesses for that specific kernel. After the pilot warp finishes the highly accessed registers for each kernel will be defined. In order to deal with each kernel differently we will assign each kernel its own swapping table. The content of each table will be updated with its kernel designated highly accessed registers. Hence, each of the warps that belongs to a certain kernel will access its kernel swapping table. 6.3.1 Architectural Support In order to enable the design of the partitioned register file we should be able to determine relative access frequencies of registers. In order to collect that information a set of 63, 2-byte counters are added to each SM. The selection of 63 counters is influenced by the fact that each thread can 161 be allocated at most 63 registers in our simulated GPU design. However, based on the benchmark info shown in table 6.1 on average 14 registers were allocated for each workload. As a result, not all the counters will be active during the register access profiling phase. Additionally a one byte pilot warp-id and one-bit profile mask are added to an SM. Once the pilot warp is selected its id is stored in the warp-id register. The profile mask bit is initially set to one to indicate that a pilot warp is currently in the process of collecting the access counts. Once the pilot warp terminates it automatically resets the mask bit. The mask bit is set again only when a new kernel is launched on the SM. When a warp instruction is scheduled for register access the profile mask bit is checked. If the mask bit is reset then the register access pro- ceeds normally. But if the mask bit is set then the warp-id of the register access warp is compared against the warp-id stored in the pilot warp-id register. If that matches then the register access counters for the regis- ter that is about to be accessed is incremented. Since we have 63 ac- cess counters these are simply indexed by the register number in the pilot warp. Once the pilot warp completes execution the counters values will be compared to identify the topn most accessed registers. Sorting operation can be implemented efficiently on GPU itself using built-in ISA support 162 such as the SHFL instruction in Kepler [12]. The counters and the sorting logic will only be used when the pilot warp is active which counts only for 2.5% of the kernel execution time. Hence, the power of the additional sorting will be negligible because it will be active once per SM over the entire duration of the kernel. A swapping table is integrated in each SM. We explored two different designs for the swapping table. In one approach a 63 entry swapping table is used to indicate the current location of the specific architected register. Recall that at most 63 registers are assigned to a thread. The swapping table stores the 6-bit register id where the indexed architected register is currently stored. This structure occupies about 48 bytes of space for this design. During the register access phase this structure is indexed using the architected register number to find the new architected register number. We also explored an alternate design option. 
As an alternate design option, the swapping table can be implemented as a small CAM structure that holds the register mapping for the top n most accessed registers. There are 2n entries in the table (n entries for the originally mapped registers and n entries for the new top n most accessed registers). Each entry stores the original architected register number and the current mapping of that register in either the SRF or the FRF. Hence, if we store the top 4 accessed registers, the swapping table has 8 entries of 13 bits each (6 bits for the original register id, 6 bits for the swapped register id, and 1 valid bit), for a total size of 104 bits.

We implemented both designs in a detailed RTL model that was synthesized to measure access energy and latency. Given their extremely small size, the access energy and latency of either design are negligible compared to the register file access energy. Hence, without loss of generality, we use the CAM design in the techniques proposed in this chapter; the results would be unchanged with the indexed design.

Figure 6.6: The swapping table content before (empty, with the profile mask set to 1) and after (entries for the pairs R0/R7, R1/R12, R2/R4 and R3/R10, with the profile mask set to 0) the pilot warp finishes execution.

Figure 6.6 shows the content of the swapping table before and after the pilot warp finishes execution. The table is initially empty and the profile mask is set to 1, which means that the pilot warp is running. When the pilot warp finishes execution, the profile mask is set to 0 and the table is updated with the new mapping of the highly accessed registers. For instance, since R7 is the most accessed register, the first entry stores the information that R7 is mapped to R0 and the second entry stores the information that R0 is mapped to R7, and so on.

During the register access phase, if the profile mask bit is set, the swapping table is simply bypassed and all register accesses go to their respective register partition: the first n architected registers go to the FRF and the remaining register accesses go to the SRF. If the profile mask bit is reset, the swapping table is accessed to identify which register partition an access is bound to. If a register's id is not found in the swapping table, its mapping is unchanged and the access is forwarded to the SRF (after swapping, every FRF-resident register has an entry in the table, so a register absent from the table must reside in the SRF).

In order to support concurrent kernel execution, each kernel in an SM has its own set of counters, pilot warp-id register, profile mask and swapping table. Since up to 8 instructions can be issued each cycle in each SM, all of the issued instructions would be checking the swapping table concurrently. In order to avoid such a bottleneck, we use four copies of the swapping table (104 bits each); recall that once the swapping table is filled, it is a read-only structure. The four copies are statically partitioned between the schedulers (one copy per scheduler).

Partitioning the register file into two register files does not add additional banks. A bank in the original register file is horizontally split after n entries to create separate SRF and FRF partitions from the same bank, as shown in Figure 6.3b. Thus only one request can be active for any bank.
For example, if bank 0 in the SRF is assigned a read or write request, then the arbiter cannot issue a read or write request to bank 0 in the FRF, and vice versa. This approach keeps the new register file design transparent to the operand collector logic. In the next section we present some of the technological challenges in SRAM cell design using FinFETs operating in the NTV regime.

6.4 SRAM Cell Design in Sub-10nm

6.4.1 Design Space Exploration

As technology scales, the nominal Vdd becomes smaller. However, operating SRAM cells at lower voltage causes stability issues, especially when the voltage is scaled close to the minimum data retention voltage, V_DDMIN. V_DDMIN is the minimum voltage required to guarantee that SRAM read and write operations are stable and will not fail. In addition, at this supply voltage level the SRAM cell can preserve the stored bit reliably for as long as power is supplied (hold stability). Although the static noise margin (SNM) is important during hold, cell stability during active operation represents a more significant limitation to SRAM operation. In order to eliminate stability concerns at low Vdd, SRAM cells can be sized up [105]. However, larger SRAM cells are expensive in terms of area, especially for the large register files integrated in GPUs; for example, the authors in [105] increased the size of the SRAM cell by 58% to maintain the same bit failure level. In order to solve the stability issue of SRAM cells with smaller overheads, different SRAM cell structures have been proposed [89], such as the 8T SRAM cell [79]. These structures have better read and write stability at low voltage than the 6T SRAM cell. In fact, recent work [18] showed that at the 7nm technology node it is not even feasible to build a 6T SRAM cell, even with increased transistor sizes, and that a viable option at 7nm is the 8T cell design. In addition, the 8T SRAM cell has better hold stability [18] at smaller feature sizes than the 6T SRAM cell [17, 28, 89]. The authors in [27, 65] proposed 9T and 10T SRAM cell designs to improve cell stability and leakage power consumption at different voltage operating modes.

On the manufacturing front, in the sub-10nm regime MOSFETs suffer from process variations due to random dopant fluctuation (RDF) [78, 91]. Due to its high scalability and immunity to RDF, the FinFET has attracted the attention of industry as a reliable solution for building future chips. In fact, Intel, Nvidia and Samsung [4, 7, 10] have already announced the production of FinFET chips or have placed them on their roadmaps for the near future. FinFET devices are more immune to process variations than their planar CMOS counterparts because their un-doped channels eliminate random dopant fluctuation. However, FinFETs still suffer from line edge roughness (LER), which causes variations of the (effective) channel length L, as well as work function variations (WFV), which cause variations of the gate material properties [78]. Both types of variation affect the threshold voltage and the sub-threshold slope of FinFET devices, and their effect becomes much more significant for deeply-scaled FinFET technologies.
The authors in [78, 91] report that the standard deviation of L under LER is 0.8nm and that LER causes 80mV of variation in the threshold voltage. Since we propose to build the register file using FinFETs operating in the near-threshold region, we used Hspice inside a Monte Carlo simulation loop for yield analysis under the aforesaid process variations, relying on the detailed FinFET device models presented in prior work [19]. Our results show that FinFET-based 8T SRAM cells operate correctly, with a static noise margin (from the Hspice simulations) of 0.144V at STV and 0.092V at NTV. In comparison, 6T SRAM cells with a larger cell size than our 8T SRAM cells have an SNM of 0.088V at STV. Hence, these 8T FinFET SRAM cells are well suited for the proposed register file organization.

In our FinFET-based 8T SRAM cell design, we connect the back gate of the PMOS pull-up transistors to VDD, which simultaneously improves the write stability of the cell by weakening the pull-up transistors and reduces the standby leakage power, since the OFF current of the pull-up transistors decreases exponentially. In addition, the 8T SRAM cell has a smaller layout area than 9T and 10T SRAM cells, which in turn reduces the wordline and bitline capacitances. Hence, in our design we assume that 8T SRAM cells are the basic building blocks of the register file. As we show later, the 8T FinFET SRAM cell design is exploited to implement the binary back gate control of the FinFET devices.

6.4.2 Dual-Gate FinFET

Figure 6.7: The structure of the FinFET device model.

Figure 6.7 shows the basic structure of the dual-gate (DG) FinFET transistor. The gate straddles the channel (fin) rather than connecting to the channel through a planar surface. This design provides better electrical characteristics and better control over leakage and short-channel effects. A DG FinFET can be fabricated by removing the top part of the gate region, as shown in Figure 6.7. The DG FinFET has two gates that can be used to control channel formation. For example, the channel in the DG device can be formed by turning both the front and back gates ON (connecting them to VDD), or by turning ON just the front gate; however, the channel formation ability of one gate is highly dependent on the voltage of the other gate. When both the front gate and the back gate are ON, the channel is formed and the gate capacitance is C_g. If the front gate is ON and the back gate is disabled, only the channel of the front gate is formed; the driving current is smaller, but the gate capacitance of the device is also half of C_g, so the dynamic energy of the device is lower. In addition, with the back gate disabled, the threshold voltage of the device is higher, which reduces its leakage power.

  Design        Voltage (V)   ON current (A/um)   SNM (V)
  NTV           0.3           7.505E-04           0.092
  STV, BG=Vdd   0.45          2.372E-03           0.144
  STV, BG=0     0.45          2.427E-04           0.096

Table 6.2: Characteristics of the SRAM cells built in 7nm FinFET.

Table 6.2 shows the operating voltage, the ON current, and the SNM for the three different 8T SRAM cells referenced in this chapter, as generated from the Hspice Monte Carlo simulations.
The first row shows the SRAM cell that operates at NTV. The second and third rows show the SRAM cells operating at STV with the back gate control enabled and disabled, respectively. When both the front and back gates are enabled (turned ON), the ON current is almost 10X larger than when only the front gate is enabled (2.372E-03 vs. 2.427E-04 A/um). Hence, the delay of a gate can vary widely depending on whether the back gate is enabled; however, the accompanying reduction in C_g makes the degradation in device delay smaller. The last column shows the SNM of the three cells: the SNM of the NTV cell and of the STV cell with the back gate set to 0 is smaller than the SNM of the STV cell with the back gate enabled.

6.5 Partitioned Register File Design Using FinFET

6.5.1 SRF Using FinFET

The slow register file is responsible for holding the registers that are not highly accessed; as a result, the demand on the SRF is small. Hence, we chose to operate the SRF in the near-threshold voltage regime. In the SRF, both the front gate and the back gate are connected to the near-threshold voltage when the transistor is ON and to GND when it is OFF. The SRF operating at near-threshold has a 3X longer access delay than the FRF operating at super-threshold voltage. Thus the SRF saves dynamic and leakage energy by reducing the operating voltage, paying for it with additional access latency.

6.5.2 FRF Using FinFET

The FRF is responsible for holding the highly accessed registers. It is designed using FinFET-based SRAM cells operating at nominal Vdd and is accessed in one cycle. In the next subsection we introduce the adaptive FRF optimization, which further reduces the access energy of the FRF when the application is in a low-utilization phase.

6.5.2.1 Adaptive FRF Using FinFET

While the SRF reduces the dynamic energy for almost 30% of the register file accesses, the remaining 70% of accesses are still forwarded to the FRF. Applications running on GPUs go through periods of low computation for several reasons. First, the number of active warps may be lower than the maximum because some warps are waiting for long-latency events such as cache misses and main memory accesses. Second, some warps may be waiting for short-latency events such as RAW data dependences. Third, some warps cannot be issued because the designated resources are unavailable. We explore the possibility of further reducing the access energy by lowering the speed of the FRF when the application is in a low-compute phase. To achieve this we propose the adaptive FRF, in which the FRF switches between two power modes, FRF_high and FRF_low. The adaptive FRF leverages the binary back gate control: operating the FinFET front gate and back gate separately lets us switch between the FRF_high and FRF_low power modes.

Figure 6.8: Schematic of the modified decoder and the SRAM cell structure.

Figure 6.8 shows the FRF SRAM cell design. The red lines in the figure show the extra connections needed to operate the back gates of the transistors. When operating in the FRF_high mode, the back gate control is set to Vdd; when operating in the FRF_low mode, it is set to GND.
In order to achieve this mode switch, we modify the decode circuitry of the FRF as shown in Figure 6.8. The mode signal determines whether the FRF operates in the FRF_high or the FRF_low mode, and signal buffers are added to drive the control signals to all SRAM cells in the same row. When the FRF is operating in the FRF_high mode it is accessed in one cycle; in the FRF_low mode it is accessed in two cycles. Thus the dual operating mode of the FRF saves dynamic energy by reducing the capacitance of the SRAM cell. Note that while the SRF saves energy by reducing voltage, the dual-mode FRF saves energy by reducing capacitance. The potential energy savings from voltage reduction, as in the SRF, are much larger than from reducing capacitance alone; however, the latency impact of capacitance reduction is not as severe as that of voltage reduction. Hence, reducing capacitance in the FRF, as opposed to reducing voltage, provides a good tradeoff between performance and power.

The GPU pipeline uses the operand collectors to buffer the operands of scheduled warps before issuing them to the execute stage, so it is already designed to tolerate variation in register file access time. For example, in the base machine, registers accessing the same bank are scheduled back to back to resolve the conflict, so some register accesses already take longer. As a result, the multi-speed FRF does not complicate the GPU pipeline design.

In order to detect when the application is in a low-compute phase, we use simple epoch-based phase detection. During every 50-cycle epoch, the total number of instructions scheduled for execution is counted by a nine-bit counter that increments every time a warp is issued to the execution units. In our baseline Kepler architecture, at most 8 instructions can be issued every cycle, so at most 400 instructions can be issued in a 50-cycle epoch. At the end of the epoch, if the number of issued instructions is below a certain threshold, we assume the GPU is in a low-compute-demand phase; accordingly, in the next epoch we use the back gate control of the FinFET devices to put the FRF into low power mode by setting the mode signal to '0'. As will be shown in the evaluation section, this simple technique is sufficient for predicting when to switch the FRF between the high and low power modes. When the FRF operates in low power mode, the access energy is reduced by 32% because the back gate control is disabled (i.e., set to GND).

6.6 Evaluation

6.6.1 Evaluation Methodology

We evaluated the performance and energy savings of our proposed techniques using GPGPU-Sim v3.02 [21]. Table 6.3 shows the GPGPU-Sim configuration used in our simulations. We selected 21 workloads covering a wide range of scientific and computational domains from the Rodinia [68] and Parboil [9] benchmark suites and the workloads provided with GPGPU-Sim [21]. For the schedulers, we used the two-level scheduler proposed to support the RFC [42]; in addition, we ran our experiments with the GTO, the CTA-based [55], and the fetch group [73] schedulers. Our technique shows consistent performance across all the schedulers; in the results section we show results for the two-level warp scheduler and the GTO scheduler.
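As part of this setup, we also modeled the adaptive FRF controller of Section 6.5.2.1 inside the simulator. A minimal sketch of that model is shown below; the class and member names are ours, and the constants follow the text (50-cycle epochs, at most 8 issues per cycle, and a threshold of 85 of the 400 issue slots, as discussed in the next subsection).

    #include <cstdint>

    // Simplified model of the epoch-based FRF mode controller.
    class FrfModeController {
        static constexpr uint32_t kEpochCycles = 50;
        static constexpr uint32_t kThreshold   = 85;   // out of 8 * 50 = 400 slots

        uint16_t issued_ = 0;      // a nine-bit counter in the actual hardware
        uint32_t cycle_  = 0;
        bool low_power_  = false;  // mode signal: false = FRF_high, true = FRF_low

    public:
        void on_issue() { ++issued_; }  // a warp was issued to the execution units

        // Called once per cycle; at each epoch boundary, re-evaluate the mode.
        void tick() {
            if (++cycle_ % kEpochCycles == 0) {
                low_power_ = (issued_ < kThreshold);  // low compute phase detected
                issued_ = 0;
            }
        }

        // FRF access latency in cycles under the current mode.
        uint32_t frf_latency() const { return low_power_ ? 2 : 1; }
    };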
  GPGPU architecture:
    Architecture                   Kepler GTX 780
    SMs                            15
    Warps per SM                   64
  SM architecture:
    SIMT clusters                  6
    SIMT lanes per SIMT cluster    32
  Register file architecture:
    Size                           256KB
    Banks                          24
    Operand collector units        24

Table 6.3: Experimental setup.

6.6.2 Proposed Register File Characteristics

In order to evaluate the energy, timing and area characteristics of the proposed register file, we updated the 7nm FinFET models used in [18] to model the DG FinFET devices and the back gate control used in our proposed register file. We used the Synopsys Technology Computer-Aided Design (TCAD) tool suite [11] to generate device characteristics such as the gate capacitance and the ON/OFF currents. We also modified the decoder circuitry of the register file in FinCACTI [18] to include the mode signal required to switch between the FRF_high and FRF_low modes and the additional signal buffers shown in Figure 6.8. We modeled the FinFET partitioned register file by mapping 4 registers per warp to the FRF and the remaining registers to the SRF. Since Kepler supports 64 warps [14], the FRF size is 32KB and the SRF size is 224KB, which totals 256KB, the register file size in Kepler; thus the FRF is only 12.5% of the total register file size. We also evaluated a baseline with a single main register file (MRF) of 256KB operating at super-threshold voltage. This baseline gives the highest performance, and we compare our partitioned register file performance against it.

  RF type    Access energy (pJ)   Leakage power (mW)   Size (kB)
  FRF_low    5.25                 7.28                 32
  FRF_high   7.65                 7.28                 32
  SRF        7.03                 13.4                 224
  MRF        14.9                 33.8                 256

Table 6.4: Size, access energy and leakage power of the baseline and the proposed register file.

Table 6.4 shows the size, access energy and leakage power of the partitioned register file with the above sizes. The SRF, which operates continuously at NTV, has an access energy of 7.03 picojoules (pJ), compared to 14.9 pJ for the monolithic main register file operating at STV. Due to its smaller size, the FRF access energy is 7.65 pJ even in the FRF_high mode; when the FRF_low mode is enabled, the capacitance reduction lowers the access energy further to 5.25 pJ. We also measured the area overhead of our additional connections and structures, including the wire connections from the register file decoder to the back gates of the FinFET transistors, the buffers required to drive the mode signal, and the layout overhead of the back gate connections of the SRAM cells. Based on FinCACTI [18] simulations, the area of the baseline register file is 0.2mm^2 whereas the area of our proposed register file is 0.214mm^2, resulting in less than 10% area overhead. The area of the additional counters and the sorting logic used by the pilot warp is negligible compared to the area of the register file.

6.6.3 Energy Savings

In order to quantify the benefits of our proposed techniques, we modified GPGPU-Sim [21] to model the proposed register file. Based on the Hspice and FinCACTI simulations, the FRF_high access time is 0.08ns; hence the FRF_high can be accessed in a single clock cycle even at high frequencies. When the FRF operates in the FRF_low mode it is accessed in two cycles, and the SRF is accessed in three cycles.
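Before turning to the results, a back-of-the-envelope check combines the per-access energies of Table 6.4 with an assumed access mix; the 70%/30% FRF/SRF split and the 30% FRF_low residency used below are taken from the measurements reported in the next subsection and are inputs to the model, not constants of the design.

    #include <cstdio>

    // Rough per-access energy model built from Table 6.4 (energies in pJ).
    int main() {
        const double e_frf_high = 7.65, e_frf_low = 5.25, e_srf = 7.03, e_mrf = 14.9;
        const double f_frf = 0.70;  // assumed fraction of accesses served by the FRF
        const double f_low = 0.30;  // assumed fraction of FRF accesses in FRF_low mode

        double e_frf = f_low * e_frf_low + (1.0 - f_low) * e_frf_high;  // 6.93 pJ
        double e_avg = f_frf * e_frf + (1.0 - f_frf) * e_srf;           // 6.96 pJ
        std::printf("avg energy/access: %.2f pJ (%.0f%% below the %.1f pJ MRF)\n",
                    e_avg, 100.0 * (1.0 - e_avg / e_mrf), e_mrf);
        return 0;
    }

This simple mix already yields roughly 53% lower access energy than the MRF, in line with the dynamic energy savings reported below.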
We also model the swapping table in GPGPU-Sim in order to forward accesses to the correct register file, as described before.

Figure 6.9 shows the distribution of accesses by register file type and FRF power mode, assuming four registers per warp in the FRF and the remaining registers in the SRF.

Figure 6.9: Proposed register file access distribution (per-benchmark breakdown of accesses among FRF_high, FRF_low and SRF).

As shown before in Figure 6.2, when the top four registers are mapped to the FRF, about 30% of the accesses are forwarded to the SRF while 70% are forwarded to the FRF. In the proposed partitioned register file with back-gate-controlled FRF, the percentage of time the FRF operates in the FRF_low mode depends on the selected threshold. We performed a detailed design space exploration of this threshold to weigh the energy savings against potential performance penalties; our results show that any threshold around 85 works well. In the interest of space, we present results with the threshold set at 85: if the number of instructions issued in a 50-cycle window is less than 85 out of a total of 400 issue slots, the system is considered to be in a low-compute phase. At this threshold, 30% of the accesses to the FRF take place while the FRF is in the FRF_low mode. As expected, applications with high compute demand that do not waste issue slots, such as cutcp and htspot, rarely enter the low-compute phase, and hence most of their FRF accesses occur in the FRF_high mode.

Figure 6.10: Energy savings (register file dynamic energy of the partitioned RF, with and without the adaptive FRF, normalized to the baseline).

Figure 6.10 shows the dynamic energy of our proposed approach normalized to the baseline, in which a single large main register file operates at super-threshold voltage. The first bar shows the normalized energy of the partitioned register file alone, and the second bar shows the normalized energy when the adaptive FRF technique, using the threshold of 85 to put the FRF into low power mode, is also applied. Our proposed techniques save 55% of the register file dynamic energy across all the benchmarks. We also compared the partitioned register file against a main register file that always operates at near-threshold voltage: the monolithic register file at NTV saves 47% of the register file energy, which is lower than the savings of our partitioned register file. The reason is the extra savings from the adaptive FRF technique, which reduces the access energy further.

In addition to dynamic energy, our technique saves leakage power. The leakage savings come from the SRF, where all registers operate in the near-threshold region. The leakage power of the SRF and the FRF are shown in Table 6.4: the FRF leakage power is about 21.5% of the MRF baseline leakage power, since the FRF is much smaller than the MRF, and the SRF leakage power is about 39.7% of the MRF leakage power because the SRF operates at NTV. Hence our proposed register file saves 39% of the register file leakage power.
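The leakage figure can be checked directly from Table 6.4:

    (P_FRF + P_SRF) / P_MRF = (7.28 mW + 13.4 mW) / 33.8 mW ≈ 0.61,

so the partitioned design retains about 61% of the MRF leakage, which matches the reported 39% savings.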
6.6.4 Performance Overhead

Figure 6.11 shows the performance overhead of the proposed techniques when using the greedy-then-oldest (GTO) and the two-level (TL) [42] schedulers. The numbers show performance normalized to the baseline with the MRF running at super-threshold voltage and using the same scheduler (i.e., GTO is normalized to the base machine using the GTO scheduler, and TL to the base machine using the TL scheduler). Across the different schedulers, our technique has only 0.5% and 2% performance overhead when using the GTO and TL schedulers, respectively. In contrast, when the MRF in the baseline machine operates at NTV all the time, it suffers a 7.1% performance overhead with the two-level scheduler. Hence our proposed register file achieves better energy savings with negligible loss of performance.

Figure 6.11: Execution time of the proposed design, normalized to the baseline with the same scheduler (TL and GTO).

In order to study the impact of the SRF speed on overall performance, we ran our workloads with longer SRF access latencies. Our results show only 0.3% and 1.5% degradation in performance when the SRF access delay is 4 cycles and 5 cycles, respectively. As mentioned before, we used an epoch length of 50 cycles in our adaptive FRF technique. To verify that our results are not sensitive to the epoch length, we ran our simulations for different epoch lengths with the same threshold ratio; for example, with a 50-cycle epoch we used 85 out of 400 as the threshold value (i.e., the workload issued fewer than 85 instructions out of the maximum possible issues), so the threshold was set to 20% in all the sensitivity simulations. The results show that the epoch length has a small impact on performance.

6.6.5 Partitioned vs. Hierarchical Register Files

One way to avoid accessing the MRF is to use a hierarchical register file, in which a multi-level register file is accessed similarly to a cache hierarchy: the first level holds the recently accessed registers, and if a register is not found in the first level, the second level is checked. The work in [42] applied the hierarchical register file to GPUs; since multiple warps run concurrently, the authors proposed the two-level scheduler [42] to reduce the size of the register file cache (RFC). However, with the increase in the number of schedulers per SM and the number of instructions issued each cycle, the RFC must handle all concurrent requests efficiently in future designs.

The authors in [42] proposed using a multi-ported register file in order to read all the operands in one cycle. This solution works well with one scheduler per SM issuing two instructions per cycle; however, when applied to newer architectures that issue up to 8 instructions per cycle, the number of cache ports must also be scaled.
But when the port count is increased (R=8, W=4) to support 4 instruction issue per cycle the access energy of RFC is 3X the access energy of MRF. Hence, the energy savings from accessing the RFC will be compromised as we use more ports. Another approach to reduce port contention is to use a banked RFC design. Our simulations show that the access energy of an 8 banked RFC is nearly the same as the access energy of MRF. We also explored dis- tributing the RFC between the schedulers where each scheduler will have its own RFC. However, the distributed RFC design will complicate the design significantly since the register writeback stage must then write to different RFC banks each associated with different scheduler. Since increasing the number of ports is expensive, we did a quantita- tive comparison between the RFC and the partitioned register file when we increase the number of banks only. Figure 6.12 shows how the dynamic energy and the execution time of the RFC and the partitioned register file scales as we increase the num- ber of banks without increasing the number of read/write ports. In the 185 0 0.2 0.4 0.6 0.8 1 1.2 0 0.2 0.4 0.6 0.8 1 1.2 RFC Proposed RFC Proposed RFC Proposed RFC Proposed (1,1,8,MRF_NTV) (2,2,16, MRF_NTV) (4,4,32,MRF_NTV) (4,4,32,MRF_STV) Normalized Execution Time Dynamic Energy RFC=6kB RFC=12kB RFC=24kB RFC=24kB Figure 6.12: Scalability of the RFC and partitioned register file experiments we used four different configuration parameters that follow the trend of GPUs scaling. These four parameters are listed under each set of bars in the figure insides the parenthesis. The first parameter in the configuration shows the number of schedulers/SM. The second shows the number of RFC banks and the third shows the number of active warps. The last parameter shows the operating region of the MRF when the RFC is used. The size of the RFC in each configuration is shown on top of the RFC bars in the figure. Note that as we increase the number of active warps the RFC size increases accordingly to be able to accommodate the registers of all active warps. For example, the RFC size is 6kB, 12kB and 24kB when the number of active warps is 8, 16 and 32, respectively. The RFC and the partitioned register file techniques are evaluated for each configuration. For fair comparison we assumed that the MRF is operat- ing in NTV . As the first set of bars show, the dynamic energy of the RFC is close to the dynamic energy savings of the partitioned register file. 186 However, as we increase the number of schedulers/SM and the number of active warps the RFC energy savings decreases as shown in the second and third sets of bars. On the other hand, the partitioned register file has constant energy savings all the time since its energy savings does not rely on the number of active warps. We can correlate the low energy savings when the RFC is used to the low RFC hit rate. Our simulations show that when we have 32 active warps the RFC hit rate is still lower than 45%. Hence, more than half of the accesses are forwarded to the MRF. The lines on top of each configuration shows the execution time of the RFC and the partitioned register file. As shown the partitioned register file has lower performance overhead compared to the RFC. The RFC has 9.5%, 3.8% and 3.3% performance overhead when the number of active warps is 8,16 and 32 respectively. The last set of bars shows the dynamic energy savings of the RFC when the MRF is operating at STV . 
In this region, the RFC and the MRF are both accessible in one cycle, so there is no performance overhead; however, the RFC then saves only 10% of the register file dynamic energy.

6.7 Related Work

GPU power efficiency and performance have been widely studied, with techniques tackling inefficiency at the register file level, the execution unit level, and the memory system level. Several works [42, 101] proposed techniques to save dynamic and static power of GPU register files using circuit-level and micro-architectural techniques; we already provided a detailed comparison of our work with the register file cache. The authors in [45] proposed a power-efficient design for the execution units and the register file that takes advantage of operand similarities across the executed warps; in contrast, our design relies only on access counts to separate registers and does not look at their content.

Warp scheduling: Since the scheduling decision can have a great impact on GPU performance and power, the scheduler has been the target of many optimizations. The authors in [55, 59, 73, 83] proposed different scheduling schemes to improve the performance of GPUs and the memory system. Some schedulers are developed solely to improve performance [55, 56, 59, 73], while others are developed in the context of another proposed idea that works best when altering the warp schedule [42, 83, 101]. We showed that the split register file design works well with many well-known performance-centric schedulers.

Power-efficient SRAM cells: Several works have been proposed to improve the power efficiency of SRAM structures. The authors in [18, 28] show that 8T SRAM cells and FinFET technology enable building stable SRAM at low operating voltage. The authors in [43] proposed using different SRAM cell sizes to enable the design of multi-Vcc caches, and multiple techniques have been proposed to reduce the failure probability of SRAM cells at low voltage [43, 87]. In this work we take advantage of these proposed techniques to enable the design of the SRF in our partitioned register file.

Near-threshold computing: Near-threshold computing has been widely deployed to build different micro-architectural blocks and processors. The authors in [37, 58, 103] designed complete parallel processors that operate at NTV. The authors in [35] proposed mixing slow cores operating at NTV with faster L1 caches to improve energy efficiency, and the authors in [34] proposed building caches from a mix of near-threshold-tolerant ways, to reduce energy, and traditional ways, to maintain performance, with the cache access policy changed at runtime to minimize the performance overhead. The authors in [97] performed a broad analysis of designing execution units at near threshold across different technology nodes. In contrast, our work focuses on designing a low-power register file for high-throughput processors such as GPUs. In our proposed architecture we divide the register file into two register files, the SRF and the FRF: the SRF holds the rarely accessed registers and the small FRF holds the most accessed registers.
In order to reduce the performance impact of operating part of the register file at NTV, we propose a runtime adaptive technique that controls the placement of the registers in the FRF and the SRF.

6.8 Summary

In this chapter we proposed dividing the register file into a fast register file (FRF) and a slow register file (SRF). The partitioning is done based on a simple runtime access counting mechanism: the FRF holds the highly accessed registers and runs in a high performance mode, while the SRF holds the remaining registers and runs in the near-threshold mode. We target our register file design at future sub-10nm technologies, where MOSFETs are shown to be difficult to build and there is a strong industry trend toward FinFET designs. We built the partitioned register file using 7nm FinFET technology: the SRF always operates at NTV, while the FRF operates at STV. We also take advantage of the back gate control in FinFETs to reduce the capacitance of the FRF by operating it in high- and low-capacitance modes. Our proposed techniques save 39% of the leakage energy and 55% of the dynamic energy with less than 2% performance overhead.

Chapter 7

Conclusion

In this dissertation we focused on improving the energy efficiency of GPUs in the presence of dynamically varying resource utilization. We also demonstrated that the GPU execution model has unique features that can be exploited to tackle the varying-utilization concern. Accordingly, we presented several techniques to improve the energy efficiency of GPUs.

In Chapters 3 and 4 we focused on reducing the leakage power of the execution units in GPUs using the Warped-Gates and Origami techniques, respectively. Our analysis shows that leakage energy is around 50% of the total GPU energy and that the leakage energy of the execution units is around 10% of the total GPU energy. Power gating can reduce the leakage energy of the execution units; however, for power gating to be effective, the targeted block should have long idle periods.

In Chapter 3 we analyzed the idle periods of the GPU execution units and found that they are too short. The primary reason is that the warp scheduler in a GPU is agnostic to the current power gating state of the execution units; as a result, execution units frequently move between active and inactive states, curtailing power gating opportunities. Thus we proposed two techniques to enable effective power gating in GPUs, both of which increase the length of the idle periods. First, we proposed the GATES scheduler, which gives higher priority to warps that use the same type of execution resources so as to elongate the active and idle times of each execution unit type. Second, we modified the power gating state machine to eliminate the scenarios in which a power-gated unit switches to the ON state before compensating for the power gating overhead. Combined, the two solutions enhance the power gating capabilities of GPU execution units.

In Chapter 4 we further improved the power gating capabilities of the GPU execution units by exploiting short pipeline bubbles and the variation in the idleness of the SIMT lanes. We proposed the Origami technique to enable power gating at a fine grain. Origami relies on warp folding to create power gating opportunities.
Warp folding essentially splits a single block of 32 threads into sub-warps, each with fewer than 32 threads. When warp folding is enabled, multiple sub-warps are scheduled to run only on the lower-order execution lanes, leaving the higher-order execution lanes idle for longer periods of time. Origami significantly reduces the leakage power of the GPU execution units with negligible performance overhead.

In conclusion, the proposed techniques save 50% of the leakage energy of the GPU execution units with negligible performance overhead; at the GPU level, they save 5% of the total GPU power.

In Chapters 5 and 6 we targeted the power efficiency of the register file. The register file in GPUs holds the context of the thousands of threads that run concurrently; hence, improving its energy efficiency without degrading performance is challenging. In Chapter 5 we made the observation that register inter-access times are on the order of hundreds of cycles, and that not all the registers assigned to each warp are used during the application's runtime. Inspired by these observations, we proposed the Warped-Register File design, in which each register in the GPU register file is augmented with a tri-modal switch that moves the register between the OFF, drowsy and ON states based on the register usage mode. In addition, we enhanced the register file design to allow narrow-width register access, in which only a subset of the 32 thread registers is activated on a single read operation; the set of active registers is determined by the active state of each thread in the warp. The Warped-Register File significantly reduces the leakage energy of the register file, by 91%, and reduces the dynamic energy by 19% for divergent applications.

In Chapter 6 we further improved the register file energy efficiency by reducing the register file access energy. Based on the observation that a small portion of the registers assigned to each thread are accessed the majority of the time, we proposed dividing the register file into two partitions: the highly accessed registers are allocated in a small, fast partition, the Fast Register File (FRF), and the remaining registers in a large but slow partition, the Slow Register File (SRF). The SRF operates at NTV and the FRF at STV. In order to identify the slow and fast registers, we proposed the pilot warp technique, which builds on the fact that all the threads in the same kernel execute the same code: the pilot warp collects statistics from an early-running warp, and this information is used to optimize the register allocation of future warps and improve power efficiency. We used FinFET technology to enable the design of the SRF and FRF at sub-10nm nodes. The proposed register file reduces the register file access energy by 50%.

In conclusion, our proposed techniques in Chapters 5 and 6 reduce the leakage energy and the dynamic energy of the register file by more than 90% and 50%, respectively.

Bibliography

[1] Arizona State University predictive technology model. http://ptm.asu.edu.

[2] CACTI 6.0: A tool to understand large caches. http://www.cs.utah.edu/~rajeev/cacti6/.
[3] The FreePDK process design kit. http://www.eda.ncsu.edu/wiki/FreePDK.

[4] Intel and IBM to lay out 14nm FinFET strategies on competing substrates at IEDM 2014. http://electroiq.com/blog/2014/10/intel-and-ibm-lay-out-14nm-finfet-strategies-on-competing-substrates-at-iedm-2014/.

[5] Nvidia CUDA SDK 4.2. developer.nvidia.com/cuda/cuda-downloads.

[6] Nvidia, Fermi white paper v1.1. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf.

[7] Nvidia says Tegra 6 'Parker' will be a 64-bit chip using FinFET transistors. http://www.theinquirer.net/inquirer/news/2255915/nvidia-says-tegra-6-parker-will-be-a-64bit-chip-using-finfet-transistors.

[8] OpenCL.

[9] Parboil benchmark suite. http://impact.crhc.illinois.edu/parboil.php.

[10] Samsung shows 14nm chip. http://www.eetimes.com.

[11] Synopsys Technology Computer-Aided Design (TCAD). http://www.synopsys.com/tools/tcad.

[12] CUDA pro tip: Do the Kepler shuffle. http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-shuffle/.

[13] AMD Graphics Cores Next (GCN) architecture. Technical report, AMD, June 2012.

[14] Nvidia's next generation CUDA compute architecture: Kepler GK110. Technical report, Nvidia, 2012.

[15] Mohammad Abdel-Majeed, Daniel Wong, and Murali Annavaram. Warped gates: Gating aware scheduling and power gating for GPGPUs. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.

[16] Jaume Abella, Antonio González, Xavier Vera, and Michael F. P. O'Boyle. IATAC: A smart predictor to turn-off L2 cache lines. ACM Transactions on Architecture and Code Optimization, 2005.

[17] A. Agarwal, S. Hsu, S. Mathew, M. Anders, H. Kaul, F. Sheikh, and R. Krishnamurthy. A 32nm 8.3GHz 64-entry x 32b variation tolerant near-threshold voltage register file. In Proceedings of the IEEE Symposium on VLSI Circuits, 2010.

[18] Alireza Shafaei, Yanzhi Wang, Xue Lin, and Massoud Pedram. FinCACTI: Architectural analysis and modeling of caches with deeply-scaled FinFET devices. In Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2014.

[19] Alireza Shafaei, Shuang Chen, Yanzhi Wang, and Massoud Pedram. A cross-layer framework for designing and optimizing deeply-scaled FinFET-based SRAM cells under process variations. In Proceedings of the 20th Asia and South Pacific Design Automation Conference, 2015.

[20] José L. Ayala, Alexander Veidenbaum, and Marisa López-Vallejo. Power-aware compilation for register file energy reduction. International Journal of Parallel Programming, 2003.

[21] A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, and T.M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2009.

[22] D. Bautista, J. Sahuquillo, H. Hassan, S. Petit, and J. Duato. A simple power-aware scheduling for multicore systems when running real-time applications. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing, pages 1-7, 2008.

[23] André R. Brodtkorb, Trond R. Hagen, Knut-Andreas Lie, and Jostein R. Natvig. Simulation and visualization of the Saint-Venant system using GPUs. Computing and Visualization in Science, 2010.

[24] A.R. Brown, A. Asenov, and J.R. Watling. Intrinsic fluctuations in sub 10-nm double-gate MOSFETs introduced by discreteness of charge and matter. IEEE Transactions on Nanotechnology, 2002.

[25] Nicolas Brunie, Sylvain Collange, and Gregory Diamos. Simultaneous branch and warp interweaving for sustained GPU performance.
In Proceedings of the 39th Annual International Symposium on Computer Architecture, 2012.

[26] Juan M. Cebrián, Ginés D. Guerrero, and José M. García. Energy efficiency analysis of GPUs. In Proceedings of the IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012.

[27] Ik Joon Chang, Jae-Joon Kim, Sang Phill Park, and K. Roy. A 32kb 10T sub-threshold SRAM array with bit-interleaving and differential read scheme in 90nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference, Digest of Technical Papers, 2008.

[28] L. Chang, R.K. Montoye, Y. Nakamura, K.A. Batson, R.J. Eickemeyer, R.H. Dennard, W. Haensch, and D. Jamsek. An 8T-SRAM for variability tolerance and low-voltage operation in high-performance caches. IEEE Journal of Solid-State Circuits, 2008.

[29] Leland Chang, K.J. Yang, Yee-Chia Yeo, Yang-Kyu Choi, Tsu-Jae King, and Chenming Hu. Reduction of direct-tunneling gate leakage current in double-gate and ultra-thin body MOSFETs. In Proceedings of the International Electron Devices Meeting, 2001.

[30] Changhwan Shin, Tsu-Jae King Liu, Borivoje Nikolic, and Eugene Haller. Advanced MOSFET designs and implications for SRAM scaling. Technical report, 2011.

[31] Lizhong Chen and T.M. Pinkston. NoRD: Node-router decoupling for effective power-gating of on-chip routers. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

[32] NVIDIA Corp. NVIDIA CUDA: Compute Unified Device Architecture, 2007.

[33] José-Lorenzo Cruz, Antonio González, Mateo Valero, and Nigel P. Topham. Multiple-banked register file architectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, 2000.

[34] R.G. Dreslinski, G.K. Chen, T. Mudge, D. Blaauw, D. Sylvester, and K. Flautner. Reconfigurable energy efficient near threshold cache architectures. In Proceedings of the 41st IEEE/ACM International Symposium on Microarchitecture, 2008.

[35] R.G. Dreslinski, Bo Zhai, T. Mudge, D. Blaauw, and D. Sylvester. An energy efficient parallel architecture using near threshold operation. In Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, 2007.

[36] Steven Dropsho, Volkan Kursun, David H. Albonesi, Sandhya Dwarkadas, and Eby G. Friedman. Managing static leakage energy in microprocessor functional units. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, 2002.

[37] D. Fick, R.G. Dreslinski, B. Giridhar, Gyouho Kim, Sangwon Seo, M. Fojtik, S. Satpathy, Yoonmyung Lee, Daeyeon Kim, N. Liu, M. Wieckowski, G. Chen, T. Mudge, D. Blaauw, and D. Sylvester. Centip3De: A cluster-based NTC architecture with 64 ARM Cortex-M3 cores in 3D stacked 130nm CMOS. IEEE Journal of Solid-State Circuits, 2013.

[38] K. Flautner, Nam Sung Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple techniques for reducing leakage power. In Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002.

[39] Wilson W. L. Fung and Tor M. Aamodt. Energy efficient GPU transactional memory via space-time optimizations. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.

[40] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007.

[41] W.W.L. Fung and T.M. Aamodt.
Thread block compaction for efficient SIMT control flow. In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture, 2011.

[42] Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron. Energy-efficient mechanisms for managing thread context in throughput processors. In Proceedings of the 38th Annual International Symposium on Computer Architecture, 2011.

[43] H.R. Ghasemi, S.C. Draper, and Nam Sung Kim. Low-voltage on-chip cache architecture using heterogeneous cell sizes for high-performance processors. In Proceedings of the IEEE 17th International Symposium on High Performance Computer Architecture, 2011.

[44] Syed Zohaib Gilani, Nam Sung Kim, and Michael J. Schulte. Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.

[45] S.Z. Gilani, Nam Sung Kim, and M.J. Schulte. Power-efficient computing for compute-intensive GPGPU applications. In Proceedings of the IEEE 19th International Symposium on High Performance Computer Architecture, 2013.

[46] S.Z. Gilani, Nam Sung Kim, and M.J. Schulte. Power-efficient computing for compute-intensive GPGPU applications. In Proceedings of the IEEE 19th International Symposium on High Performance Computer Architecture, pages 330-341, 2013.

[47] Xuan Guan and Yunsi Fei. Register file partitioning and recompilation for register file power reduction. ACM Transactions on Design Automation of Electronic Systems, 2010.

[48] Zheng Guo, S. Balasubramanian, R. Zlatanovici, Tsu-Jae King, and B. Nikolic. FinFET-based SRAM design. In Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005.

[49] Ron Ho. On-chip wires: Scaling and efficiency. PhD dissertation, Department of Electrical Engineering, Stanford University, 2003.

[50] Jianping Hu, Tiefeng Xu, and Hong Li. A lower-power register file based on complementary pass-transistor adiabatic logic. IEICE Transactions on Information and Systems, 2005.

[51] Zhigang Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose. Microarchitectural techniques for power gating of execution units. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, 2004.

[52] K. Itoh, K. Sasaki, and Y. Nakagome. Trends in low-power RAM circuit technologies. Proceedings of the IEEE, 1995.

[53] Hyeran Jeon and M. Annavaram. Warped-DMR: Light-weight error detection for GPGPU. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

[54] A. Jog, O. Kayiran, A. Mishra, M. Kandemir, O. Mutlu, R. Iyer, and C. Das. Orchestrated scheduling and prefetching for GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013.

[55] Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2013.

[56] Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. Orchestrated scheduling and prefetching for GPGPUs. SIGARCH Computer Architecture News, 2013.

[57] U.R. Karpuzcu, A.
Sinkar, Nam Sung Kim, and J. Torrellas. EnergySmart: Toward energy-efficient manycores for near-threshold computing. In Proceedings of the IEEE 19th International Symposium on High Performance Computer Architecture, 2013.

[58] H. Kaul, M.A. Anders, S.K. Mathew, S.K. Hsu, A. Agarwal, R.K. Krishnamurthy, and S. Borkar. A 300mV 494GOPS/W reconfigurable dual-supply 4-way SIMD vector processing accelerator in 45nm CMOS. In Proceedings of the IEEE International Solid-State Circuits Conference, Digest of Technical Papers, 2009.

[59] Onur Kayiran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. Neither more nor less: Optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013.

[60] Nam Sung Kim, K. Flautner, D. Blaauw, and T. Mudge. Drowsy instruction caches: Leakage power reduction using dynamic voltage scaling and cache sub-bank prediction. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002.

[61] M. Kondo and H. Nakamura. A small, fast and low-power register file by bit-partitioning. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005.

[62] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. GPUWattch: Enabling energy optimizations in GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013.

[63] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M. Tullsen, and Norman P. Jouppi. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.

[64] Xiaoyao Liang, K. Turgay, and D. Brooks. Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2007.

[65] Sheng Lin, Yong-Bin Kim, and Fabrizio Lombardi. A low leakage 9T SRAM cell for ultra-low power operation. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI, 2008.

[66] Xue Lin, Yanzhi Wang, and M. Pedram. Joint sizing and adaptive independent gate control for FinFET circuits operating in multiple voltage regimes using the logical effort method. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 2013.

[67] Anita Lungu, Pradip Bose, Alper Buyuktosunoglu, and Daniel J. Sorin. Dynamic power gating with quality guarantees. In Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design, 2009.

[68] M.A. Goodrum, M.J. Trotter, A. Aksel, S.T. Acton, and K. Skadron. Parallelization of particle filter algorithms. In 3rd Workshop on Emerging Applications and Many-core Architecture, 2010.

[69] Niti Madan, Alper Buyuktosunoglu, Pradip Bose, and Murali Annavaram. A case for guarded power gating for multi-core processors. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, 2011.

[70] David Meisner, Brian T. Gold, and Thomas F. Wenisch. PowerNap: Eliminating server idle power. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, 2009.

[71] David Meisner and Thomas F. Wenisch. DreamWeaver: Architectural support for deep sleep.
In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, 2012.

[72] Yan Meng, Timothy Sherwood, and Ryan Kastner. On the limits of leakage power reduction in caches. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, 2005.

[73] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. Improving GPU performance via large warps and two-level warp scheduling. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011.

[74] S. Nassif, K. Bernstein, D.J. Frank, A. Gattiker, W. Haensch, B.L. Ji, E. Nowak, D. Pearson, and N.J. Rohrer. High performance CMOS variability in the 65nm regime and beyond. In Proceedings of the IEEE International Electron Devices Meeting, 2007.

[75] E. Pakbaznia and M. Pedram. Design and application of multimodal power gating structures. In Proceedings of the International Symposium on Quality Electronic Design, 2009.

[76] Sanghyun Park, A. Shrivastava, N. Dutt, A. Nicolau, Yunheung Paek, and E. Earlie. Register file power reduction using bypass sensitive compiler. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2008.

[77] Sanghyun Park, Aviral Shrivastava, Nikil Dutt, Alex Nicolau, Yunheung Paek, and Eugene Earlie. Bypass aware instruction scheduling for register file power reduction. In Proceedings of the 2006 ACM SIGPLAN/SIGBED Conference on Language, Compilers, and Tool Support for Embedded Systems, 2006.

[78] Kedar Patel, Tsu-Jae King Liu, and Costas J. Spanos. Gate line edge roughness model for estimation of FinFET performance variability. IEEE Transactions on Electron Devices, 2009.

[79] M. Qazi, M.E. Sinangil, and A.P. Chandrakasan. Challenges and directions for low-voltage SRAM. IEEE Design & Test of Computers, 2011.

[80] Minsoo Rhu and M. Erez. The dual-path execution model for efficient GPU control flow. In Proceedings of the IEEE 19th International Symposium on High Performance Computer Architecture, 2013.

[81] Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. A locality-aware memory hierarchy for energy-efficient GPU architectures. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013.

[82] Timothy G. Rogers, Daniel R. Johnson, Mike O'Connor, and Stephen W. Keckler. A variable warp size architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015.

[83] Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. Cache-conscious wavefront scheduling. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012.

[84] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. Leakage current mechanisms and leakage reduction techniques in deep-submicrometer CMOS circuits. Proceedings of the IEEE, 2003.

[85] S. Chen, Y. Wang, X. Lin, Q. Xie, and M. Pedram. Performance prediction for multiple-threshold 7nm-FinFET-based circuits operating in multiple voltage regimes using a cross-layer simulation framework. In Proceedings of the IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference, 2014.

[86] Martin Lilleeng Sætra and André Rigland Brodtkorb. Shallow water simulations on multiple GPUs. In Proceedings of the 10th International Conference on Applied Parallel and Scientific Computing, 2012.

[87] S.E. Schuster. Multiple word/bit line redundancy for semiconductor memories. IEEE Journal of Solid-State Circuits, 1978.
207 [88] Claudio Scordino and Giuseppe Lipari. Using resource reservation techniques for power-aware scheduling. In Proceedings of the 4th ACM international conference on Embedded software, 2004. [89] V . Sharma. SRAM Bit Cell Optimization. Springer Science and Business Media, 2013. [90] Po-Han Wang, Chia-Lin Yang, Yen-Ming Chen, and Yu-Jung Cheng. Power gating strategies on gpus. ACM Transactions on Architecture and Code Optimimization, 2011. [91] Xingsheng Wang, A.R. Brown, Binjie Cheng, and A. Asenov. Sta- tistical variability and reliability in nanoscale finfets. In Proceed- ings of IEEE International Electron Devices Meeting, 2011. [92] Xingsheng Wang, A.R. Brown, N. Idris, S. Markov, G. Roy, and A. Asenov. Statistical threshold-voltage variability in scaled de- cananometer bulk hkmg mosfets: A full-scale 3-d simulation scal- ing study. IEEE Transactions on Electron Devices, 2011. [93] Yue Wang, S. Roy, and N. Ranganathan. Run-time power-gating in caches of gpus for leakage energy savings. In Proceedings of the Design, Automation Test in Europe Conference Exhibition, 2012. [94] C.H. Wann, K. Noda, T. Tanaka, M. Yoshida, and Chenming Hu. A comparative study of advanced mosfet concepts. IEEE Transac- tions on Electron Devices, 1996. [95] Daniel Wong and Murali Annavaram. Knightshift: Scaling the energy proportionality wall through server-level heterogeneity. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012. [96] Bin; Hsu Meichun Wu, Ren; Zhang. Gpu-accelerated large scale analytics, 2009. [97] Qing Xie, Xue Lin, Yanzhi Wang, M.J. Dousti, A. Shafaei, M. Ghasemi-Gol, and M. Pedram. 5nm finfet standard cell library 208 optimization and circuit synthesis in near-and super-threshold volt- age regimes. In Proceedings of IEEE Computer Society Annual Symposium on VLSI, 2014. [98] Qiumin Xu and Murali Annavaram. Pats: Pattern aware schedul- ing and power gating for gpgpus. In Proceedings of the 23rd In- ternational Conference on Parallel Architectures and Compilation, 2014. [99] M. Yoshimoto, K. Anami, H. Shinohara, T. Yoshihara, H. Takagi, S. Nagao, S. Kayano, and T. Nakano. A divided word-line structure in the static ram and its application to a 64k full cmos ram. IEEE Journal of Solid-State Circuits, 1983. [100] Bin Yu, L. Chang, S. Ahmed, Haihong Wang, S. Bell, Chih-Yuh Yang, C. Tabery, Chau Ho, Qi Xiang, Tsu-Jae King, J. Bokor, Chenming Hu, Ming-Ren Lin, and D. Kyser. Finfet scaling to 10 nm gate length. In Proceedings of the International Electron Devices Meeting, 2002. [101] Wing-kei S. Yu, Ruirui Huang, Sarah Q. Xu, Sung-En Wang, Ed- win Kan, and G. Edward Suh. Sram-dram hybrid memory with ap- plications to efficient register files in fine-grained multi-threading. In Proceedings of the 38th annual International Symposium on Computer Architecture, 2011. [102] Siyu Yue, Lizhong Chen, Di Zhu, Timothy M. Pinkston, and Mas- soud Pedram. Smart butterfly: Reducing static power dissipation of network-on-chip with core-state-awareness. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design, 2014. [103] Bo Zhai, R.G. Dreslinski, D. Blaauw, T. Mudge, and D. Sylvester. Energy efficient near-threshold chip multi-processing. In Proceed- ings of ACM/IEEE International Symposium on Low Power Elec- tronics and Design, 2007. 209 [104] Jishen Zhao and Yuan Xie. Optimizing bandwidth and power of graphics memory with hybrid memory technologies and adaptive data migration. 
In Proceedings of IEEE/ACM International Con- ference on Computer-Aided Design, 2012. [105] Shi-Ting Zhou, S. Katariya, H. Ghasemi, S. Draper, and Nam Sung Kim. Minimizing total area of low-voltage sram arrays through joint optimization of cell size, redundancy, and ecc. In Proceedings of IEEE International Conference on Computer Design, 2010. 210
Conceptually similar
Resource underutilization exploitation for power efficient and reliable throughput processor
Enabling energy efficient and secure execution of concurrent kernels on graphics processing units
Energy proportional computing for multi-core and many-core servers
Efficient memory coherence and consistency support for enabling data sharing in GPUs
Thermal modeling and control in mobile and server systems
SLA-based, energy-efficient resource management in cloud computing systems
Energy efficient design and provisioning of hardware resources in modern computing systems
Architectural innovations for mitigating data movement cost on graphics processing units and storage systems
Variation-aware circuit and chip level power optimization in digital VLSI systems
Design of low-power and resource-efficient on-chip networks
Performance improvement and power reduction techniques of on-chip networks
Energy-efficient computing: Datacenters, mobile devices, and mobile clouds
A framework for runtime energy efficient mobile execution
Hardware techniques for efficient communication in transactional systems
Thermal management in microprocessor chips and dynamic backlight control in liquid crystal displays
Improving the efficiency of conflict detection and contention management in hardware transactional memory systems
Efficient processing of streaming data in multi-user and multi-abstraction workflows
Compiler and runtime support for hybrid arithmetic and logic processing of neural networks
Improving efficiency to advance resilient computing
Low cost fault handling mechanisms for multicore and many-core systems
Asset Metadata
Creator: Abdel-Majeed, Mohammad (author)
Core Title: Demand based techniques to improve the energy efficiency of the execution units and the register file in general purpose graphics processing units
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Engineering
Publication Date: 04/20/2016
Defense Date: 02/04/2016
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: dynamic power, execution units, GPUs, graphics processing units, leakage power, OAI-PMH Harvest, power gating, register file, SIMT execution model, technology scaling
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Annavaram, Murali (committee chair), Nakano, Aiichiro (committee member), Pedram, Massoud (committee member)
Creator Email: abdelmaj@usc.edu, mhm.rajab@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-236507
Unique identifier: UC11276918
Identifier: etd-AbdelMajee-4315.pdf (filename), usctheses-c40-236507 (legacy record id)
Legacy Identifier: etd-AbdelMajee-4315.pdf
Dmrecord: 236507
Document Type: Dissertation
Rights: Abdel-Majeed, Mohammad
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA