RESOURCE UNDERUTILIZATION EXPLOITATION
FOR POWER EFFICIENT AND RELIABLE THROUGHPUT PROCESSOR
by
Hyeran Jeon
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
August 2015
Copyright 2015 Hyeran Jeon
Abstract
The continuing march of Moore’s law, in spite of many prior dire predictions, enables
chip designs with tens of billions of transistors today. But as Dennard scaling has irrefutably slowed, power consumption has become the first-order design constraint. Further-
more, with device scaling, reliability has also come to the forefront of design consid-
erations. To avoid excessive power consumption, the chip industry has shifted away from high-performance single-threaded designs to high-throughput multi-threaded designs. Nowhere is this design trend more starkly visible than in Graphics Processing Unit (GPU) design. GPUs are provisioned with hundreds of execution units and megabytes of register file to run thousands of threads concurrently. Their high throughput and excellent performance per watt have attracted efforts to port general purpose applications to run
on GPUs. Hence, a new computing paradigm called general purpose computing on
GPUs (GPGPU computing) has emerged. When GPUs execute general purpose code
with irregular parallelism, the massive on-chip resources available for concurrent thread
execution become underutilized. This dissertation presents two mechanisms that exploit
the resource underutilization for improving power efficiency and reliability.
The first mechanism proposes register file virtualization. This approach is motivated by
the observation that at any given instant during an application's execution, only a fraction
of the total allocated registers carry live data. By eagerly deallocating registers with dead
data, these registers can then be reassigned to new threads. Our scheme takes advantage
of register liveness information to allow a flexible mapping between architected registers
and their corresponding physical register allocations. Register virtualization tackles the inefficiency of the existing GPU register management method, which is the root cause of both wasted power and imbalanced register wearout. By exploring different mapping algorithms, register virtualization can improve power efficiency or improve GPU reliability. Our results show that register virtualization effectively reduces register demand and the register wearout imbalance.
Inspired by the reduced register file demand when using register virtualization, we also propose a more aggressive mechanism, GPU-Shrink, that under-provisions the register file by as much as 50% of the current GPU register file size. GPU-Shrink guarantees deadlock-free application execution with a slightly modified warp scheduler. The new warp scheduler reserves the minimum number of available registers needed to guarantee the progress of at least one thread block within an application. Our results show that GPU-Shrink effectively reduces the register file's dynamic and static power with negligible performance overhead.
The second mechanism exploits execution unit underutilization to improve GPU relia-
bility. Due to branch and memory divergence, several execution lanes in a GPU are left
idle. We propose Warped-DMR, which reuses these idle lanes to verify the execution on the active lanes. Dual modular redundancy (DMR) has long been used for execution verification in CPUs. However, unlike traditional DMR, which adds a dedicated checker core for each core to be verified, Warped-DMR repurposes idle execution lanes for opportunistic execution verification. Hence, Warped-DMR needs zero extra execution lanes. Our results show that Warped-DMR can verify almost all instructions' execution without significant performance or power overhead.
To my family..
Acknowledgements
First and foremost, I express my sincere gratitude to my advisor, Professor Murali Annavaram. He has always been a great mentor for research, career, and even life. I especially appreciate that he gave me the freedom to explore new ideas at my own pace. He helped me see the big picture whenever I got stuck on details. His keen analysis always enriched the ideas. As I finish my Ph.D., I hope to be an advisor like him to my future students. I also thank Sok for her warm greetings and delightful conversations at the house parties.
I would like to thank Dr. Gabriel Loh, who led me to this journey of study. Before taking his advanced architecture course as a master's student, I did not know that I would want to pursue a doctoral degree. Every single lecture of his class was full of interesting concepts and ideas, which led to my decision to study computer architecture further. During my doctoral study, I was lucky to have him as my internship mentor and as a member of my dissertation committee. His enthusiastic attitude toward research and life inspired me a lot.
I am grateful to my dissertation committee members, Professor Sandeep Gupta and Professor William Halfond. They provided valuable feedback on various aspects of my research and helped enrich the content of my dissertation. I also thank Professor Pinkston for his persistent support as a member of my screening and qualifying exam committees, and as a reference. Without his help, I would not have been able to finish this journey.
I also want to thank my former advisor, Professor Viktor Prasanna. During the first two years of my doctoral study, I learned from him how to approach and clearly describe a research problem from an abstract view. I also thank Professor Monte Ung, who always greeted and encouraged me with a warm smile. I am also grateful to Professor Nam Sung Kim. He was always willing to give advice on my study and future career and to share his experiences whenever I met him at conferences. He also enriched the last research idea of my doctoral study with valuable feedback.
I am grateful to my master's thesis advisor, Professor Sung Woo Chung. I learned how to write a technically strong paper when I wrote my very first paper with him. Without his support, I could not have begun this journey.
I would like to thank my internship mentors, Dr. Kaoutar Maghraoui and Dr. Gokul Kandiraju. I was lucky to have the opportunity to work on a solid-state disk simulator with them. I especially appreciate their willingness to serve as my references. With their help, I could find the dream job that I will pursue in the next phase of my life. I was also fortunate to meet Dr. Jeonghee Shin during my internship at IBM Research. During that summer, I often dragged her out for coffee breaks, and she always welcomed me and did not hesitate to share her experiences in both research and career.
I thank the members of the SCIP group, SMART group, and P group: Kumar, Daniel, Qiumin, Gunjae, Mehrtash, Mohammad, Waleed, Jinho, Krishna, Sangwook, Bardia, Melina, Abdulaziz, Lizhong, Ruisheng, Yinglong, Edward, Qingbo, Nam, Lucia, Thilan, Mike, Qunzhi, and Hoang. They were all great friends, teachers, mentors, and collaborators for me. I cannot imagine my past six years without the times I laughed and discussed ideas with them. I also thank my Korean friends at USC: Jungyeon, Dr. Son, Woojoo, Joongheon, Jangwon, Jaeyong, and the late Jinhee. Talking and drinking with them was a great source of energy that got me through tough times during my journey.
I cannot thank my lifelong friends in Korea enough. Junghwa, Hyeonjoo, and Hyun (in Canada) have been my best friends and great supporters with whom I can even share my frustrations. Without their encouragement and humor, which always transformed a stressful situation into a funny one, I could not have finished this journey. I thank my childhood friends Soojung, Bokyung, and Yurim. After almost thirty years of friendship, I consider them part of my family. Their enthusiasm for study and lofty ideals always inspire me and inflame my passion for learning. I would like to thank Suyoon and Hyeyeon for their abiding friendship. Suyoon has been a great mentor for both life and career since I first met her at Samsung. Her encouragement was one of the main sources of strength that powered me through to the end of this journey.
Last but not least, I owe my deepest gratitude to my parents, Kwangsoo Jeon and Jeyoung Sohn, my sisters Hyejin and Hyekyoung, and my brother Shin. I always feel so proud to be part of this family of amazing people. Words are never enough to express how much I appreciate their love and trust. I cannot thank my husband, Sangwon Lee, enough for his bottomless consideration, patience, and love. Without his humor and positive attitude, my Ph.D. life would have been very different. I am also grateful to my in-laws, Soonkyu Choi, Jonghoon Shin, Sookyoung Min, Youngsoon Kim, and Jaegeol Lee.
Table of Contents
Abstract ii
Dedication iv
Acknowledgements v
List of Figures xi
List of Tables xiii
1 Introduction 1
1.1 The Problem: The Quest for Power Efficiency and Reliability . . . . . . 1
1.2 A Brief Introduction to GPU Computing . . . . . . . . . . . . . . . . . 4
1.2.1 CUDA computing . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 GPU architecture . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.3 GPU register file . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4 GPU pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Register file . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Execution unit . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.6 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Register File Virtualization: Underutilization Exploitation for Register File 16
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Motivational Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.1 GPU register file underutilization . . . . . . . . . . . . . . . . 19
2.2.2 Imbalanced register accesses . . . . . . . . . . . . . . . . . . . 20
2.3 Register Usage Patterns in GPU . . . . . . . . . . . . . . . . . . . . . 22
2.4 Register Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5 Compiler and Architectural Support . . . . . . . . . . . . . . . . . . . 27
2.5.1 Register lifetime analysis . . . . . . . . . . . . . . . . . . . . . 27
2.5.2 Per-instruction release flag generation . . . . . . . . . . . . . . 29
2.5.3 Per-branch release flag generation . . . . . . . . . . . . . . . . 30
2.5.4 Renaming table . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5.5 Flag instruction decoding . . . . . . . . . . . . . . . . . . . . . 35
2.6 A Register Allocation Example . . . . . . . . . . . . . . . . . . . . . . 37
2.7 Use-Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.7.1 Static power reduction . . . . . . . . . . . . . . . . . . . . . . 39
2.7.2 Wear-leveling of register usage . . . . . . . . . . . . . . . . . . 42
2.8 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.8.1 Register size savings . . . . . . . . . . . . . . . . . . . . . . . 44
2.8.2 Static power saving with power gating . . . . . . . . . . . . . . 45
2.8.3 Static and dynamic code increase . . . . . . . . . . . . . . . . 48
2.8.4 Renaming table size . . . . . . . . . . . . . . . . . . . . . . . 48
2.8.5 Wear leveling . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.8.6 Comparison with hierarchical register file approach . . . . . . . 51
2.9 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.10 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3 GPU-Shrink: A GPU Design With Half-Sized Register File Using Register
Virtualization 57
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.2 Register Throttling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3 Hardware Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.4.1 Register file utilization . . . . . . . . . . . . . . . . . . . . . . 64
3.4.2 Performance overhead . . . . . . . . . . . . . . . . . . . . . . 65
3.4.3 Energy savings . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4 Warped-DMR: Underutilization Exploitation for Execution Units 72
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.1.1 Exploiting opportunity . . . . . . . . . . . . . . . . . . . . . . 74
4.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.2.1 Underutilization of GPU resources . . . . . . . . . . . . . . . . 76
4.3 Warped-DMR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.1 Intra-warp DMR . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 Inter-warp DMR . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.3.3 Error coverage . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.4 Advantages of Warped-DMR . . . . . . . . . . . . . . . . . . . 83
4.4 Architectural Support for Warped-DMR . . . . . . . . . . . . . . . . . 85
4.4.1 Register forwarding unit . . . . . . . . . . . . . . . . . . . . . 85
4.4.2 Thread-Core mapping . . . . . . . . . . . . . . . . . . . . . . 87
4.4.3 ReplayQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4.4 Effective size of ReplayQ . . . . . . . . . . . . . . . . . . . . 92
4.5 Enhancing Warped-DMR for 100% Error Detection . . . . . . . . . . . 93
4.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.6.1 Settings and workloads . . . . . . . . . . . . . . . . . . . . . . 94
4.6.2 Error coverage and overhead . . . . . . . . . . . . . . . . . . . 96
4.6.3 Enhanced Warped-DMR . . . . . . . . . . . . . . . . . . . . . 98
4.6.4 Power consumption . . . . . . . . . . . . . . . . . . . . . . . . 99
4.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5 Conclusion 105
Reference List 109
List of Figures
1.1 Transistor and core count increase alongside GPGPU evolution . . . . . 2
1.2 (a) Power consumption under max load [1] and (b) estimated FIT . . . . 3
1.3 Warp Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 GPU Chip architecture and a SIMT cluster . . . . . . . . . . . . . . . . 6
1.5 GPU register file . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Sub-array structure of register file . . . . . . . . . . . . . . . . . . . . 8
1.7 GPU Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Fraction of live registers among compiler reserved registers captured
during the execution (X-axis: cycle, Y-axis: utilization(%)) . . . . . . . 19
2.2 Access frequency difference between most accessed register and least
accessed register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Access intensity difference across the register cells in a SM (X-axis:
Registers per warp, Y-axis: Warps) . . . . . . . . . . . . . . . . . . . . 21
2.4 (a) Register accesses during the execution time and (b) The idea of reg-
ister sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Register lifetime analysis of CUDA SDK matrixMul . . . . . . . . . . 24
2.6 Register release time w.r.t. lifetime analysis . . . . . . . . . . . . . . . 28
2.7 Two release flag instructions . . . . . . . . . . . . . . . . . . . . . . . 31
2.8 Register renaming table and release flag cache . . . . . . . . . . . . . . 34
2.9 An example of the proposed register allocation: 1) physical register
availability flag is looked up to find an available physical register, 2)
renaming table is updated, and then 3) the reserved physical register is
written . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.10 Effect of register renaming: (a) without power gating (b) with power
gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.11 Leakage under various technologies (P: Planar, F: FinFET) . . . . . . . 40
2.12 Register allocation reduction . . . . . . . . . . . . . . . . . . . . . . . 45
2.13 (Left) static power reduction with power gating and register renaming.
(Right) dynamic power increase (%) due to renaming table accesses over
the register file access power. . . . . . . . . . . . . . . . . . . . . . . 45
2.14 Sensitivity on subarray wakeup latency . . . . . . . . . . . . . . . . . . 46
2.15 Static code increase and dynamic code increase w.r.t. # entries in a
release flag cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.16 Per SM renaming table size without constraints and normalized register
saving with 1KB constraint . . . . . . . . . . . . . . . . . . . . . . . . 47
2.17 Register access frequency variation when proposed register manage-
ment method is applied (X- and Y-axis are the same with Figure 2.3) . . 49
2.18 Register access frequency variation . . . . . . . . . . . . . . . . . . . . 50
2.19 Comparison with three level register file approach [2]. (a) Register allo-
cation reduction and (b) normalized static power of MRF when our
approach is used . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1 Power versus register file size . . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Number of warp level registers allocated when using register virtualization 59
3.3 An example GPGPU kernel code (matrixMul of NVIDIA CUDA SDK
3.0 [3]) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.4 Fraction of live registers in the under-provisioned (50%) register file
sub-arrays that are not power gated, captured during the execution (X-
axis: cycle, Y-axis: utilization(%)) . . . . . . . . . . . . . . . . . . . . 64
3.5 Performance degradation when using half-sized (64KB) register file . . 65
3.6 Total register file energy breakdown . . . . . . . . . . . . . . . . . . . 68
4.1 Execution time breakdown with respect to the number of active threads 74
4.2 Example of underutilization of homogeneous units and Intra-Warp DMR 77
4.3 Example of underutilization of heterogeneous units and Inter-Warp DMR 80
4.4 Execution time breakdown with respect to the instruction type . . . . . 84
4.5 Register Forwarding Unit and Comparator for Intra-Warp DMR . . . . 85
4.6 ReplayQ and Replay Checker for Inter-Warp DMR . . . . . . . . . . . 88
4.7 Two key factors to determine effective ReplayQ size . . . . . . . . . . . 92
4.8 Error coverage and Overhead of Warped-DMR . . . . . . . . . . . . . 97
4.9 Performance overhead when enhanced Warped-DMR is used with a 10-
entry ReplayQ. Error coverage is 100%. . . . . . . . . . . . . . . . . . 98
4.10 Normalized Energy Consumption in execution unit level and overall
device level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
List of Tables
2.1 Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.2 Register renaming table and register bank energy in 40nm technology . 43
4.1 Priority table of RFU MUXs . . . . . . . . . . . . . . . . . . . . . . . 85
4.2 Experimental Environment . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3 Simulation Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.4 Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Chapter 1
Introduction
1.1 The Problem: The Quest for Power Efficiency and
Reliability
As transistor density scales while the power budget becomes tighter, the chip industry is
moving away from high frequency, power hungry monolithic cores towards multi- and
even many-core chip multiprocessors (CMPs) with lower frequency cores. The graphics processing unit (GPU) is one of the most promising many-core architectures for power-efficient throughput computing. General purpose computing on GPUs (GPGPU computing) is a new paradigm in which GPUs process scientific applications that require general purpose computing capability in addition to traditional graphics applications. With hundreds of simple in-order cores that can run thousands of threads in parallel, GPUs deliver several teraflops of peak performance. This massive parallelism, combined with programmability, has made GPUs one of the most attractive choices in supercomputing centers.
Despite the performance benefit, there are two critical hurdles that current GPUs must
overcome. Traditional GPUs that mainly process graphics applications are known to be inherently fault tolerant [4] because a few pixel errors are acceptable if they are not perceivable by the human eye. However, reliability becomes a critical concern in GPGPU computing, which must produce accurate results from scientific application execution. Power efficiency is another challenge, given that GPUs typically run several thousand threads on hundreds of processing elements and use megabytes of register file to execute a kernel program.
[Figure 1.1 plots transistor count (billions) per GPU generation: GT200 Tesla (2008), 1.4B transistors, 240 cores; GF110 Fermi (2010), 3.0B transistors, 512 cores; GK110 Kepler (2013), 7.1B transistors, 2880 cores; Intel Ivy Bridge-EX (2014), 4.3B transistors; GM200 Maxwell (2015), 8B transistors, 3072 cores.]
Figure 1.1: Transistor and core count increase alongside GPGPU evolution
As shown in Figure 1.1, the number of transistors used in each generation of GPU increases almost exponentially. NVIDIA Kepler uses 7.1 billion transistors, which is almost double the transistor count of Intel's state-of-the-art Ivy Bridge, released two years after Kepler. The concern for excessive power consumption has curtailed the transistor count growth of new generation GPUs such as NVIDIA Maxwell.
Figure 1.2(a) shows the peak power consumption measured when the maximum computation load is assigned to three GPU models, Tesla, Fermi, and Kepler [1]. In spite of some innovations introduced to tackle power consumption, GPU power consumption is continuing to grow significantly.
[Figure 1.2(a) plots peak power at maximum load (200-500 W) for the three GPU models, with data from http://www.legitreviews.com/; (b) plots the estimated normalized failure in time (FIT/bit of technology node x transistor count) at the 65nm, 40nm, and 28nm nodes, using SEU FIT rates borrowed from Anand Dixit and Alan Wood, "The Impact of New Technology on Soft Error Rates", SELSE'11.]
Figure 1.2: (a) Power consumption under max load [1] and (b) estimated FIT
On the reliability front, Figure 1.2(b) shows the estimated failure in time (FIT) rate calculated by applying the single event upset (SEU) FIT rates of various technology nodes [5] to the three GPU models, Tesla, Fermi, and Kepler, at three different technology nodes. Newer generation GPUs that have more transistors show a higher FIT rate.
The staggering growth in transistor count is typically used to scale the execution
resources in GPUs, namely the number of execution units and the size of the register
file. This thesis makes the observation that not all the available resources are fully uti-
lized during an application’s run time. For instance, as we will show later in Chapter 4,
nearly 32% of the execution lanes are idle during a typical benchmark execution on a
GPU. Thus, the main goal of this thesis is to exploit the resource underutilization to
improve power efficiency and/or reliability in GPUs.
Among the many components of a GPGPU, this dissertation targets the two largest (area-wise) components, the execution units and the register file, to provide solutions for reliability and power efficiency.
1.2 A Brief Introduction to GPU Computing
Before diving into the details of the proposed solutions, we present an overview of the
GPU architecture.
1.2.1 CUDA computing
[Figure 1.3 illustrates the warp scheduler/instruction dispatcher interleaving instructions from different warps over time, e.g., warp 8 instruction 11, warp 2 instruction 42, warp 4 instruction 95, warp 8 instruction 12, warp 4 instruction 96, warp 2 instruction 43, and so on.]
Figure 1.3: Warp Scheduling
Compute Unified Device Architecture (CUDA) is the parallel computing platform developed by NVIDIA [6]. A CUDA application's main function is called a kernel. Kernel functions are invoked by the host CPU and launched on the GPU. Communication between the CPU and GPU is performed via the PCI-Express v2 bus. A kernel function is run by several independent cooperative thread arrays (CTAs). Each CTA is executed on a streaming multiprocessor (SM) by running multiple warps. A warp is a group of threads that are scheduled together. In Fermi, up to 48 warps (or 1536 threads) and up to eight CTAs are concurrently executed by each SM. The details of the SM architecture are described in the following subsection.
The CUDA platform provides a large but slow off-chip global memory that can be accessed by all CTAs and the host CPU, while providing a small but fast on-chip shared memory that is shared only among threads in the same CTA. All CTAs may communicate with each other through the global memory, but they can only reliably synchronize with each other at the end of the kernel execution. Warps within a CTA can communicate with each other through the shared memory. CUDA provides a primitive, __syncthreads(), that enables the warps within a CTA to explicitly synchronize.
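To make these terms concrete, here is a minimal kernel sketch (not taken from this dissertation; the kernel name blockSum and the 256-thread CTA size are illustrative assumptions) in which each CTA cooperates through on-chip shared memory and synchronizes with __syncthreads():

__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float buf[256];                 // on-chip shared memory, private to this CTA
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;   // global element index for this thread
    buf[tid] = (gid < n) ? in[gid] : 0.0f;     // each thread loads one element from global memory
    __syncthreads();                           // all warps in the CTA wait here
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];     // tree reduction within the CTA
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];              // one partial sum per CTA, written back to global memory
}
// Host-side launch, one 256-thread CTA per 256 elements:
// blockSum<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

On Fermi-class hardware, each such 256-thread CTA corresponds to eight 32-thread warps.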
Warps are scheduled in an interleaved fashion within an SM, controlled by one or more warp schedulers. As illustrated in Figure 1.3, threads in a warp execute the same instruction in lockstep, and different warps are scheduled every cycle. This mechanism helps hide memory access latency. For example, in Figure 1.3 warp 8 encounters a long memory operation. While warp 8 is waiting for data from off-chip memory, many other warps, such as warps 2 and 4 in the figure, can proceed with their execution on the other execution units because they have no dependency on warp 8. The number of threads in a CTA and the number of CTAs per kernel are determined by the application developer, while the warp size is specified by the hardware. For example, the warp size of NVIDIA GPUs is 32 [7]. The details of the warp scheduling policy are explained in a later subsection.
1.2.2 GPU architecture
[Figure 1.4 depicts an SM composed of 8 SIMT clusters; each SIMT cluster contains SPs, SFUs, and LD/ST units, a register file of 4x128-bit banks (1R1W) with operand buffering, and access to the shared memory.]
Figure 1.4: GPU Chip architecture and a SIMT cluster
The GPU architecture varies across vendors and models. In this thesis, we use the basic architecture of NVIDIA's Fermi [7] as our GPU model. A GPU consists of a scalable number of SMs, each comprising shader processor (SP) cores for arithmetic operations, LD/ST units for memory operations, special function units (SFUs) for instructions such as sine, cosine, and square root, several register file banks, and shared memory. Figure 1.4 shows the internals of one SM. The shared memory is accessible by all the SPs within an SM, and part of this memory is configurable as an L1 cache.
As explained in the previous subsection, threads in a warp execute the same instruction in lockstep. The threads in a warp all share one program counter (PC) but access different data operands. Such an execution approach is called Single Instruction Multiple Threads (SIMT) execution. Each thread in a warp may use a different register bank within an SM to access its data operands. Each individual thread's execution slot is referred to as a SIMT lane. The execution of threads within a warp is controlled by an active mask, which has as many bits as the number of threads in a warp. If a bit is set, the corresponding thread is allowed to execute the instruction pointed to by the PC; otherwise, the associated thread does not execute the current instruction. Such masking allows the GPU to execute conditional branches, even though all diverging paths must be sequentially traversed by all the threads in a warp.
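The following host-side sketch is a hedged functional model of this masking (it ignores the hardware's actual reconvergence stack; the function and parameter names are made up for illustration):

#include <bitset>

constexpr int WARP_SIZE = 32;
using ActiveMask = std::bitset<WARP_SIZE>;

// Execute a divergent branch under a mask: lanes whose predicate is true run the
// "then" path while the rest sit idle, then the roles are swapped for the "else" path.
void executeDivergentBranch(const ActiveMask &current, const ActiveMask &takenPred,
                            void (*thenPath)(int lane), void (*elsePath)(int lane))
{
    ActiveMask thenMask = current & takenPred;    // lanes taking the branch
    ActiveMask elseMask = current & ~takenPred;   // lanes falling through
    for (int lane = 0; lane < WARP_SIZE; ++lane)  // first pass: elseMask lanes are idle
        if (thenMask.test(lane)) thenPath(lane);
    for (int lane = 0; lane < WARP_SIZE; ++lane)  // second pass: thenMask lanes are idle
        if (elseMask.test(lane)) elsePath(lane);
    // Execution reconverges afterwards under the original mask `current`.
}

The lanes skipped in each pass are precisely the intra-warp underutilization that Chapter 4 later repurposes for redundant execution.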
1.2.3 GPU register file
[Figure 1.5 shows how each 128-bit register bank entry packs the same-named register of four threads (e.g., th0.r0, th1.r0, th2.r0, th3.r0), and how each bank is partitioned into per-warp register regions (Warp 0's registers, Warp 1's registers, ..., Warp N's registers) behind the operand buffering logic feeding the SPs, SFUs, and LD/ST units.]
Figure 1.5: GPU register file
Within each SM, four SIMT lanes make up a SIMT cluster [8]; thus each SM has eight SIMT clusters. As shown in Figure 1.4, each SIMT cluster has 4 SPs and 4 register file banks. Each entry of a register bank is 128 bits wide and contains four 32-bit registers, each associated with one SIMT lane [8]. As each entry of the register bank consists of 4 registers having the same name but associated with 4 different threads, loading an entry from a register bank can feed all 4 SIMT lanes at once. Common instructions that read 2 operands and write 1 result (2R1W), as well as special instructions like MULADD that read 3 operands and write 1 result (3R1W), can usually access the four register banks to read their input operands and write output data concurrently without register port stalls. However, if an instruction fetches operands from
the same bank, the operands cannot be fetched concurrently. To handle bank conflicts,
GPUs use operand buffering logic that hides the latency of multi-cycle register fetch.
Each register bank is shared by multiple warps by allocating different register regions
to each warp as illustrated in Figure 1.5. Each warp is assigned its own set of regis-
ters from the beginning of the program execution and the region is released when the
corresponding CTA terminates.
[Figure 1.6 shows one 128-bit, 1R1W, 4KB register bank split into four 128-bit, 1R1W, 1KB subarrays behind the operand buffering logic.]
Figure 1.6: Sub-array structure of register file
To limit access latency as well as dynamic power, GPU register files are typically partitioned into several small subarrays, as illustrated in Figure 1.6 [9]. The 128KB register file in Fermi is partitioned into 32 banks of 4KB each [9]. To curtail the performance overhead of bank conflicts, GPUs use operand collector buffering logic that hides the latency of multi-cycle register fetches. Also, compilers are designed to reduce register bank conflicts by assigning to each instruction registers that are expected to be allocated in different register banks [10, 11].
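As an illustration only, the sketch below shows one plausible way an architected register access could be located in this banked, per-warp-partitioned organization; the actual hardware mapping is not public, so every detail here (the modulo-4 interleaving, the region layout, the struct and function names) is an assumption:

// Host-side sketch: locate (warp, architected register, lane) in a SIMT cluster's
// register file of 4 banks, where each 128-bit bank row packs the same-named
// register for the cluster's 4 SIMT lanes and each bank is statically carved
// into per-warp regions as in Figure 1.5.
struct RegLocation {
    int bank;   // bank index within the SIMT cluster (0..3)
    int entry;  // 128-bit row inside that bank
    int slot;   // 32-bit slot within the row, i.e., the SIMT lane (0..3)
};

RegLocation locateRegister(int warpId, int archReg, int laneInCluster, int regsPerWarp)
{
    RegLocation loc;
    loc.bank = archReg % 4;                          // interleave consecutive registers across banks
    int rowInWarpRegion = archReg / 4;               // row of this register within its warp's region
    int rowsPerWarp     = (regsPerWarp + 3) / 4;     // size of each warp's region in one bank
    loc.entry = warpId * rowsPerWarp + rowInWarpRegion;
    loc.slot  = laneInCluster;                       // one 32-bit slot per lane in the cluster
    return loc;
}

Under this assumed mapping, two source operands collide in the same bank exactly when their register indices are congruent modulo 4, which is the bank-conflict case the operand collector hides.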
1.2.4 GPU pipeline
[Figure 1.7 sketches the Fermi pipeline: the instruction cache feeds Fetch/Decode and a per-warp i-buffer (valid bit, warp instruction, ready bit); Warp Sched./Issue selects a ready warp; an arbitrator and operand collector (e.g., entries such as "w0 add" and "w3 mul" waiting on registers r3 and r7) read the four register banks (BANK0-BANK3); ready instructions dispatch to the SP/ALU, SFU, and LD/ST units.]
Figure 1.7: GPU Pipeline
Figure 1.7 shows the pipeline of an NVIDIA Fermi GPU. The Fetch/Decode stage uses an instruction buffer to determine the next instruction to fetch. The instruction buffer maintains the decoded instructions of all the active warps. Scoreboard logic checks whether each instruction has any unresolved read-after-write (RAW) or write-after-write (WAW) dependencies. If an instruction's dependencies are resolved, the instruction becomes ready to be issued, and the next instruction of the associated warp is fetched from the instruction cache.
Among the warps that have instructions ready to issue, a warp is selected and issued in the Warp Sched./Issue stage according to the warp scheduling algorithm. The most basic warp scheduling algorithm is round-robin, which schedules a different warp every cycle. The round-robin scheduler is known to be good for fairness. However, several studies have revealed that the round-robin scheduler may cause cache thrashing because a warp is scheduled again only after all the other warps have been scheduled [12]. Also, given that warps assigned to a kernel function are likely to execute the same code, when a warp issues a memory operation, the other warps tend to execute the same memory operations in the following cycles, which leads to memory congestion [8]. To resolve the round-robin scheduler's negative impact on memory performance, the two-level scheduler [8] was proposed. The two-level scheduler uses two warp queues, a ready queue and a pending queue. The scheduler only schedules the warps in the ready queue, in round-robin fashion. When a warp in the ready queue encounters a long memory operation, the warp is moved to the pending queue and a ready warp from the pending queue is enqueued into the ready queue. Warps stay in the pending queue while waiting for data from memory. This two-level scheduling effectively resolves the memory congestion issue by distributing memory operations across time. To preserve data locality, the cache-conscious scheduler [12] was proposed. The cache-conscious scheduler prioritizes warps that are likely to hit in the L1D cache. To do so, a locality scoring system tracks each warp's locality information in the L1D. If a warp's data is expected to be evicted soon, that warp is assigned a higher priority so that it can consume the data before eviction.
Once an instruction is issued, the Operand Collector stage collects the operand values for the instruction. An instruction can use up to three source operands in a GPU. If any of the operands reside in the same register bank, those registers must be read sequentially, so it may take multiple cycles to read all the operands for an instruction. Hence, instructions are buffered in the Operand Collector stage until all their operands have been read from the register file. Then, the ready instruction is dispatched to one of the three types of execution units: SP, LD/ST, or SFU.
1.3 Thesis Statement
Underutilized resources in GPUs can be repurposed or even eliminated to improve reliability in the execution units and power efficiency in the register file without compromising performance.
1.4 Solution
1.4.1 Register file
As described in the prior section, GPUs rely on a massive number of execution resources (SIMT lanes), supported by an extremely large register file, to achieve high thread-level parallelism. This huge register file is responsible for a large fraction of the GPU's total power. The conventional belief is that a large register file is inevitable for accommodating more parallel thread contexts, and that technology scaling makes it feasible to incorporate an ever increasing register file size. The first contribution of this thesis is to demonstrate that the register file size does not need to be increased to accommodate more thread contexts. We first characterize the useful lifetime of a register and show that register lifetimes vary drastically across the registers allocated to a kernel. While some registers are alive for the entire duration of the kernel execution, others have a short lifespan. We then propose register virtualization, which allows multiple warps to share physical registers, similar to virtualization in CPUs. Using register virtualization, we shrink the architected register
space to a smaller physical register space. By under-provisioning the physical register
11
file to be smaller than the architected register file we reduce dynamic and static power
consumption. We then develop a new register throttling mechanism to run applications
having high register usage demand that exceed the size of the under-provisioned regis-
ter file without any deadlock. Our evaluation shows that the applications successfully
run with a half-sized register file with negligible performance overhead and significant
register file energy reduction by using our proposed register file management scheme.
We also demonstrate that register virtualization resolves the imbalanced wearout problem in the GPU register file by slightly modifying the architected-to-physical register mapping policy. We measured the access imbalance across the registers in a GPU register file and show that only a few registers are excessively accessed while many others are rarely accessed. Under the current register management approach, when a CTA completes, a new CTA that belongs to the same kernel starts using the same architected registers, which are statically mapped to the same physical registers, thereby exacerbating wearout. By mapping architected registers to different physical registers at runtime, register virtualization automatically performs wear-leveling so that all the registers wear out evenly.
1.4.2 Execution unit
The second part of the thesis focuses on exploiting underutilization of GPU execu-
tion units (or SIMT lanes) for improving reliability. Commercial GPUs have already
started addressing reliability concerns. Recently, NVIDIA’s Fermi GPGPU added ECC
for memory components such as the register file, caches, and main memory [7]. As
described in Section 1.2, GPUs have hundreds of hardware thread contexts today and
in the near future, they will have thousands of contexts. Each thread context contains a
relatively simple processor pipeline with very minimal resources to support speculation,
if any. Hence the vast majority of the chip area is dedicated to execution units, such
as ALUs. In the presence of hundreds (or even thousands) of thread contexts, even a
tiny probability of a logic error in each thread context adds up to a high probability of errors at the chip level. Traditionally, logic errors have been detected through
dual-modular redundant (DMR) execution [13]. But replicating hundreds or thousands of execution units in GPUs for DMR execution is impractical due to the high area and power overhead. Therefore, this thesis proposes Warped-DMR, which repurposes underutilized SIMT lanes to verify the execution of active SIMT lanes without adding extra cores for verification purposes. Two types of underutilization are identified
in GPU computation, namely intra-warp underutilization and inter-warp underutilization. Intra-warp underutilization is caused by the lockstep execution of a warp: when a warp encounters divergent control flow, the threads within the warp are active only while the program counter is within their own control-flow path, so the SIMT lanes associated with the inactive threads sit idle. Inter-warp underutilization is caused by limitations in the scheduler feeding three different execution units. As explained in Section 1.2, GPUs have three different types of execution units: SPs, LD/ST units, and SFUs. All three types are fed by only one or two instruction dispatch units [7]. Hence, during any given cycle, the three execution unit types are rarely all active concurrently, which leaves execution units idle. Warped-DMR is designed to exploit both forms of underutilization to verify computation under the two scenarios. The evaluation results show that Warped-DMR verifies almost all instruction executions with zero additional SIMT lanes and a small power increase.
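As a hedged sketch of the intra-warp case (the pairing policy below is purely illustrative; Chapter 4 describes the actual register forwarding unit and thread-to-core mapping), idle lanes exposed by the active mask can be paired with active lanes as redundant checkers:

#include <array>
#include <bitset>

constexpr int WARP_SIZE = 32;

// For each active lane, pick an idle lane to redundantly execute the same
// operation; the two results are later compared to detect an execution error.
// Returns -1 for lanes that could not be paired (left for inter-warp DMR).
std::array<int, WARP_SIZE> pairCheckerLanes(const std::bitset<WARP_SIZE> &activeMask)
{
    std::array<int, WARP_SIZE> checker;
    checker.fill(-1);
    int idle = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        if (!activeMask.test(lane)) continue;              // only active lanes need a checker
        while (idle < WARP_SIZE && activeMask.test(idle))  // find the next unused idle lane
            ++idle;
        if (idle == WARP_SIZE) break;                      // no idle lane left this cycle
        checker[lane] = idle++;                            // idle lane mirrors this lane's work
    }
    return checker;
}

Comparing each active lane's result with its checker lane's result then detects an execution error without adding any SIMT lanes.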
1.5 Contributions
This dissertation provides a quantitative analysis of resource underutilization in GPU computing, along with precise reasons for that underutilization.
This dissertation presents an umbrella solution, namely register virtualization, that roots out the power and reliability concerns of the GPU register file. Register virtualization decouples architected registers from physical registers; an architected register is then mapped to a physical register according to the architected register's lifetime information. We first use data-flow analysis within the compiler to identify when a register's value is guaranteed to be dead. We then adapt existing software-hardware interfaces available in modern GPUs to convey the dead/live register information to the GPU register management hardware. The register management hardware releases the space assigned to a guaranteed-dead register and opportunistically re-allocates that register to other warps. By exploiting different architected-to-physical register mapping policies, we show that register virtualization can improve GPU power efficiency or GPU reliability.
Because register virtualization reduces physical register pressure, we exploit it to design a GPU with half the number of physical registers, while transparently allowing applications to use the full architected register space. This dissertation presents a simple warp scheduler, called GPU-Shrink, that handles the register shortage that may occur when under-provisioning the GPU register file. The combination of register virtualization and GPU-Shrink realizes deadlock-free GPU register file under-provisioning. As the total register file size in one GPU chip is comparable to that of a shared last-level cache in a multi-core CPU [14], halving the register file has significant economic and yield impact. For instance, it is well known that chip yield is inversely correlated with chip area, while chip cost is directly proportional to it [15]. We present a detailed quantitative evaluation to show that our design significantly reduces the total register file energy and the degree of register access imbalance.
This dissertation presents a method, called Warped-DMR, that tackles the reliability problem in the GPGPU execution units. We define two types of execution unit underutilization and propose a DMR scheme that exploits each of them. Warped-DMR demonstrates that underutilized resources can be effectively repurposed to improve GPU reliability. Without any additional SIMT lanes, Warped-DMR effectively verifies over 90% of all instruction executions.
1.6 Dissertation Organization
The remainder of this dissertation is organized as follows. Chapters 2 and 3 describe the register file underutilization exploitation mechanisms for power-efficient and reliable register file design. Chapter 4 describes the execution unit underutilization exploitation mechanism for reliable execution unit design. We conclude in Chapter 5.
Chapter 2
Register File Virtualization:
Underutilization Exploitation for
Register File
2.1 Introduction
As described in Chapter 1, GPUs provide a massive register file to quickly switch between thousands of thread contexts. To run thousands of threads concurrently, GPUs would need to save and restore the architectural state of threads on each thread switch. Since GPUs can potentially switch between threads every cycle, they can ill afford to save thread context to off-chip memory or even to an on-die cache. Instead, GPUs have a register file that is even bigger than the second-level cache on traditional CMPs. The trend in
GPU design indicates that, as technology progresses, more thread contexts will be supported in future designs. For instance, the size of the register file per streaming multiprocessor (SM) doubled from the Fermi to the Kepler architecture [16]. The GPU register file is the largest SRAM structure on the die and the third most power-hungry structure [17].
In a GPU, each warp has its own set of dedicated architected registers, indexed by the warp id, and each architected register has a corresponding physical register allocated in
the register file [18]. Usually, a kernel that is marked for execution on a GPU is compiled into thousands of threads, which are then grouped into several cooperative thread arrays (CTAs). Each CTA is further sub-divided into dozens of warps, where a warp typically comprises 32 threads. In many cases the number of threads that can be grouped per CTA is limited by the number of architected registers needed by each thread. Once a kernel is launched on the GPU, all the registers used by all the warps within a CTA are pre-allocated, and those registers are not released until all the warps within the CTA complete. Therefore, all the warps within a CTA can read and write their own register values without interfering with other warps. This register management policy avoids resource contention among the warps. However, we show that this simple management policy leads to register file usage inefficiencies.
As each warp occupies its own register space during the entire CTA execution duration,
the register file can be significantly underutilized during the execution. We measured
the lifetime of a register, defined as the time from when a new value is written into
the register until the last use of a register value is complete, and identified that register
lifetimes vary significantly across various registers that are allocated to a kernel. Thus
the number of live registers (the registers having a value that is going to be read by
at least one future instruction) at any point in time is much less than the total number
of registers used in that kernel. In addition to variable register lifetimes, a second problem is how CTAs are executed in current GPUs. Even though some warps finish their execution early and never use their registers again, the register space allocated to a CTA cannot be used by other CTAs until all the warps in the CTA finish their work. Note that warps' execution times vary significantly, especially under state-of-the-art two-level schedulers, because warps are scheduled across two different queues. Such a waste of register resources unnecessarily consumes leakage power without contributing anything to program performance.
A second and growing concern for large SRAM structures is wearout caused by excessive usage [19–22]. We measured the access imbalance across the registers in a GPU register file and show that only a few registers are excessively accessed, particularly registers R0 to R4, while many other registers are rarely accessed. Under the current register management approach, when a CTA completes, the new CTA that belongs to the same kernel starts using the same architected registers, which are statically mapped to the same physical registers, thereby exacerbating wearout.
Rather than solving each of these problems in isolation, we believe their root cause is the existing register management approach in GPUs. In this chapter, we propose a novel register file management method, named Register Virtualization, that addresses these register usage inefficiencies. Register Virtualization decouples architected registers from physical registers by mapping them dynamically. We first use data-flow analysis within a compiler to identify when a register's value is guaranteed to be dead. We then adapt existing software-hardware interfaces available in modern GPUs to convey the dead/live register information to the GPU register management hardware. The register management hardware then releases the space assigned to a guaranteed-dead register and opportunistically re-allocates that register to other warps. An architected register is mapped to one of the available physical register slots only when a new value is written to the register, unlike the baseline register management scheme, which maps all architected registers to physical registers statically at CTA launch. The proposed scheme thus reduces the active register file size significantly,
[Figure 2.1 panels: (a) MatrixMul, (b) Reduction, (c) VectorAdd, (d) Blacksc., (e) LPS, (f) BackProp, (g) Hotspot, (h) Gaussian; each plots live-register utilization (%) over a 10K-cycle window.]
Figure 2.1: Fraction of live registers among compiler reserved registers captured during
the execution (X-axis: cycle, Y-axis: utilization(%))
while the inactive registers can be completely power gated to reduce static power. Fur-
thermore, mapping architected registers to different physical registers at runtime auto-
matically provides us the ability to perform wear-leveling so that all the registers can be
evenly worn out.
2.2 Motivational Data
2.2.1 GPU register file underutilization
The GPU register file is not fully utilized in some benchmarks due to limited parallelism or a small code footprint. For instance, in prior work [23, 24] the authors showed that nearly 46% of the available registers in a GPU were not allocated for any computation by the compiler. However, what we observe in this study is that, even among the registers that are allocated, only a fraction are live at any given time. We
define a live register as a register that stores a value that may be consumed by at least one future instruction before the end of program execution. Figure 2.1 shows the fraction of live registers over all the compiler-allocated registers for a sample 10K-cycle execution window for a representative set of eight applications (more experimental details are presented in Section 3.4). The X-axis denotes time in cycles and the Y-axis is the fraction of all the allocated registers that are alive at a given time. Except for VectorAdd, the remaining seven applications barely use half of the allocated registers to carry live data. In the case of VectorAdd, 100% of the allocated registers are live around the 2K-cycle mark due to the short program code and the relatively small number of registers used. Note that while this data is shown for a 10K-cycle execution window for clarity, there is no significant change in the fraction of live registers over the entire application execution window. If one can release dead registers and reuse them immediately, then it is possible to build a GPU with a smaller physical register file while transparently allowing applications to utilize the entire architected register file space.
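For concreteness, a post-processing sketch along the following lines can turn a trace of register writes and compiler-identified last reads into the live-register fraction plotted in Figure 2.1; the event format and function name are hypothetical, and the dissertation's numbers come from the simulator rather than from this code:

#include <cstdint>
#include <map>
#include <vector>

// One trace event per register write or per compiler-identified last read.
struct RegEvent { uint64_t cycle; bool isWrite; bool isLastRead; };

// Returns, for each cycle with activity, the fraction of allocated registers that are live.
// Assumes each write opens a lifetime and each lifetime is closed by exactly one last read.
std::map<uint64_t, double> liveFraction(const std::vector<RegEvent> &trace,
                                        uint32_t totalAllocatedRegs)
{
    std::map<uint64_t, long> delta;                 // net change in live count at each cycle
    for (const RegEvent &e : trace) {
        if (e.isWrite)    delta[e.cycle] += 1;      // value produced: register becomes live
        if (e.isLastRead) delta[e.cycle] -= 1;      // last consumer done: register is dead
    }
    std::map<uint64_t, double> fraction;
    long live = 0;
    for (const auto &entry : delta) {               // walk cycles in order, accumulating liveness
        live += entry.second;
        fraction[entry.first] = static_cast<double>(live) / totalAllocatedRegs;
    }
    return fraction;
}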
2.2.2 Imbalanced register accesses
We also measured the access frequency of each architected register in an SM during the entire execution of a set of benchmarks. Figure 2.2 shows the access count difference between the most accessed and the least accessed register, collected at the end of each benchmark's execution. The most frequently accessed registers are accessed on average about 1800 times more than the least frequently accessed registers. In some applications, such as MUM, the difference exceeds 100K accesses.
Figure 2.3 presents a visual representation of the register access imbalance within an
SM. The access counts are scaled to a value between 0 and 1; 0 represents the minimum
[Figure 2.2 plots the normalized access count (log scale, 1 to 1e6) for MatrixMul, BlackScho., DCT8x8, Reduction, VectorAdd, BackProp, BFS, Heartwall, HotSpot, LUD, Gaussian, LIB, LPS, NN, MUM, ScalarProd, and AVG; the average difference is 1,777.52.]
Figure 2.2: Access frequency difference between most accessed register and least
accessed register
[Figure 2.3 panels: (a) MUM, (b) BlackSc., (c) matrixMul, (d) LPS, (e) dct8x8, (f) NN, (g) reduction, (h) LIB; each is a heat map over registers 0-21 (X-axis) and warps 0-47 (Y-axis) with access intensity scaled from 0 to 1.]
Figure 2.3: Access intensity difference across the register cells in a SM (X-axis: Regis-
ters per warp, Y-axis: Warps)
number of accesses and is represented by black, and 1 represents the maximum number of accesses and is denoted by white. Fermi supports 32768 registers (organized as 1024 registers, each 32 register elements wide) across 48 warps. Thus each warp can have at most 22 (=1024/48) registers when all 48 warps are active. The X-axis in the figure represents the registers numbered 0-21 and the Y-axis represents the warps numbered 0-47. No matter which warp we consider, the register access counts are nearly (but not exactly) identical, which is to be expected given that all warps execute the same code except when they diverge. Second, only a few registers are accessed frequently (white stripes) while the vast majority of the registers are accessed minimally (black stripes). These stark register usage imbalances lead to significant imbalances in register wearout and may lead to electromigration problems in the frequently accessed registers [19–21].
2.3 Register Usage Patterns in GPU
Figure 2.4(a) shows three representative register usage patterns seen in GPU applications. The pattern is taken from the matrixMul benchmark of the CUDA SDK used in our experimental evaluation. The corresponding assembly code is shown in Figure 2.5. We captured the lifetimes of three registers, r0, r1, and r3. The X-axis represents time, and the Y-axis represents register liveness. A dead value is represented with a Y value of zero, and a live register is represented as the next step up on the Y-axis.
In this program, r1 is written at the beginning of the program and is read at the end of the program's execution. As r1's value is read at the end of the program, the register is alive for the entire program duration; clearly r1 is a long-lived register. On the other hand, r0 is actively produced and consumed within a loop. Each group of spikes is a loop iteration, and a total of five loop iterations are shown in the figure. On each entry into the loop, r0's value is loaded from global memory and then a series of read-write sequences is performed. This is an example of a register that has multiple lifetimes, each of which is relatively short. In this figure, r3 has the shortest lifetime. It is only used for a short time window at the beginning of the program and at the end of the program.
[Figure 2.4(a) plots the liveness (LIVE/DEAD) of R3, R0, and R1 over time, annotated with: (r3) last read before the loop, (r0) several writes and reads within the loop, (r1) written at the beginning of the program and read after the loop, (r3) written again after the loop, and a marked window where the three lifetimes overlap. Figure 2.4(b) shows warp 0's through warp N's copies of R3 time-sharing a single physical register p3.]
Figure 2.4: (a) Register accesses during the execution time and (b) The idea of register
sharing
After it is consumed just before the start of the loop, r3 is never used within the loop and is only redefined after the loop.
It is worth noting that the register space wasted by short-lived registers like r3 cannot be eliminated by compiler optimization, because there is a short time window when all three registers are live concurrently, as marked with the red line in the figure. Figure 2.5 shows the compiler-generated kernel code from which the register usage patterns shown in Figure 2.4 are captured. The bidirectional arrows in the figure indicate the lifetimes of the three registers. In the two time windows where the code between offsets 0x90 and 0x108 and between offsets 0x308 and 0x378 is executed, the lifetimes of the three
[Figure 2.5 listing (PC and code, with register lifetime (LT) spans for R0, R1, and R3 marked alongside; the loop body spans 0x160 through the branch at 0x2e8):]
/*0038*/ IADD R1, g [0x7], R4;
/*0088*/ MOV R0, g [0x8];
/*0090*/ IMUL.U16.U16 R3, R2L, R0H;
/*00a8*/ IMAD.U16 R3, R2L, R0L, R3;
/*0108*/ IADD R0, R3, R0;
/*0120*/ IADD32 R0, R1, R0;
/*0160*/ GLD.U32 R0, global14 [R4];
/*0168*/ R2G.U32.U32 g [A2+0x109], R0;
/*0170*/ GLD.U32 R0, global14 [R2];
/*0178*/ R2G.U32.U32 g [A2+0x9], R0;
/*02e8*/ BRA C0.NE, 0x160;
/*0300*/ MOV R0, g [0x8];
/*0308*/ IMUL.U16.U16 R3, R2L, R0H;
/*0378*/ IADD32 R0, R3, R0;
/*037c*/ IADD32 R0, R1, R0;
Figure 2.5: Register lifetime analysis of CUDA SDK matrixMul
registers overlap. Therefore, the compiler could not simply reuse r3 in place of r0.
As this example shows, short-lived registers whose lifetimes overlap with others are unavoidable when the compiler is unable to simply reclaim and reuse a register. Even in a single-threaded application, wasting a single register is an inefficient use of space. But in the GPU context, all the threads' r3 registers have the same lifetime pattern. For instance, matrixMul assigns 6 concurrent CTAs per SM and each CTA is executed by 8 warps.
Therefore, a total of 1536 (6 × 8 × 32) copies of register r3 are dead for a long time before they are redefined.
2.4 Register Virtualization
We propose a new register management method that effectively reduces the wasted reg-
ister space. The key idea is to share the register space across the warps by flexibly
mapping architected registers to the physical register space. In GPUs, warps are sched-
uled to execute the same code at different points in time. For instance, using the two-
level scheduler [8] that is used in our baseline architecture, the schedule time difference
among the warps reaches several hundred cycles because the warps in the pending queue
can be scheduled only when the warps in the ready queue encounter long latency mem-
ory operations or pipeline stalls. Therefore, when a register's lifetime ends in one warp,
that register space can be allocated to a different warp which is beginning a new register
life cycle.
Figure 2.4(b) shows an example of register reuse for the same code sequence shown
in Figure 2.4(a). If there are N warps, say warp 0 through warp N−1, that execute the same code
but are scheduled at different points in time, it is inefficient to have separate register
spaces for the N r3s. Instead, we propose to share the physical register space for the
short lived registers that do not have overlapped lifetimes, as illustrated by p3 in the right
hand side figure. By sharing the register space, we only need one physical register to
accommodate the N r3s.
To enable register sharing across warps, it is necessary to separate architected registers
from the physical register space they occupy. CPUs have long used virtualization to
enable efficient sharing of resources by providing more virtual resources than the actual
number of physical resources [25]. In this work we will rely on the same virtualiza-
tion principle to enable register file sharing across multiple warps. To enable such a
virtualization we will rely on register renaming. Register renaming has also been used
in CPUs to avoid false data dependencies by mapping an architected register's multi-
ple value instances to distinct physical registers. But as explained in the related work
Section 2.9, we adapt CPU renaming techniques to register file virtualization within the
context of the GPU execution model. We
will rely on compile-time lifetime analysis to identify dead registers in the code. Sec-
tion 2.5 describes how the register lifetime information is collected by extending the
existing compiler analysis and how the information is conveyed to the hardware using
the metadata instructions that are already deployed in recent GPUs [11]. The metadata
instruction that is used in recent GPUs is explained in the following paragraph.
Metadata instruction: As power efficiency becomes a top design priority, vendors
have developed novel approaches to convey compile time information to hardware to
improve power efficiency. NVIDIA’s Kepler architecture conveys compile time infor-
mation to scoreboard logic to track data dependencies. Instead of tracking dependencies
at runtime, Kepler relies on the compiler to generate the dependency information and this
information is conveyed to the hardware using metadata instructions that are embed-
ded in the code [16]. A recent study found that one metadata instruction is added per
seven instructions [11], and the format and operation of the metadata instruction
resemble the explicit-dependence lookahead instruction used in the Tera computer system [26].
The information contained in the metadata instruction indicates the number of cycles
that each of the seven following instructions should wait until its dependencies are resolved.
These metadata instructions are fetched and decoded to generate control information
for upcoming instructions. To pre-process the metadata instruction, the fetch stage in
Kepler is partitioned into two separate stages: Sched. Info and Select. The Sched. Info
stage pre-processes the metadata instruction and the Select stage selects an instruction to
issue according to the metadata. In this dissertation, we leverage this mechanism to
interface compiler generated information with the hardware using metadata instructions.
2.5 Compiler and Architectural Support
2.5.1 Register lifetime analysis
Intra-Basic Block: The register management logic has to track each register's lifetime and
can release a register only when it is guaranteed to be dead. We will
rely on compiler analysis to statically identify the start and end points of the life cycle
of each register. Figure 2.6 shows five representative code examples that should be
considered by the compiler in register lifetime analysis. Each rectangle represents a
basic block. In the first scenario shown in Figure 2.6(a) an intra-basic block analysis
can be done trivially to determine lifetime. In this scenario a register is written first and
read multiple times within the basic block. It is then eventually redefined before the
end of the basic block. Here the first lifetime of the register starts and ends within the
basic block. Whenever a register is used as a destination operand of an instruction, the
previous instruction that uses the register as a source operand can release the register
after reading the value. We add one metadata flag bit per instruction source operand
to indicate whether that register can be released after that read operation. As CUDA
instructions have a maximum of three operands, three bits are used per instruction and
these metadata bits are called per-instruction release flag (pir). When a bit is 1, the
corresponding operand storage register can be released after it is read by the current
[Figure 2.6 sketches five basic-block scenarios, (a) through (e), each showing where the release of register A is placed relative to its writes and reads: within a single basic block, on diverged branch paths, and inside or after a loop.]
Figure 2.6: Register release time w.r.t. lifetime analysis
instruction. More details about these metadata bits and their organization are described
shortly.
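To make the intra-basic-block rule concrete, the sketch below shows one way such a backward last-use scan could be written. It is only an illustration of the analysis described above, not the actual pass (which, as Section 2.8 describes, is implemented inside GPGPU-Sim's PTXPlus parser); the Inst record and the liveOut set are assumptions made for the example, and the branch and loop cases discussed next would need additional conservative handling.

#include <array>
#include <set>
#include <vector>

// Hypothetical instruction record: one destination and up to three source registers.
struct Inst {
    int dst = -1;                                      // -1 means no destination
    std::array<int, 3> src{{-1, -1, -1}};
    std::array<bool, 3> pir{{false, false, false}};    // per-instruction release flags
};

// Walk one basic block backwards. A source operand gets its pir bit set when the
// register is not read again later in the block before being redefined and is not
// live at the block exit (liveOut is assumed conservative for branches and loops).
void markLastUses(std::vector<Inst>& block, const std::set<int>& liveOut) {
    std::set<int> live = liveOut;                      // liveness just after the current instruction
    for (auto it = block.rbegin(); it != block.rend(); ++it) {
        for (int i = 0; i < 3; ++i) {                  // judge each operand against live-after
            int r = it->src[i];
            if (r >= 0) it->pir[i] = (live.count(r) == 0);
        }
        if (it->dst >= 0) live.erase(it->dst);         // standard backward liveness update
        for (int r : it->src)
            if (r >= 0) live.insert(r);
    }
}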
Diverged flows: In the presence of a branch divergence, the register release information
must be conservatively set. Figure 2.6(b) and (c) show two scenarios. In both cases, the
register can be safely released at the reconvergence point because of the warps’ lock-
step execution. In the example of Figure 2.6(b), the two branch paths are traversed by a
warp sequentially. If the release information is put in each flow of the branch, the flow
that is executed first will release the registers and the threads within the same warp that
execute the other flow may get incorrect results if they use any of the released registers.
However, the main problem is that unlike in the intra-basic block case, here the register
release is not associated with the actual last use instruction of the register. Instead, it
is associated with an instruction that happens to start at the reconvergence point. It
is also possible that multiple registers may need to be released at the reconvergence
point. Hence, rather than adding metadata to an existing instruction, we add a new
per-branch release flag (pbr). The flag contains the list of architected register IDs that
can be released at the start of the reconvergence block. The details and overhead are
discussed in Section 2.5.3.
Loop: Figure 2.6(d) shows a loop where a register produced in one iteration is used
in another iteration. In this scenario, clearly there is no option to statically determine
the last use and hence the compiler can release the register only when all iterations are
complete. On the other hand, if there is no loop carried dependence on registers across
loop iterations, then it is possible to release the register after the last consumption within
the loop body as shown in Figure 2.6(e).
2.5.2 Per-instruction release flag generation
The register lifetime information is generated at compile time and embedded in the
code. As mentioned earlier, each instruction has a three-bit per-instruction release flag,
where each bit corresponds to one of the at most three source operands and indicates
whether it can be released. If the bit is 1, the corresponding operand register can be released after it is
read by the instruction. But embedding a 3-bit pir in each instruction requires significant
modification of the instruction fetch and cache access logic. To avoid this concern, we
use a 64-bit flag-set metadata instruction that is placed at the beginning of each basic block,
as shown in Figure 2.7(a). A 64-bit flag is chosen to accommodate the fact that
CUDA code is already 64-bit aligned. To keep the metadata instruction compliant with the
existing CUDA instruction set, which uses a 10-bit opcode, we simply reserve a 10-bit value
as the register release opcode, and then use the remaining bits to store 18 three-bit flags that
can cover 18 consecutive instructions within the basic block. If a basic block is longer
than 18 instructions, a flag set is inserted every 18 instructions. If the basic block
has fewer than 18 instructions then some of the flag bits are simply unused. Note that
the 10-bit register release opcode is split into two sets of four and six bits to follow the
Fermi instruction encoding format [27, 28].
In the example of Figure 2.7(a), the pir's first three bits represent the release information
for each of the input operands of the first add instruction. Let us assume that r0 is
determined to be dead after the execution of the add instruction according to the register
lifetime analysis. Since r0 is the first input operand, the corresponding pir flag bit is set
to one. The second operand, r5, is still alive and hence the corresponding flag is set to zero. There
is no third input operand for the add instruction and hence the corresponding bit in the
pir is a don't-care.
2.5.3 Per-branch release flag generation
At the diverged flows, we perform a conservative release. The registers that are referenced
across multiple flows or loop iterations are only released when the diverged flows are
reconverged. At the reconvergence point, a pbr is added. As shown in Figure 2.7(b),
the format is similar to the pir. The only difference is that every six bits represent a register
number to release. Note that each thread in Fermi can use up to 63 registers, which
can be identified by six bits. A total of nine registers can be covered by a pbr. If more
than nine registers are to be released, more pbrs are added. However, according to our
evaluation, the average number of registers that are released by a pbr is just two.
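The pbr payload can be unpacked in the same spirit. In the sketch below, a field value of zero is treated as an unused slot; how empty slots (and register r0) are actually encoded is not specified here, so that convention is purely an assumption of the example.

#include <cstdint>
#include <vector>

// Extract up to nine 6-bit architected register ids from the 54-bit pbr payload,
// assuming the same layout as the pir word: a 10-bit opcode in the low bits.
std::vector<int> pbrRegistersToRelease(uint64_t word) {
    std::vector<int> regs;
    for (int k = 0; k < 9; ++k) {
        int r = (int)((word >> (10 + 6 * k)) & 0x3F);
        if (r != 0) regs.push_back(r);                 // 0 = unused slot (assumed)
    }
    return regs;
}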
[Figure 2.7(a) illustrates the per-instruction release flag: a .pir flag set at PC 0x80 precedes a basic block (add $r4 $r5 $r0 at 0x88 through mad $r3 $r5 $r2 $r1 at 0xF8); the 64-bit word holds a 10-bit opcode and a 54-bit release-flag field whose first entry marks $r0 as released and $r5 as kept. Figure 2.7(b) illustrates the per-branch release flag: a 10-bit opcode followed by nine 6-bit register numbers to release.]
Figure 2.7: Two release flag instructions
2.5.4 Renaming table
To enable register virtualization, each architected register is mapped to a physical reg-
ister whenever it is written. When the architected register value is no longer used, the
mapping is released. The release point is provided either as a part of the pir or pbr.
Once a physical register is released, it is marked as available so that it can be remapped
to another architected register in the future. The physical register availability marking is
explained shortly.
To maintain the mapping information, a register renaming table is added to each SM.
Since registers are allocated and released per warp, the renaming table is operated per
warp. Each renaming table is indexed by combining warp id and architected register
id and contains the corresponding physical register id. In our baseline design, each
SM has a 128KB register file, which holds a total of 1024 physical registers (32 × 4
bytes each); hence, each entry of the renaming table stores the 10-bit physical reg-
ister number. The total size of the renaming table per SM is (max warps per SM ×
max regs per thread × log2(max physical register count)) bits. In our baseline,
each SM can support 48 warps and each warp can have a maximum of 63 registers. Thus
the total renaming table size is 3.8KB. We will present a simple optimization to reduce
the renaming table size and provide power and timing analyses later. In addition, we use
a register availability vector with one bit per physical register (1024 bits in total) to indicate if
a given physical register is currently assigned or free.
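Under the sizing assumptions above (48 warps, 63 architected registers per warp, 1024 physical registers), a minimal software model of the renaming table and the availability vector might look as follows. It is an illustration of the bookkeeping only, not of the hardware organization, and the helper names are invented for the example.

#include <bitset>
#include <cstdint>
#include <vector>

const int kMaxWarps    = 48;
const int kMaxArchRegs = 63;
const int kPhysRegs    = 1024;

struct RenamingTable {
    // One 10-bit physical id per (warp, architected register); -1 means "no mapping".
    std::vector<int16_t> map = std::vector<int16_t>(kMaxWarps * kMaxArchRegs, -1);
    std::bitset<kPhysRegs>  inUse;                     // physical register availability flags

    static int index(int warp, int archReg) { return warp * kMaxArchRegs + archReg; }

    int lookup(int warp, int archReg) const { return map[index(warp, archReg)]; }

    // Called when an architected register is written and has no valid mapping.
    void bind(int warp, int archReg, int physReg) {
        map[index(warp, archReg)] = (int16_t)physReg;
        inUse.set(physReg);
    }

    // Called when a pir or pbr marks the architected register dead.
    void release(int warp, int archReg) {
        int16_t& e = map[index(warp, archReg)];
        if (e >= 0) { inUse.reset(e); e = -1; }
    }
};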
Preserving compiler provided bank information: As described in Section 1.2.2, the
128KB register file in each SM is divided into 32 banks and each bank provides data to
a four-lane SIMT cluster. Thus a single warp instruction accesses eight register banks,
one bank per SIMT cluster, to read a register needed for a warp execution. Therefore,
maximum of three register banks are accessed per SIMT cluster for a warp execution.
Note that CUDA ISA uses maximum of three operands per instruction. If any of the
operands used by an instruction are in the same register bank, bank conflict occurs. GPU
compilers strive to avoid register bank conflicts by distributing the input operands across
the four banks within each SIMT cluster. Kim et al. [10] articulated that the compiler
is responsible for reducing the register bank conflict and Lai and Seznec [11] proposed
compiler-level optimizations that show the throughput improvement by reducing the
register bank conflicts. Hence, it is prudent not to ignore the compiler allocated bank
information while renaming. To preserve the compiler assigned bank information, we
restrict register renaming to find a register within the same bank as the original bank that
the compiler intended to assign.
To reflect this restriction, when finding an available physical register to map a new
register, instead of a single 1024-entry register availability vector, we use four 256-bit
physical register availability flag vectors; a 256-bit vector per bank. In order to rename
an architected register, we first identify the bank that the architected register maps to in
the absence of renaming, and then restrict searching for an available register within that
bank while preserving the main goal of power efficiency.
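The bank-preserving search can be pictured with four small availability vectors, one per bank within a SIMT cluster, as in the sketch below. It is illustrative only: the modulo bank assignment and the bank-major physical register numbering are assumptions for the example, and the lowest-index-first policy shown here is the one that favors subarray consolidation (Section 2.7.1); the wear-leveling variant of Section 2.7.2 would search round-robin instead.

#include <bitset>

const int kBanks       = 4;
const int kRegsPerBank = 256;

// One availability vector per bank; a set bit means the entry is occupied.
std::bitset<kRegsPerBank> bankInUse[kBanks];

// The bank the compiler would have assigned in the absence of renaming (assumed modulo).
int homeBank(int archReg) { return archReg % kBanks; }

// Find a free physical register in the architected register's home bank; returns a
// global id under an assumed bank-major numbering, or -1 if the bank is full.
int findFreeInBank(int archReg) {
    int b = homeBank(archReg);
    for (int i = 0; i < kRegsPerBank; ++i) {           // lowest index first: favors packing
        if (!bankInUse[b].test(i)) {
            bankInUse[b].set(i);
            return b * kRegsPerBank + i;
        }
    }
    return -1;                                          // no free register in this bank
}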
Reducing renaming table size: While 3.8KB of renaming table space can completely
rename all the available registers in an SM, it is possible to shrink the size of the renam-
ing table by using two observations. First, renaming a long lived register is not benefi-
cial since that register cannot be released and reused most of the time. Second, if any
two registers have similar lifetime per value instance, the register that has more value
instances during the program execution tends to produce less register reuse opportunity
since it lives longer through multiple value instances. For example, if registers r2 and
r3 have the same lifetime per value, but r2 is defined more often in the program than r3, then
we say r2 has more value instances. In this case we select r3 as a short lived reg-
ister instead of r2 when we cannot include both registers in the renaming table due to the size
limitation. Thus register renaming can avoid renaming long lived registers and those
registers that have more value instances. Using this approach we can limit the register
renaming table size and then select only those registers that benefit most from renaming
to fit within the renaming table size. In our approach we limit the renaming table size to
1KB, which is 25% of the full renaming table.
To determine which registers are candidates for renaming, we calculate the esti-
mated register value lifetime at compile time. The value lifetime can be calculated
at compile time by counting the number of instructions between the write point and
the next release point in the code. Then, the registers are sorted using their life-
time. Once the registers are sorted based on lifetime, the compiler only picks the top
⌈(1024 × 8) bits / (10 bits × #warps per CTA × #max concurrent CTAs)⌉ registers for renaming. Only for the
selected registers does the compiler insert pir and pbr flags to release these registers, while
[Figure 2.8 sketches the front end with the added structures: the Sched. Info stage probes a release flag cache (indexed by the pir's PC and holding the per-operand release flags) before fetching from the instruction cache, and the register renaming table sits beside the four-banked register file, its arbitrator, and the operand collector, so that issued instructions such as an add carry the release bit for each of their operands.]
Figure 2.8: Register renaming table and release flag cache
non-renamed architected registers are never released. The renaming-exempted registers
are assigned the lowest N register ids and then N is given to the hardware so that the
hardware only stores mapping information for the registers with ids higher than
N. The lowest N registers are directly mapped to the lowest N physical registers.
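The selection step can be summarized with a small sorting routine like the one below, which ranks registers by estimated value lifetime and, as a tie-breaker, by the number of value instances, and then keeps only as many as fit in the 1KB table. The RegStats record and the slots parameter (derived from the expression above for a given kernel) are assumptions made for the illustration.

#include <algorithm>
#include <vector>

struct RegStats {
    int    id;               // architected register id
    double avgLifetime;      // estimated instructions between a write and its release
    int    valueInstances;   // number of definitions of this register in the kernel
};

// Rank by shortest estimated lifetime, then by fewest value instances, and keep the
// top 'slots' registers; the rest are exempted from renaming and get the lowest ids.
std::vector<int> selectRenamingCandidates(std::vector<RegStats> regs, int slots) {
    std::sort(regs.begin(), regs.end(), [](const RegStats& a, const RegStats& b) {
        if (a.avgLifetime != b.avgLifetime) return a.avgLifetime < b.avgLifetime;
        return a.valueInstances < b.valueInstances;
    });
    std::vector<int> chosen;
    for (int i = 0; i < (int)regs.size() && i < slots; ++i)
        chosen.push_back(regs[i].id);
    return chosen;
}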
Renaming table organization: The renaming table consists of four banks so that the
operands can look up the physical register index concurrently. When there is a bank
conflict, the name lookup may be serialized. According to our detailed gate level sim-
ulation of this register renaming table, the access latency of the optimized 1KB renaming
table is 0.22ns. In our evaluations we conservatively assume that this renaming
operation cannot be absorbed in the existing pipeline delays, and hence one extra cycle
of pipeline latency may be incurred for the renaming table lookup.
2.5.5 Flag instruction decoding
To provide the register lifetime information, the compiler adds release flag metadata instruc-
tions. As noted in Section 1.2.2, modern GPUs support metadata instruction decod-
ing for different power efficiency considerations. We rely on the same mechanisms to
decode the release flag information. The 10-bit opcode is used to first determine if the
instruction type is pir or pbr. Once this determination is made the remaining 54 bits
are simply used to determine which registers are dead when each of the following 18
instructions complete their execution.
Even though the flag instructions can be simply fetched and decoded as normal metadata
instructions, to reduce the potential power overhead due to the repeated flag instruction
fetches from the regular instruction cache, we use a release flag cache to store the 54-bit
flags of pir instructions. Figure 2.8 shows the interaction between the register renaming
table and the release flag cache. The release flag cache is a direct mapped cache and
is shared across the warps. It is indexed by the PC of the pir instruction. Multiple
warps that are part of a single CTA all execute the same kernel. Since warps within a
CTA are scheduled closely in time by most existing warp schedulers, such as round-
robin and two-level schedulers, the sequence of instructions executed by different warps
exhibits strong temporal locality. Thus, not every warp needs to maintain an exclusive
copy of the same pir. We add release flag cache access logic to the fetch stage that
selectively fetches the pir instructions from the instruction cache only when there is
a miss in the release flag cache. Every cycle the fetch stage probes the release flag
cache and if the PC is a hit in the release flag cache, the pir instruction is not fetched from
the regular instruction cache and the program counter is incremented to fetch the next
instruction. If the instruction fetch width is bigger than one instruction, a pir may end up
getting fetched anyway alongside the regular instructions. Nevertheless, the redundantly
fetched pir can be easily detected and is not fed to the decoder logic. In this scenario,
while the fetch bandwidth is not altered, at least the decoding cost can be avoided.
When a regular instruction is fetched in the Select stage, the corresponding three-bit flag is
also fetched from the release flag cache to determine if any of the operand registers are
dead after the current instruction reads the register. The release flag cache is maintained
as a direct mapped cache; whenever a PC misses in the release flag cache, the
instruction is fetched from the regular instruction cache and decoded. At the decode
stage, if the instruction is determined to be a pir, its 54-bit release flags are stored in
the cache by replacing the existing entry.
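A minimal model of this probe-and-fill operation, under the ten-entry configuration evaluated below, is sketched here. The index function (PC divided by the 8-byte instruction size, modulo the number of entries) is an assumption of the example rather than the implemented hash.

#include <cstdint>

const int kFlagCacheEntries = 10;

struct FlagCacheEntry {
    bool     valid = false;
    uint64_t pc    = 0;       // PC of the pir instruction, used as the tag
    uint64_t flags = 0;       // 54-bit release flag payload
};

FlagCacheEntry flagCache[kFlagCacheEntries];

int flagIndex(uint64_t pc) { return (int)((pc >> 3) % kFlagCacheEntries); }

// Probe at fetch time: on a hit, the pir need not be fetched or decoded again.
bool probeFlagCache(uint64_t pc, uint64_t* flagsOut) {
    const FlagCacheEntry& e = flagCache[flagIndex(pc)];
    if (e.valid && e.pc == pc) { *flagsOut = e.flags; return true; }
    return false;
}

// Fill after a decoded pir missed in the cache, replacing the existing entry.
void fillFlagCache(uint64_t pc, uint64_t flags) {
    flagCache[flagIndex(pc)] = {true, pc, flags};
}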
To determine the appropriate number of entries in a release flag cache, we measured
the dynamic code increase while varying the size of the release flag cache. According
to our results, the dynamic code increase is reduced as more entries incorporated to a
release flag cache and the reduction is saturated when ten entries are used. Therefore,
we design a release flag cache to have 10 entries. As each release flag is 54 bits long,
the total release flag cache size is 68B. Detailed results are discussed in Section 2.8.3.
The benefits of caching pbr are not significant as these instructions do not control the
release of individual register operands of other instructions. Rather, they simply release
the specified registers. When a pbr is fetched, the register mapping table is looked up,
the mapping information is removed, and the corresponding bit of the physical register
availability flag is cleared.
2.6 A Register Allocation Example
[Figure 2.9 walks through the allocation in four panels. (a) Warp 5 executes mov $r0 $r2 and wants to write to $r0. (b) The physical register availability flag is looked up to find a free physical register for $r0; $P0 is selected. (c) The availability bit for $P0 is set and the renaming table entry, indexed by warp id × max registers per thread + register id, is updated. (d) $r0 of warp 5 is written to $P0 in all eight SIMT clusters at the writeback stage.]
Figure 2.9: An example of the proposed register allocation: 1) physical register avail-
ability flag is looked up to find an available physical register, 2) renaming table is
updated, and then 3) the reserved physical register is written
Figure 2.9 shows an example sequence of the proposed register allocation. An SM has
a renaming table, eight register files (one per SIMT cluster), and a physical register
availability flag. Each register file consists of four banks. Assume that warp 5 tries to
store a value to $r0 and the source register $r2 is not released at this instruction. In
step 1, the physical register availability flag is looked up to find a free physical register for
$r0. There can be various architected-to-physical register mapping policies depending
on how register virtualization is exploited, either for power efficiency or reliability. For
instance, virtualization can be used to improve power efficiency by power gating unused
registers. Various mapping policies are discussed in detail in Section 2.7 and, depending
on the policy, a given physical register $P0 is found by looking up the physical register
availability flag vector. In step 2, the physical register availability flag is modified so that
the bit for $P0 is set. Then, the renaming table is modified so that warp 5
can find $r0's value in $P0 in the following instructions. The renaming table entry
is found by using the combined warp id and register id, as illustrated in step 2. Finally, in step 3,
$r0's new value is written to $P0 at the writeback stage. Note that all the $P0s
in the eight SIMT clusters are written together as the register renaming is done at the
granularity of a warp.
In a traditional out-of-order CPU a new physical register is assigned for every instruc-
tion’s destination register. However, in our proposed register virtualization a physical
register mapping of an architected register stays the same until that mapping is explicitly
removed through a register release flag. Hence, the register mapping described in the
example above occurs only when the architectural register does not have a valid map-
ping. The mapping is invalid only in two cases: 1) at the first access to
the architectural register and 2) at the next definition of an architected register after the
architected register is released by a pir or pbr instruction.
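The write-path decision can therefore be expressed as a single lookup followed by a conditional allocation, as in the following sketch. The mapping container and the allocPhysReg callback stand in for the renaming table and the bank-aware availability search described earlier; both are assumptions for the illustration.

#include <unordered_map>

// Keyed by (warp id, architected register id); the value is the physical register id.
std::unordered_map<long long, int> mapping;

long long mapKey(int warp, int archReg) { return (long long)warp * 64 + archReg; }

// On a destination write: a new physical register is claimed only when the
// architected register has no valid mapping, i.e., on its first access or after it
// was released by a pir/pbr. allocPhysReg models the availability-flag search.
int physForWrite(int warp, int archReg, int (*allocPhysReg)(int)) {
    auto it = mapping.find(mapKey(warp, archReg));
    if (it != mapping.end()) return it->second;        // mapping still valid: write in place
    int p = allocPhysReg(archReg);
    mapping[mapKey(warp, archReg)] = p;
    return p;
}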
[Figure 2.10 compares register file occupancy with and without renaming: each register file is drawn as four banks, each split into four subarrays. (a) Without power gating, every subarray remains connected to Vcc. (b) With power gating, renaming consolidates the occupied entries into fewer subarrays and the unused subarrays are shut down through their array_sleep_en sleep transistors.]
Figure 2.10: Effect of register renaming: (a) without power gating (b) with power gating
2.7 Use-Cases
2.7.1 Static power reduction
In this section we show how virtualization can also be effectively used for reducing the
static power consumption without altering the physical register file size.
The register lifetime analysis allows us to power gate all the dead registers. We explored
a conventional subarray level power gating approach [29] in the evaluation section. Tra-
ditionally large register files are designed using subarrays of SRAM. In this design each
subarray has its own access logic and can be easily modified to independently provide
power to each subarray. The subarray level power gating shuts down a whole subarray
when there is no active register in the subarray. Figure 2.10 shows an example when
subarray level power gating is used. The two large blocks in Figure 2.10(a) represent
the two register files, one without register renaming and the other with register renaming
enabled, and we show four columns in each block to denote four register banks. The
white entries are the active registers and the gray ones are unused register entries. The
four horizontal partitions separated by dotted lines show the four subarrays. The left
Figure 2.11: Leakage under various technologies (P: Planar, F: FinFET)
hand side register file shows the active registers distribution when the default register
allocation approach is used and the right hand side register file shows register usage
when the proposed register renaming is used. By using the register lifetime information
given by the compiler, the number of active registers can be reduced. Then, by using
the architected register to physical register mapping, the active registers are consoli-
dated into a smaller number of subarrays in each bank. Using a single sleep transistor for
the entire subarray, as in Figure 2.10(b), all three unused subarrays can be power gated.
Subarray-level power gating can be enabled by simply changing the register allocation
policy. Whenever a new register is allocated, we search the available register pool within
each subarray range first so that a new subarray is turned on only when the already active
subarrays are filled up.
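A sketch of this subarray-first allocation policy is given below. The subarray count, the per-subarray capacity, and the flat register numbering are assumptions chosen only to keep the example small; wakeup is modeled as flipping a poweredOn flag, with the wakeup latency treated separately (Figure 2.14).

#include <bitset>

const int kSubarrays       = 4;
const int kRegsPerSubarray = 256;

std::bitset<kRegsPerSubarray> used[kSubarrays];
bool poweredOn[kSubarrays] = {true, false, false, false};

// Fill already-powered subarrays first; wake a new subarray only when they are full.
int allocSubarrayFirst() {
    for (int s = 0; s < kSubarrays; ++s) {
        if (!poweredOn[s]) continue;
        for (int i = 0; i < kRegsPerSubarray; ++i)
            if (!used[s].test(i)) { used[s].set(i); return s * kRegsPerSubarray + i; }
    }
    for (int s = 0; s < kSubarrays; ++s) {              // all powered subarrays are full
        if (!poweredOn[s]) {
            poweredOn[s] = true;                         // wakeup event (rare in practice)
            used[s].set(0);
            return s * kRegsPerSubarray;
        }
    }
    return -1;                                           // register file exhausted
}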
Register file leakage power with FinFET transition: Leakage power is expected to
considerably increase every technology generation [30]. Researchers have investigated
various device-, circuit-, and architecture-level techniques such as devices with multiple
threshold voltages [31–33], power-gating [34, 35], and voltage scaling [36–38] to mini-
mize the leakage power consumption. In particular, almost every technology generation
has to introduce an innovative device (e.g., high-K/metal-gate strain-enhanced transis-
tors in 45nm technology) just to maintain a constant Ioff/μm [39]; otherwise, it would
have been impossible to improve device speed without substantially increasing leakage
power. The same goes with the technology transition from 32nm planar MOSFET [40]
to 22nm FinFET devices [41]. 22nm FinFET devices somewhat improve Ioff/Ion (but
not significantly) compared to 32nm MOSFET devices, as seen in [41]. The same result
is also confirmed in Figure 2.11, where we use GPUWattch [17]¹ to plot the fraction
of leakage power over total GPU chip power when a GPU is designed with 32nm and
22nm MOSFET, and 22nm/16nm/10nm FinFET technologies, normalized to a GPU
with 40nm technology. Without the introduction of FinFET devices, a much larger frac-
tion of power consumed by the 22nm MOSFET GPU would have been leakage power.
FinFET brings the leakage power back to the baseline, but the climb continues from
the new reset point as seen in Figure 2.11. In other words, minimizing leakage power
through various circuit- and architecture-level techniques will continue to be important
even in current and future (FinFET) technology generations. Lastly, the GPU register
file has been responsible for a large fraction of total power in GPUs (e.g., 15% from our
estimation and [17, 43]).
¹ We take PTM [42] for 32nm and 22nm MOSFET, and 22nm, 16nm and 10nm FinFET to update
the technology related parameters in GPUWattch with some assistance from the PTM developers. The
estimated power consumption numbers also agree with the numbers reported in another more recent GPU
power estimation model [43].
2.7.2 Wear-leveling of register usage
As observed in Section 2.2.2, registers have significant imbalances in access counts
during a kernel execution. One of the major causes of wearout failure is electromigration.
The mean time to failure due to electromigration is inversely proportional to the activity
factor [19, 20]. Therefore, frequently accessed registers are likely to have more failures
in a given time period.
The current GPU’s register management scheme statically maps each architectural reg-
ister to a physical register. Register virtualization can release the mapping between an
architectural register and a physical register whenever the architectural register becomes
dead. To reduce access imbalances our proposed register allocation scheme gives the
lowest priority to the just released physical register during the next reassignment request.
To enable this scheme, when a new register is allocated later, the renaming logic uses a
simple round-robin mechanism to search the register availability flag vector, thereby giv-
ing the lowest priority to the most recently released registers.
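A round-robin search of the availability flag vector can be sketched as follows; the single flat vector and the pointer update shown here are simplifications of the per-bank vectors used by the actual design, introduced only for illustration.

#include <bitset>

const int kTotalPhysRegs = 1024;

std::bitset<kTotalPhysRegs> occupied;
int rrPointer = 0;                       // resumes just past the previous allocation

// Scan forward from the pointer so that a just-released register is the last entry
// to be considered again, spreading allocations (and wear) across all registers.
int allocRoundRobin() {
    for (int n = 0; n < kTotalPhysRegs; ++n) {
        int p = (rrPointer + n) % kTotalPhysRegs;
        if (!occupied.test(p)) {
            occupied.set(p);
            rrPointer = (p + 1) % kTotalPhysRegs;
            return p;
        }
    }
    return -1;                           // no free physical register
}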
Note that the round-robin mapping method is not compatible with subarray level power
gating. Subarray level power gating tries to consolidate the live registers into as few sub-
arrays as possible, while reducing access imbalance requires the live registers to be dis-
tributed across many subarrays. However, if we provision the register file such that
individual register entries can be independently power gated, then even with the wear
leveling mapping method, individual register level power gating can be used. There-
fore, under this mapping method, we can mitigate the wearout issue as well as improve power
efficiency by employing individual register level power gating.
Name # CTAs # Thrds/CTA # Regs/Kernel Conc. CTAs/Core
MatrixMul 64 256 14(7) 6
HotSpot 1849 256 22(20) 3
Blackscholes 480 128 18(16) 8
ScalarProd 128 256 17(11) 6
DCT8x8 4096 64 22(19) 8
NN 168 169 14(8) 8
Reduction 64 256 14(8) 6
LUD 15 32 19(12) 6
VectorAdd 196 256 4(3) 6
Gaussian 2 512 8(6) 3
BackProp 4096 256 17(12) 6
LIB 64 64 22(17) 8
BFS 1954 512 9(6) 3
LPS 100 128 17(16) 8
Heartwall 51 512 29(23) 2
MUM 196 256 19(17) 6
Table 2.1: Workloads
Parameter Renaming table Register bank
Size 1KB 4KB
# Banks 4 1
Vdd 0.96V 0.96V
Per-access energy 1.14 pJ 4.68 pJ
Per-bank leakage power 0.27 mW 2.8 mW
Table 2.2: Register renaming table and register bank energy in 40nm technology
2.8 Evaluation
We used GPGPU-Sim v3.2.1 [44] to evaluate the proposed register virtualization
scheme. We assumed that a GPU has 16 SMs and an SM has a 128KB register file which
is partitioned into 32 banks, as in Fermi [7]. A two-level warp scheduler is used and the
ready queue size is set to six warps. For the compiler, nvcc v4.0 and gcc v4.4.5 are
used. Two schedulers concurrently issue two instructions every cycle. The renaming
table and the register bank power parameters shown in Table 2.2 are calculated by using
CACTI v5.3 by assuming 40nm technology.
For the workloads, we used 16 applications from NVIDIA CUDA SDK [45], Parboil
Benchmark Suite [46], and rodinia [47]. The number of CTAs, threads per CTA, regis-
ters used per kernel, and the number of concurrent CTAs per SM are listed in Table 2.1.
The value in parentheses in the # Regs/Kernel field is the minimum number
of registers required to avoid register spills. The values outside the parentheses
in the same field are the register counts that include the address registers and condition
registers. We used PTXPlus for register analysis. We modified the ptx parser code in
GPGPU-Sim for analyzing the register lifetime and inserting the two new flag instruc-
tions. GPGPU-Sim provides a detailed ptx parsing code that includes basic block recog-
nition and control flow analysis. We traced the source and destination operands of each
instruction to figure out the release points for each register accurately.
2.8.1 Register size savings
Figure 2.12 shows the percentage of reduced register allocations with register virtual-
ization. We counted the number of physical registers that were actually used (touched
at least once) during the renaming process; this is essentially the maximum number of
concurrently live registers during any instance in the program execution. We then sub-
tract this count from the total allocated registers to find the total reduced register counts
and plotted it as a fraction of the total registers allocated by the compiler. Register allo-
cation is reduced up to 43%, and on average 16% of the register space is eliminated from
register allocation. Applications with short kernel size (such as VectorAdd) saw smaller
register savings. There is less chance for the dead registers to be reused for other warps
Figure 2.12: Register allocation reduction
due to short execution time. Applications that have longer execution time derive higher
register savings and hence our approach is particularly appealing to large kernels.
2.8.2 Static power saving with power gating
Figure 2.13: (Left) static power reduction with power gating and register renaming.
(Right) dynamic power increase (%) due to renaming table accesses over the register
file access power.
We explored the expected static power savings by applying subarray level power gating.
We follow the power gating model of CACTI-P [29]. The left graph in Figure 2.13
shows the static power reduction when the new register management method is used
Figure 2.14: Sensitivity on subarray wakeup latency
with the power gating. MUM used many subarrays even with register renaming and
hence provided limited improvements with subarray level power gating. Overall, the
average power saving of the register renaming is 47%.
We also measured the performance degradation with regard to subarray wakeup delay
as shown in Figure 2.14. We used CACTI-P [29] to measure the wakeup delay for
our register file subarray structure. CACTI-P estimated the wakeup delay to be less
than one cycle. Nonetheless, for exploration purpose, we used a wakeup delay of 1,
3 and 10 cycles. The performance overheads are less than 2% even with a wakeup
delay of 10 cycles. The reason for this low performance impact is that during program
execution the number of subarray wakeup events was negligibly small compared to the
total execution cycles.
We also measured the static and dynamic power overhead of the renaming table as
shown in Table 2.2. A four banked 1KB renaming table consumes 38% of a 4KB regis-
ter bank static power. Therefore, the total static power overhead due to renaming table
is 1.2% as we use one renaming table per SM while there are 32 register banks. The
dynamic power overhead of the renaming table is shown in the right hand side graph
Figure 2.15: Static code increase and dynamic code increase w.r.t. # entries in a release
flag cache
Figure 2.16: Per SM renaming table size without constraints and normalized register
saving with 1KB constraint
in Figure 2.13. The dynamic power of the renaming table and the register file is calcu-
lated by accumulating the total accesses to the renaming table and then applying the per
access power consumption values. The dynamic power overhead due to the renaming
table is 0.02% of the register file’s dynamic power.
2.8.3 Static and dynamic code increase
Figure 2.15 shows the static and dynamic instruction increase when using register
renaming due to the new metadata instructions that were added to the code. The dynamic
instruction count was measured as the number of instructions decoded by varying the
number of entries in a release flag cache. The integer value next to Dynamic- indicates
the number of release flag cache entries used for the evaluation (e.g., Dynamic-5 uses a
five-entry release flag cache). As pbr and pir do not issue any instruction to the exe-
cution units, the only overhead that is caused by the two added instructions occurs in
the decoder logic. However, as a pir is shared across multiple warps, a new pir is fetched and
decoded only when it is not in the release flag cache. Therefore, the dynamic instruction
increase is much less than the static instruction growth when more entries are added to a
release flag cache. Overall, the increased dynamic code (11% without release flag cache
as shown by Dynamic-0) is almost eliminated when using a ten-entry release flag cache
(0.2% dynamic code increase as shown by Dynamic-10).
2.8.4 Renaming table size
The left hand side chart of Figure 2.16 shows the renaming table size without constrain-
ing the size of the table. Almost all the workloads used in the evaluation can rename
the registers by using a 1KB renaming table, except for three workloads: MUM, Heart-
Wall, and LUD. Our worst case renaming table size estimation assumes 48 warps, each
accessing 63 architected registers, but this is only an upper limit that is not reached.
Thus when we constrain the renaming table size to 1KB only, these three benchmarks
were forced to eliminate a few long lived registers from the renaming process. The total
number of exempted registers is 2 out of 19 in MUM and LUD, and 4 out of 29 in Heart-
Wall. These exempt registers are assigned a physical register and were never renamed.
The right hand side graph of Figure 2.16 shows the impact of constraining the renaming
table size. Since some of the registers were exempt from renaming a few opportunities
to reuse those long lived registers (after dead) were lost. As expected, HeartWall’s reg-
ister saving is reduced the most among them because it can not rename 13% of total
registers.
2.8.5 Wear leveling
[Figure 2.17 shows per-register access heat maps for (a) MUM, (b) BlackScholes, (c) matrixMul, (d) LPS, (e) dct8x8, (f) NN, (g) reduction, and (h) LIB.]
Figure 2.17: Register access frequency variation when the proposed register management
method is applied (X- and Y-axes are the same as in Figure 2.3)
We also measured the effectiveness of the proposed register management method on
wear leveling. Note that for wear leveling experiments, as described earlier, we changed
the architectural to physical register mapping algorithm to round-robin to enable uni-
form allocation across all available registers. The coefficient of variation of register
accesses is measured for each application and compared with those measured using
Figure 2.18: Register access frequency variation
default GPU register management without any renaming. Figure 2.18 shows the com-
parison. The coefficient is greater than one in most of the applications when the basic
register management is used, which means there is a significant imbalance in access
frequency. On the other hand, with the proposed register management method the coef-
ficient of variation reduces on average to 0.2, which implies that all the registers are
accessed more uniformly. Only in vectorAdd, which uses only four registers, did we not see
much improvement, since the opportunity to rename within four registers was limited.
Figure 2.17 shows the new visual representation of the register access imbalance within
an SM. As in Figure 2.3, the accesses are scaled to values between 0 and 1 to show
the relative access count difference amongst all the registers. Clearly, with the proposed
register management method the strong white and black bands that were seen in Fig-
ure 2.3 were eliminated. Accesses are distributed more evenly across all the registers.
In some applications such as NN, the darker and lighter colored cells are mixed. This
may indicate a high variance but it is only because the absolute access counts on the
registers are much smaller than the other applications that have all lighter colored cells
(i.e. reduction, BlackScholes, and dct8x8) and hence even a small variation is magnified
due to this relative measure.
2.8.6 Comparison with hierarchical register file approach
Figure 2.19: Comparison with three level register file approach [2]. (a) Register alloca-
tion reduction and (b) normalized static power of MRF when our approach is used
We also compared our register saving and static power reduction with prior art. Gebhart
et al. [2] proposed to use a three level register file structure. By adding a small operand
register file (ORF) and a last value register file (LRF), they reduced the dynamic power
consumption of short lived register accesses. As short lived registers are stored in the ORF
or LRF, their approach can also reduce the usage of the main register file (MRF). They
defined a strand as a code segment in which all dependences on long latency instructions
are from operations issued in a previous strand. Then, they categorized registers as long
lived registers when the value lifetime spans across a strand. The long lived registers are
allocated in MRF and the others are allocated in either ORF or LRF. According to their
evaluation, 3 ORF entries per thread and 1 LRF entry per thread are the power optimal
configuration. By allocating only the registers that have lifetime spanning across a strand
to the MRF and adding 4 small register file entries per thread for each warp in the ready queue,
we measured the total allocated registers and static power consumption of the three level
register file approach.
Figure 2.19(a) shows the reduction in register allocation achieved by our proposed method
relative to the register allocation measured when the three level register file approach is used.
Even though the three level register file approach only keeps long lived registers in the MRF,
register virtualization requires 3 to 50% fewer register allocations than the MRF. Register
virtualization does not restrict the lifetime boundary when renaming registers, whereas
the three level register file approach categorizes all the registers that live across strands as long
lived registers. Figure 2.19(b) shows the static power consumption of the MRF. Note that for
this experiment, we assume that MRF can be power gated, which was not proposed in
their work, but we assume that such an extension can be made easily. In this experiment,
individual register level power gating is applied. For the three level register file approach,
we assumed that register cells that are not allocated to any register are never turned on.
Our approach consumes up to 68% less static power than simply applying power gating
to MRF, as virtualization proactively releases dead register spaces.
Even though the three level register file approach can effectively reduce the dynamic power
of short lived register accesses, as the size of the MRF is much larger than the ORF or LRF
(the MRF is 42 times larger than the combined size of the ORF and LRF), an efficient register
utilization mechanism like our proposed method is necessary to get the full benefits of
static power savings.
2.9 Related Work
Register renaming in CPUs: In this work we propose to rely on register renaming to
achieve register virtualization. Prior register renaming enhancements in CPUs can be
broadly categorized into two categories: hardware-assisted early release of dead regis-
ters and hardware-software cooperative release of registers.
Hardware-assisted eager register release in CPUs: Moudgill et al. [48] proposed a
hardware-only method to release dead registers early. The scheme primarily relies on
the register renaming stage in out-of-order processors to detect how many instructions
are going to read a particular architected register before that register is redefined. This
value is dynamically computed and stored as a counter associated with each physical
register. After each use the counter is decremented and when it reaches zero, the physical
register is eagerly released. Several other approaches use this baseline scheme to reduce
power [49], improve reliability [50], and to implement fast checkpointing [51, 52].
Associating a counter with each physical register has very high overhead since the GPU phys-
ical register file is much larger; 64K registers in the Maxwell GPU [53], versus about 400
registers in Intel Haswell [14]. Furthermore, counter-based reclaiming is ineffective in
GPUs due to the smaller instruction window size per warp.
Hardware-software cooperative register release in CPUs: Martin et al. [54], Jones
et al. [55] and Lo et al. [56] proposed software-hardware cooperative register renaming.
Martin et al. [54] use dead value information (DVI) instructions to release the regis-
ters that are dead at the boundary of procedure calls. Their approach primarily takes
advantage of the semantics of procedure calls to identify the two types of dead registers:
callee saved registers and caller saved registers. An explicit DVI instruction is inserted
to the code to deallocate those dead register spaces. They further proposed to avoid sav-
ing/restoring dead registers from the calling convention. In Jones' approach [55], the com-
piler identifies single use registers, which are read only once during execution, and then
marks the last use instructions of the single use registers. To support precise exceptions,
Jones [55] proposed a checkpoint whenever the single use register is released. In Lo’s
method [56], the compiler marks the instruction that last uses a register value so that the CPU
releases the register as soon as that instruction is executed. Given that our work focuses
on GPUs, there are new challenges and opportunities when renaming is used in this con-
text. In a CPU, only one path of a diverged flow is executed, while in the
context of GPUs warps traverse all the possible flows sequentially. Hence, GPU register
release points differ from CPUs, which is properly dealt with in our work. Furthermore,
in [56] they do not consider how to reduce the overhead of the release instructions. In our
work, we exploit the fact that warps in a GPU execute the same code segment. Hence,
the register release metadata instructions that are shared across the warps can be cached
to effectively reduce the fetch and decode overhead of these instructions. We also show
how early release of physical registers can be used to design a GPU with fewer phys-
ical registers than the architected registers without curtailing thread level parallelism.
We further curtail the static power consumption by proposing to gate unused register
subarrays that were created due to renaming.
Other orthogonal approaches to improve register efficiency in CPUs: Previous stud-
ies leveraged the properties of data stored in registers to improve register file efficiency.
Jourdan et al. [57] exploit the value redundancy in the register file to map several logical
registers to the same physical register thereby saving physical register space. Ergin et
al. [58] proposed register packing inspired by an observation that a large percentage of
instructions produce narrow-width results. Lozano and Gao [59] used register renaming
to avoid unnecessary commit in an out-of-order CPU by checking whether the last use
instruction already consumed the data.
GPU register file renaming: To the best of our knowledge, there has not been any
detailed study that explored the benefits of register reclaiming for GPUs. An NVIDIA
patent [60] proposed a hardware-only dynamic register allocation and deallocation
approach. Once a register space is allocated, the space is deallocated when a new value
is written to the architected register. Hence, their patent claims that registers may be
reused at the redefinition time of that register. Their approach does not use any com-
piler knowledge or lifetime analysis. By using register lifetime information, we can provide more
aggressive register release that leads to lower register demand.
Power efficiency in GPUs: To reduce the dynamic power, [8] proposed to use a small
register file cache. They store newly written register values in a small register cache
so that the registers can be read from the cache rather than the huge register file. In
[2], the authors enhance the register file cache by adding two small register files. More
recently, [61] proposed to split the 128-bit wide MRF bank into two 64-bit wide bank
slices to reduce the dynamic power overhead due to the large register access size. [23]
proposed a tri-modal register file to reduce static power where registers are pushed to the
drowsy mode after each access. But the fundamental approach to register assignment
and release were unperturbed in all the approaches. We show that by using lifetime
analysis the demand on the physical register file can be significantly curtailed using
register renaming. We use register renaming as a foundational enabler that allows us
cut the register file size into half reducing both dynamic and static power. Then using
subarray level power gating we can further cut the static power.
2.10 Chapter Summary
In this chapter, we first show that the lifetimes of different registers vary dramatically
in GPUs. Many dead registers continue to be powered thereby wasting static energy.
Furthermore, there are significant imbalances in register usage counts across different
registers. To address these drawbacks of the default register management scheme,
we propose register virtualization. It flexibly maps architectural registers to physical
registers based on compiler generated register lifetime information. Then, by simply
tuning the register allocation algorithm, the spare register space is reused for multiple
purposes such as power efficiency and improving reliability.
Chapter 3
GPU-Shrink: A GPU Design With
Half-Sized Register File Using Register
Virtualization
3.1 Introduction
In the previous chapter, we showed that register virtualization effectively reduces the
register usage demand, as plotted in Figure 2.12. In this chapter, we propose a new GPU
design that uses more aggressive register resource management, which cuts the size of the
register file by up to 50%. The new design, called GPU-Shrink, under-provisions
the physical register file to be smaller than the architecturally defined register file size
needed to support multiple concurrent thread contexts. In other words, GPU-Shrink
with 50% under-provisioning has only a 64KB register file per SM, compared to the 128KB
register file used in our baseline.
Figure 3.1 plots the benefits of shrinking the register file, both in terms of reduc-
ing dynamic power and static power. These results were generated by using
GPUWattch [17] starting with the 128KB banked register file organization as the base-
line. Compared to baseline with no shrinkage, reducing the register file by half reduces
Figure 3.1: Power versus register file size
dynamic power consumption by 20% and reduces the overall power (leakage and
dynamic) by nearly 30%.
But the unique aspect of our reduced register file design is that it is transparent to the
application/compiler layer. The compiler is free to use all the registers in the baseline
without any restrictions. The register management hardware simply renames registers
using the reduced physical register size. Hence, the availability vector of 1024 registers
is now reduced to 512 registers only. If the cumulative number of live registers across all the CTAs
concurrently running on an SM is less than 512, then there is no application-perceived
difference between GPU-Shrink and a regular GPU with renaming.
However, as shown in Figure 3.2, seven of the 16 benchmarks allocate more than 512
registers even when using register virtualization. Those benchmarks that require more
than 512 registers need a register throttling mechanism. Otherwise, there will be no
Figure 3.2: Number of warp level registers allocated when using register virtualization
available physical registers once all the physical registers are already assigned, which
results in a deadlock.
3.2 Register Throttling
To guarantee forward progress, one approach is to reserve a minimum set of registers
required for executing one warp. When the minimum register availability reaches this
limit then no further warps will be allowed to execute. However, reserving just enough
registers to enable a single warp to execute is insufficient due to the current GPU com-
putation model that supports barrier operations. Figure 3.3 is an example GPGPU kernel
code, which is taken from matrixMul application of NVIDIA CUDA SDK [3]. Two bold
style syncthreads() statements in the kernel code are barrier function calls. When
syncthreads() is called, threads wait until all the other threads within the same CTA
reaches the barrier. syncthreads() is typically called after data movement from global
memory to shared memory and vice versa. Without barrier, a warp may read from a
shared memory location before the data is not yet copied from global memory to the
59
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // initialization..
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        AS(ty, tx) = A[a + wA * ty + tx];
        BS(ty, tx) = B[b + wB * ty + tx];
        // Synchronize to make sure the matrices are loaded
        __syncthreads();
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += AS(ty, k) * BS(k, tx);
        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        __syncthreads();
    }
    // Write the block sub-matrix to device memory;
    C[c + wB * ty + tx] = Csub;
}
Figure 3.3: An example GPGPU kernel code (matrixMul of NVIDIA CUDA SDK
3.0 [3])
Similarly, without a barrier blocking all threads within a CTA, it is not guaranteed that the final computation results stored in shared memory are correctly copied to global memory. Therefore, global memory accesses should be performed only after all the warps reach a barrier placed after the data-copy statements. If an application uses such barrier operations, it is insufficient to reserve just enough registers to enable a single warp to progress. Rather, enough registers must be reserved to enable
an entire CTA to make forward progress. An SM may be assigned multiple CTAs, and at any given instance each CTA may require a different number of registers to complete its execution. In this case GPU-Shrink must guarantee the availability of as many registers as are needed by the CTA with the least register demand. If the number of available registers falls below the registers required by the CTA with the least register demand, no further register allocations can be granted to any other CTA.
To enable register usage throttling we propose a simple modification to the warp scheduler. The modified warp scheduler handles register shortage by simply placing some warps in the SM into a pending queue. We use the following implementation to guarantee progress. The maximum number of registers required for executing a CTA can be obtained from the GPU compiler. For instance, if N is the number of registers needed by each warp and a CTA has M warps, then the maximum number of registers required per CTA is C = N × M. The warp scheduler keeps track of the number of registers already assigned to each CTA using a per-CTA register balance counter; a total of eight counters are needed in our baseline architecture since at most eight CTAs can be concurrently executed in an SM. If k_i is the number of registers assigned to CTA_i at a given time, then counter_i stores C − k_i, the registers that may still be needed in the worst case by CTA_i. Before the warp scheduler selects a warp it checks the number of available physical registers. If the number of available physical registers is greater than the minimum of all the C − k_i counter values, it allows the warp to continue. Otherwise, the scheduler recognizes that the available physical registers may be too few to enable even one CTA to complete its execution. In this case the scheduler simply picks the CTA with the minimum register balance counter (arbitrarily breaking ties) and allows only warps from that CTA to execute; as registers are released by this CTA, new CTAs can again start issuing. Intuitively, the assumption is that a CTA that has already acquired most of its registers will finish soon, or has an opportunity to release more registers than the other CTAs.
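As a small worked example (the numbers are hypothetical, not taken from any benchmark): if each warp of a CTA needs N = 40 registers and the CTA has M = 8 warps, then C = 40 × 8 = 320. When that CTA currently holds k_i = 260 registers, its balance counter reads C − k_i = 60. New register allocations are granted only while the count of free physical registers stays above the minimum of these counters across the resident CTAs; once it drops to that minimum, only warps from the CTA with the smallest counter are allowed to issue.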
The new warp scheduler is specified in Algorithm 1. When a warp allocates a new register, the function ALLOCATE_REGISTER is called with the associated CTA id. The function then checks, using the available register count R_avail, whether the CTA C_max that currently uses the most registers can still obtain enough registers to continue its execution. If there are not enough registers after allocating a new register for the requesting CTA, the function promotes the warps of C_max to higher priority and returns fail. Otherwise, as there are still enough registers for executing any of the active CTAs, the function returns success after decrementing the available register count R_avail and incrementing the corresponding entry of R_mapped. The actual register allocation occurs only when ALLOCATE_REGISTER returns success.

Algorithm 1 Warp scheduler for register throttling
R_avail : number of available (unallocated) registers
R_mapped[max number of CTAs per SM] : number of occupied registers per CTA
R_cta : number of registers required per CTA
procedure ALLOCATE_REGISTER(ID_cta)
    R_max ← maximum value in R_mapped
    C_max ← CTA id associated with R_max
    if (R_avail − 1) < (R_cta − R_max) then
        assign higher priority to warps in C_max
        return fail
    else
        decrement R_avail
        increment R_mapped[ID_cta]
        return success
    end if
end procedure
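To make the throttling check concrete, here is a minimal C++ sketch of the logic in Algorithm 1 as it might be written in a cycle-level simulator. The state layout, the function name allocate_register, and the way the prioritized CTA is reported are assumptions made for this illustration, not the hardware implementation or GPGPU-Sim code. The sketch also assumes, as in the text, that a single worst-case register count C applies to every CTA of the kernel.

#include <algorithm>
#include <array>
#include <cstdint>

// Per-SM state for register throttling (names are illustrative).
constexpr int kMaxCtasPerSm = 8;

struct SmRegisterState {
    uint32_t r_avail;                              // unallocated physical registers
    std::array<uint32_t, kMaxCtasPerSm> r_mapped;  // registers held by each CTA
    uint32_t r_cta;                                // worst-case registers per CTA (C = N x M, from the compiler)
};

// Returns true (success) if a warp of CTA `cta_id` may allocate one more
// register; otherwise returns false (fail) and reports which CTA the
// scheduler should prioritize so it can drain and release its registers.
bool allocate_register(SmRegisterState& sm, int cta_id, int& prioritized_cta) {
    auto max_it = std::max_element(sm.r_mapped.begin(), sm.r_mapped.end());
    uint32_t r_max = *max_it;
    int c_max = static_cast<int>(max_it - sm.r_mapped.begin());

    // Fail if, after granting one more register, the CTA holding the most
    // registers could no longer obtain its remaining worst-case demand.
    if (sm.r_avail == 0 || sm.r_avail - 1 < sm.r_cta - r_max) {
        prioritized_cta = c_max;   // warps of c_max get higher scheduling priority
        return false;
    }
    --sm.r_avail;
    ++sm.r_mapped[cta_id];
    return true;
}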
The above approach avoids deadlocks except for one extremely rare corner case. The
CTA level throttling can work effectively unless a kernel assigns only one CTA that
requires more live registers than is available in the GPU-shrink. Even though it is a
very seldom occurring case (none of our benchmark applications had such situation), it
is theoretically possible. In this worst case scenario, we rely on conventional register
spilling. We rely on the scheduler to automatically issue special spill instructions to a
system-reserved memory location when there are not enough registers while running
a large CTA. To spill registers, the warp scheduler selects warps in the pending queue. Note that the registers of a warp can be stored using coalesced memory accesses, where the registers associated with all the threads in a warp are spilled by one memory operation per architected register. While the pending warps' registers reside in memory, the active warps proceed with their execution and release as many registers as possible. Eventually, when more registers than a pending warp requires have been released, the scheduler loads the spilled registers back from memory into physical registers and restarts these warps.
3.3 Hardware Overhead
To implement the register throttling algorithm, an array of counters and one register are needed per SM. One counter per SM tracks the number of available registers (R_avail), and the number of registers allocated to each running CTA (R_mapped) is stored in a per-CTA counter. R_mapped is an array of counters with as many entries as the maximum number of CTAs per SM; in Fermi, at most eight CTAs can run concurrently, and hence R_mapped has eight entries. In addition, a register (R_cta) holds the number of registers expected to be allocated for each CTA. This value is generated by the compiler and embedded in the program binary. Since the maximum value of each counter and of the register equals the maximum number of architected registers (1024), each requires log2(1024) = 10 bits, and the total hardware overhead for a Fermi architecture is 100 bits (10 bits × (1 R_avail counter + 8 R_mapped counters + 1 R_cta register)).
3.4 Evaluation
3.4.1 Register file utilization
Figure 3.4: Fraction of live registers in the under-provisioned (50%) register file sub-arrays that are not power gated, captured during execution (X-axis: cycle, Y-axis: utilization (%); panels: (a) MatrixMul, (b) Reduction, (c) VectorAdd, (d) Blackscholes, (e) LPS, (f) BackProp, (g) Hotspot, (h) Gaussian)
We first measured the register file utilization while running GPU-Shrink with a 50%
under-provisioned register file. Figure 3.4 shows the fraction of live registers among
all the registers in the active sub-arrays (the sub-arrays that are not power gated) during
the first 10K cycles of various applications execution. We assumed that there are four
sub-arrays in the register file of an SM. During the execution, any sub-array that does
not hold any live register is power gated and excluded from the utilization calculation.
Note that this measurement is different from Figure 2.1, which shows the fraction of live
registers among the compiler reserved registers. The fraction of live registers among the
compiler reserved registers does not change even under register virtualization and GPU-
Shrink because GPU-Shrink does not change the compiler reserved register allocation.
Figure 3.5: Performance degradation when using a half-sized (64KB) register file (execution cycle increase normalized to the 128KB RF, in %, for GPU-shrink and compiler spill across the benchmarks; GPU-shrink averages 0.58% and compiler spill averages 73.32%)
On the other hand, the utilization in the physical register file improves with GPU-Shrink
since physical registers are reused across warps. As shown in Figure 3.4, in most of the
applications the utilization is over 50% and even close to 100% in many cases. These
results confirm that register virtualization with GPU-Shrink consolidates live registers
to fewer physical registers.
3.4.2 Performance overhead
We used the same evaluation infrastructure as was described in Section 2.8. We imple-
mented GPU-Shrink, modifications to the warp scheduler and the above described
microarchitecture changes within GPGPU-Sim v3.2.1 [44]. To verify the practicabil-
ity of register file under-provisioning, we first measured the performance impact of our
proposed GPU-shrink on a half sized (64KB) register file compared to Fermi baseline
that uses 128KB register file per SM. Note that it is also possible to simply recompile
an application forcing the compiler to just use half the number of available registers
and whenever the compiler needs more registers it has to rely on spilling some regis-
ters to memory and later filling them back. We compared GPU-shrink with a simple
compiler-enforced register file size reduction mechanism. In our baseline configuration,
the applications are already maximally optimized such that the minimum number of
registers are used that do not cause any register spill under 128KB register file. There-
fore, for this comparison, some applications are recompiled to use no more than 64KB
registers. Figure 3.5 shows the total execution cycle increase normalized by 128KB reg-
ister file configuration when GPU-shrink and the compiler enforced register file shrink
(denoted as Compiler spill) are used. Four among 16 benchmarks do not need any
throttling because their register usage does not exceed 64KB. Thus, those applications
(VectorAdd, BFS, Gaussian, and LIB) had zero performance overhead. In the other
applications, GPU-shrink achieves much better performance than simply relying on the
compiler to force spills/fills. By releasing and reusing dead registers the effective regis-
ter file size is significantly enhanced thereby leading to minimal performance overhead.
In some applications, the performance is even enhanced when using GPU-shrink; MUM
had significant improvement. Further analysis showed that this unexpected behavior is
because GPU-shrink dispersed memory contention by throttling some warps leading to
performance improvements in those applications. Overall, GPU-shrink suffered 0.58%
performance overhead on average, while compiler spill approach suffered 73% increase
in execution time.
We also evaluated GPU-shrink-40% and GPU-shrink-30%, which use 40% and 30% smaller register files, respectively. Since our 50% shrink incurs nearly zero performance overhead, the additional registers available in these two configurations did not have any impact on the execution latency.
3.4.3 Energy savings
Figure 3.6 shows the register file energy breakdown of three different design options,
normalized to 128KB register file that does not use any register virtualization. Dynamic
and Static are the register file's dynamic and static energy consumption. Renaming Table is the additional energy consumed by the renaming table. Flag Instruction includes the energy consumed by fetching and decoding release flag instructions and by the release flag instruction cache. The fetch/decode energy is measured by GPUWattch. The
first bar labeled 128KB RF w/ PG shows total register file energy consumption when the
register file uses sub-array level power gating after applying register virtualization. This
bar essentially shows the energy reduction when we use register virtualization just to
reduce the static power but do not alter the physical register file structure. The second
bar (64KB RF) shows the energy savings by cutting the register file into half while using
virtualization. By halving the register file size, both dynamic and static power can be
reduced even without power gating and hence, the average energy saving is even greater
than with the full-size register file with power gating. However, some applications, such as VectorAdd, LUD, Gaussian, and LIB, spend the vast majority of their execution time on code segments that have very few live registers. Thus, static power savings are significant
when using power gating (as shown in 64KB RF w/ PG). Reducing the register file size
without any power gating actually leads to a small increase in the energy consumption
compared to power gating a 128KB register file. But when sub-array level power gating
is applied on top of the GPU-shrink, as plotted in the third bar, the overall energy savings
increase across all the benchmarks. On average, the GPU-shrink with register under-
provisioning and sub-array power gating saved a total of 42% register file energy.
Figure 3.6: Total register file energy breakdown (Static, Dynamic, Renaming Table, and Flag Instruction energy for the 128KB RF w/ PG, 64KB (50%) RF, and 64KB (50%) RF w/ PG configurations, normalized to the 128KB RF, for each benchmark and the average)
3.5 Related Work
In Section 2.9 we already described the related work relevant to the register file virtual-
ization. In this section we present related work relevant to warp scheduling.
Warp scheduling for better memory performance: Several warp scheduling algo-
rithms have been proposed for better memory performance. Jog et al. [62] proposed a prefetch-aware scheduling policy that schedules consecutive warps in different scheduling groups so that the warps in one scheduling group can prefetch data for the logically consecutive warps scheduled in other groups. By distributing consecutive warps that are likely to access nearby addresses, the proposed scheduling algorithm also achieves better bank-level parallelism. Rogers et al. [12] proposed a warp scheduler that maximizes cache locality, thereby improving the cache hit ratio. They observed that a round-robin warp scheduler is likely to thrash the L1D cache by scheduling many ready warps back to back. They proposed to prioritize warps that are likely to hit in the L1D cache. To do so, they designed a locality scoring system that tracks each warp's locality information in the L1D. Then, if data is expected to be evicted soon, they prioritize the associated warp so that it can consume the data before the data is evicted.
Preemptive warp scheduling for GPU multiprogramming: There have been a cou-
ple of studies that proposed priority-based warp schedulers. Preemptive warp scheduling proposed for GPUs mainly prioritizes critical kernels over long-latency kernels so that a long-latency GPU kernel can yield resources to other, more critical kernels. The main challenge of preemptive scheduling is the high overhead of context switching. Tanasic et al. [63] explored two preemptive mechanisms: classic context switching and draining. Classic context switching stops the running GPU cores and saves the current context to memory so that a new kernel can be assigned to the cores. Draining stops issuing new work to the cores and waits until the currently running work drains from them. They also proposed a scheduling algorithm that enables different kernels to share GPU cores. Park et al. [64] added another preemptive scheduling mechanism named flushing. Flushing detects idempotent kernels, which generate exactly the same result regardless of how many times the kernel is executed, and drops an idempotent kernel's execution to yield the SM's resources to an urgent kernel. Later, the dropped idempotent kernel is re-executed from the beginning on either the same SM or a different SM. When a kernel is dropped, its context is not saved or restored.
These schedulers prioritize some warps or kernels over others for better cache locality, prefetching effectiveness, or fair real-time computing. Our scheduler prioritizes CTAs so that applications using more architected registers than the provided physical registers can run without deadlock. Unlike much of the prior art, which was proposed to enhance performance, our scheduler relies on register file virtualization at its base and builds a novel solution to tolerate the reduced register resources without compromising performance, which results in better power efficiency.
3.6 Chapter Summary
This chapter proposes GPU-Shrink, a GPU design that is provisioned with fewer than the architected number of registers. GPU-Shrink is an aggressive register power reduction mechanism. Using register virtualization, which effectively reduces register demand by proactively releasing dead registers, GPU-Shrink shows that applications can run on a half-sized register file with negligible performance overhead. For applications that require a larger register file than the under-provisioned register file even under register virtualization, we propose a simple priority-based warp scheduler that handles the register shortage without deadlock. Using a smaller register file by itself effectively reduces dynamic and static power consumption. By applying simple sub-array-level power gating on top, an average of 42% of the total register file energy is saved with almost zero performance overhead.
Chapter 4
Warped-DMR: Underutilization
Exploitation for Execution Units
4.1 Introduction
GPUs now run business-critical applications, long running scientific codes, and finan-
cial software. These new application domains demand strict program correctness [4].
A few erroneous computations or a corrupt value could have severe negative repercus-
sions. CMOS technology scaling, while improving transistor density, is also leading to a significant number of reliability concerns. GPUs are vulnerable to soft and hard errors, and the vulnerability is predicted to grow exponentially [65]. The primary focus of GPU design has been to increase parallel performance; reliability has been considered a secondary issue in GPU computation since traditional graphics applications
have been shown to be inherently fault tolerant [4]. However, to support business crit-
ical application domains on GPUs, there is a need to provide architectural support for
at least error detection. As a first step, error detection can translate the most harmful
silent data corruption (SDCs) errors to detectable but unrecoverable errors (DUEs). In
fact, commercial GPU designers have already started addressing reliability concerns.
Recently NVIDIA’s Fermi GPU added ECC for the memory components [7]. Our own
72
work on die-stacked DRAM integration also explored low-cost approaches to protecting
memory against various multi-bit errors [66]
GPUs have hundreds of hardware thread contexts today and in the near future they will
have thousands of contexts. As explained in Chapter 1, 32 threads are bundled into a
warp. Every fetch cycle a warp fetches a single instruction but computes on 32 different
input operands using multiple execution lanes. Hence the vast majority of the chip area
is dedicated to execution units, such as ALUs. In the presence of hundreds (or even thou-
sands) of thread contexts, even a tiny probability of a logic error in each thread context
adds up to an exponentially high probability of errors at the chip level. Recognizing this
concern, several researchers have been focusing on improving reliability of GPU com-
putation. Prior to the exploration of work presented in this chapter, most of the related
work focused on software approaches [65] [67]. Software approaches can be more flexible but demand that programmers re-write their applications with a focus on fault tolerance, which is quite undesirable given that writing a GPU application itself is non-trivial [68]. Compilers could reduce some of the burden on the programmer by automatically providing redundant code execution [67]. However, the error coverage is limited by the granularity of the compiler's code insertion and suffers from the hidden error problem: even though each line of code is executed twice for verification purposes, if the two instances are executed on the same processing core, some hardware defects, such as stuck-at faults,
cannot be detected. Note that software does not decide which thread is mapped to which
core in current GPU architectures. Furthermore, in software approaches the results from
redundant execution are mostly compared at the end of the program execution. Hence
faults are likely to be discovered too late to take quick corrective action.
In this chapter we propose a low-overhead hardware approach for detecting compu-
tation errors in GPUs. The approach uses dual modular redundancy (DMR) [13] but
opportunistically switches between spatial and temporal redundancy to improve error
coverage while reducing the error detection overhead. We call this approach Warped-
DMR. In this chapter, we assume that only execution units are vulnerable. Memory is
assumed to be protected by ECC, as is done in many industrial GPU designs [7]. Hence,
for the memory operations, we only verify the address computations and assume that
the loaded data is always error free.
4.1.1 Exploiting opportunity
Figure 4.1: Execution time breakdown with respect to the number of active threads (1 to 32) per warp
Before presenting the details of Warped-DMR, we first present motivating data high-
lighting the prevalence of underutilization of execution resources in a GPU. Figure 4.1
shows the execution time breakdown in terms of the number of active threads for a
subset of benchmarks selected from NVIDIA CUDA SDK [45], Parboil Benchmark
Suite [46], and ERCBench [69]. These results were generated by simulating a modern
NVIDIA-style GPU architecture using GPGPU-Sim [44]. More details of the simulated
architecture and benchmarks are provided later in Section 4.6. As detailed in Chapter
1, an NVIDIA GPU executes instructions in batches of threads called warps. A warp consists of 32 threads. Each color in the bar chart denotes the fraction of cycles during which the corresponding number of threads is actively executing the code. As can be seen, the majority of applications do not have 32 active threads all the time. For example, over 40% of BFS instructions are executed by only a single thread. In other words, the remaining 31 threads within the warp are idle while the single active thread is executing instructions. The processing cores associated with the idle threads remain unused during the
idle period. The reasons for this underutilization are described later in Section 4.2.1.
The underutilization of GPU resources provides us an opportunity to provide error detec-
tion capability by exploiting the underutilized resources without incurring performance
overheads. In this chapter we present Warped-DMR, which consists of two techniques
for error detection.
(1) Intra-warp DMR: Our first error detection method is called intra-warp DMR. Intra-
warp DMR simply uses the inactive threads to verify execution of active threads in the
same warp by using dual modular spatial redundancy. The underutilized or idle cores
are used as computational checkers for a subset of active threads. Hence the overhead of
intra-warp DMR is nearly zero, with the exception of a negligible area overhead needed
to compare the computation results and duplicate the input data.
(2) Inter-warp DMR: The second error detection approach is called inter-warp DMR.
When a warp is fully utilized, all 32 threads are active in a warp, and hence there is no
opportunity for intra-warp DMR. In this scenario we use dual modular temporal redun-
dancy. A duplicated execution of each fully utilized warp is scheduled for execution
later whenever the associated execution unit becomes idle. A special-purpose Replay Queue (ReplayQ) is used to buffer the duplicate execution warps.
We also shuffle the execution of redundant threads onto different cores, compared to
the original thread-to-core assignment, to reduce the hidden error problem. Inter-warp
DMR when combined with the ReplayQ mechanism reduces the need for unnecessary
stalls in the pipeline to execute redundant instructions, thus significantly lowering the
performance overhead of temporal redundancy.
Simulation results on several GPGPU applications show that intra-warp and inter-warp DMR complement each other to provide 96.43% error coverage with a 16% worst-case performance overhead.
4.2 Background
4.2.1 Underutilization of GPU resources
Underutilization of GPU’s computational resources can be due to two reasons (1) under-
utilization within homogeneous execution units and (2) underutilization among hetero-
geneous execution units.
Underutilization of homogeneous units: Underutilization within homogeneous exe-
cution units is caused by lock-step execution in GPUs. All 32 threads in a warp share a
single PC. Whenever a branch instruction is encountered, some of the threads within the
warp may take the branch while others may not, depending on the data operands. Due to the single-PC constraint, threads on the not-taken path are executed first, followed by threads on the taken path (or vice versa). While the not-taken path instructions are executed, the cores assigned to the taken-path threads sit idle. This is called the branch divergence problem.

Figure 4.2: Example of underutilization of homogeneous units and Intra-Warp DMR ((a) code: if (cond) { b++; } else { b--; } a = b; (b) execution with no divergence; (c) execution with divergence; (d) the proposed execution in which idle cores DMR the divergent paths)
To execute the divergent instructions, the GPU hardware scheduler uses an active mask, which consists of 32 bits indicating the active state of each thread within a warp. At each cycle, the threads whose active bit is set to '1' are allowed to execute the issued instruction, while the threads whose active bit is set to '0' wait. We call a thread an active thread if it has '1' in the corresponding bit of the active mask, and an inactive thread otherwise. Note that there is one active mask per warp.
A simple example of branch divergence is illustrated in Figure 4.2. Let us assume that an if-then-else statement (shown in Figure 4.2(a)) is executed by a warp of two threads. When both threads reach the conditional branch instruction and the condition is true for both threads, the two threads execute concurrently (shown in Figure 4.2(b)). However, if the two threads take different branch paths, only one thread can execute at a time (shown in Figure 4.2(c)). In this example, the utilization of the
system while executing the if-else statement becomes only 75%, since of the 8 cycle slots (2 cores × 4 cycles each), only 6 are actually used for execution. Underutilization
is even worse in real applications as shown in Figure 4.1: ranging from 7% in BFS to
up to 77% in Bitonic Sort.
Underutilization of heterogeneous units: The underutilization among heterogeneous
execution units is caused by the limitations in the scheduler feeding three different exe-
cution units. GPUs have three different types of execution units: streaming proces-
sors (SPs) where many common integer/floating point instructions are executed, LD/ST
units, and special function units (SFUs). All three different types of execution units
are fed by a single warp scheduler and an instruction dispatcher unit [7]. Hence, dur-
ing any given cycle, only one instruction can be issued to one of the three execution
units which leads to idle units. Heterogeneous unit underutilization has not been con-
sidered as severe as the underutilization caused by control divergence. However, if a
code segment executes the same type instructions in a burst fashion then the scheduler
will schedule instructions to just one type of execution unit while the rest two execution
units remain idle.
Some state-of-the-art GPUs such as NVIDIA Fermi and Kepler [16] have multiple
schedulers per streaming multiprocessor (SM). For example, Fermi has two schedulers
in an SM which can issue instructions concurrently. The two schedulers share the LD/ST units and SFUs while having their own SPs. Hence, instructions can be simultaneously issued to two of the three execution unit types if the two schedulers issue different types of operations. Even in this case there is still underutilization of heteroge-
neous units since not all three execution units are used, but the degree of underutilization
is decreased. Furthermore, due to several scheduling issues such as data dependency
among the instructions, schedulers are not likely to be able to issue instructions to all
the execution units.
4.3 Warped-DMR
Warped-DMR exploits the two types of underutilized resources to execute code redundantly and opportunistically. A different execution strategy is used for each of them.
4.3.1 Intra-warp DMR
To detect errors in the execution units, intra-warp DMR relies on a DMR execution approach. In traditional DMR there are as many verification cores as monitored cores. A verification core executes the same instruction stream as the associated monitored core, and the two execution outputs are compared. An error is detected if the results on the two cores differ. Every single instruction is thus executed twice, providing 100% error coverage. However, the area or performance overhead of DMR exceeds 100%
since at least one verification core should be added for each monitored core.
Intra-warp DMR uses the cores idled by underutilization within homogeneous execution
units to execute the code redundantly, instead of adding extra cores for verification pur-
pose. Whenever a partially utilized warp is scheduled, the operands of an active thread
within the warp are forwarded to an inactive thread. The inactive thread thus can DMR
an active thread’s execution. The execution results of the inactive thread and the active
thread are compared at the end of execution. If the two results are not identical, the hard-
ware scheduler will be notified of an error occurrence. The necessary microarchitectural support for intra-warp DMR is discussed in Section 4.4.
Since the focus of this work is to detect errors, error handling is out of the scope of this dissertation. But one can use simple techniques that allow the scheduler either to re-schedule the warp (in case of transient errors) or to stop running the program and raise an exception to the system (in case of a permanent fault).
4.3.2 Inter-warp DMR
Figure 4.3: Example of underutilization of heterogeneous units and Inter-Warp DMR ((a) a code sequence interleaving ld.shared.f32 instructions from warp1/warp3, add.f32 instructions from warp2, and a sin.f32 from warp4; (b) normal execution on the SPs, LD/ST units, and SFUs; (c) execution with inter-warp DMR, where each instruction is redundantly executed on its unit while a different unit type is busy)
Intra-warp DMR is an opportunistic approach that exploits idle cores. But when a warp
utilizes all the cores, intra-warp DMR is unable to provide error detection coverage.
To handle this case we present the second error detection method, called inter-warp
DMR. Inter-warp DMR exploits resource underutilization across heterogeneous execution units. As mentioned earlier, the NVIDIA GPU uses SPs for arithmetic operations, LD/ST units for memory instructions, and SFUs for complex GPU operations such as sine and cosine. In any given cycle the instruction issue logic issues instructions to only
one of the three execution units. Hence, when an instruction is issued to SFUs or LD/ST
units, the SPs may become idle. If different instruction types are issued in an interleaved
manner, then each instruction's verification is done in the cycle following its original execution. For instance, if an arithmetic instruction is followed by an LD/ST instruction,
then the arithmetic instruction will be redundantly executed in the next cycle on SPs
when the primary LD/ST instruction is being executed.
Figure 4.3 shows a simplified execution of a code segment that has several interleaved add and load instructions. As the two instructions use different types of execution units (add uses the SPs and load executes on the LD/ST units), whenever an instruction is issued onto its corresponding execution units, the other types of execution units become idle. If both units take only one cycle to execute, inter-warp DMR allows the add and load instructions to be DMRed one cycle after the original execution cycle on the associated execution units, without stealing many cycles from regular program execution. Figure 4.3(c) depicts the operation of inter-warp DMR, which adds only one extra cycle at the end of the eight-cycle execution. Inter-warp DMR does not interfere with the execution scheduling of the primary add and load instructions.
Even with inter-warp DMR, there are scenarios when it is not possible to completely
eliminate the overhead of DMR. In the example shown above, instructions that require
different types of execution units are interleaved. But when the same type of instruc-
tions are scheduled for several cycles in a row, a new microarchitectural structure called
ReplayQ is used to buffer the unverified instructions. Unverified instructions are queued
in the ReplayQ and re-executed whenever the corresponding execution unit becomes
available. Note that we do not allow an instruction to consume unverified instruc-
tion results that are still buffered in the ReplayQ. Hence, whenever there is a RAW
dependency on an unverified result, the dependent instruction is forced to wait and the
ReplayQ gives priority to verify the source instruction.
During intra-warp DMR, the original code and verification code are guaranteed to be
executed on different SIMT lanes. However, during inter-warp DMR, no such guarantee
can be provided by default since contemporary GPUs may use core affinity that assigns
a thread to the same core when redundantly executed. If an execution is DMRed on the
same core, hardware defect on the core cannot be detected. For example, if corei has
stuck-at-zero error, the result of the verification and original execution both will be 0,
which leads to a hidden error. To avoid such hidden errors, inter-warp DMR associates
a verification thread to a different SIMT lane than the original SIMT lane. We call this
approach Lane Shuffling. Lane shuffling is operated within a SIMT cluster to minimize
the wiring overhead. The microarchitectural enhancements for inter-warp DMR are
discussed in Section 4.4.
4.3.3 Error coverage
The theoretical error checking coverage of intra-warp DMR is 100% when the number of active threads is less than half of the warp size. In this scenario every active thread's execution can be verified by at least one of the inactive threads. If the active thread count is greater than half of the warp size, the coverage is (#inactive threads × 100) / (#active threads) %.
The overhead of intra-warp DMR is almost zero as verification is done on the existing
idle cores concurrently with the active threads. Only minimal hardware logic is added
for register forwarding and results comparison.
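As a quick worked example of the coverage formula above (the thread counts are illustrative): in a 32-thread warp with 20 active and 12 inactive threads, intra-warp DMR can verify (12 × 100) / 20 = 60% of the active threads' executions in that warp; the rest would be covered only if inter-warp DMR is also applied, as in the enhanced scheme of Section 4.5.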
The theoretical error checking coverage of inter-warp DMR is 100% as each fully occu-
pied warp’s execution is re-executed a few cycles later. The best case execution overhead
of inter-warp DMR is zero as the redundant execution is done only when the correspond-
ing execution unit is idle. In reality, due to the capacity of the ReplayQ and the unbal-
anced instruction distribution (see Figure 4.4 for the instruction type distribution), there will
be some overhead. Our results show that the worst case overhead is 8%, which is well
below the theoretical DMR overhead of 100%.
4.3.4 Advantages of Warped-DMR
Warped-DMR verifies the computations at the individual execution unit level (i.e., an SP). The DMR could instead be done at a coarser granularity, such as at the SM level by duplicating a thread block onto two different SMs, or at the chip level by invoking two copies of a kernel function onto two GPUs. (A thread block in NVIDIA CUDA programming is a logical partition of a program; a thread can communicate with other threads only when they are in the same thread block, and a thread block is launched onto an SM.) The coarser method might be simpler to implement.
Figure 4.4: Execution time breakdown with respect to the instruction type (SP, SFU, LD/ST) for each benchmark
However, Warped-DMR allows for more aggressive error detection. For example, when
there is a faulty SP, an SM-level or chip-level error check cannot isolate which core has the defect. Hence, the only option to fix the problem is to disable the entire SM even though the remaining 31 SPs in the SM, as well as the other logic blocks including the scheduler, dispatcher, and local memory, are fault-free. Similarly, when using chip-level checking, one has to disable an entire GPU chip even with just one failed SP. With Warped-DMR we can monitor reliability at the granularity of an SP. In the previous examples, we can still use the SM even though an SP has a defect by using a core re-routing approach as proposed by Zhang et al. [70], and in more recent works [71, 72].
It is also worth noting that the static power consumption of GPUs is nearly 60% of
the total power consumption. To reduce static power consumption Hong and Kim [73]
showed that it is best to provide power gating at the SM level. Idle SM periods can be
long and hence they can be completely turned off. But providing power gating at the
SP level requires fine-grain sleep transistors for each SP and enhanced warp scheduling
policies as proposed by Abdel-Majeed et al. [74] and Xu and Annavaram [75]. When SP idleness is finely interspersed with periods of activity, the latency of power gating outweighs the benefits of turning off idle SPs. Warped-DMR is thus an ideal choice for repurposing idle SPs to provide reliability when power gating idle SPs is not feasible or when reliability concerns outweigh other considerations.

Table 4.1: Priority table of RFU MUXs
Priority   MUX0   MUX1   MUX2   MUX3
1st        0      1      2      3
2nd        1      0      3      2
3rd        2      3      0      1
4th        3      2      1      0
4.4 Architectural Support for Warped-DMR
4.4.1 Register forwarding unit
Figure 4.5: Register Forwarding Unit and Comparator for Intra-Warp DMR (a 128-bit register bank entry is split into four 32-bit per-thread registers and routed through the Register Forwarding Unit according to the active mask, so that inactive lanes re-execute an active lane's operands; a comparator at write-back checks the duplicated results and raises an error on mismatch)
For this work we assume the baseline GPU is similar to the Fermi architecture as described in Chapter 1. The 32 SIMT lanes are implemented as eight clusters of four SIMT lanes each. We restrict the DMR to be performed within each cluster. Thus each active lane
looks for an inactive lane within its own cluster. For intra-warp DMR each inactive
thread that is going to verify the computation should be able to either access an active
thread’s register file or get the active thread’s register values using data forwarding. As
adding an extra port to the register file is expensive, we added a Register Forwarding Unit (RFU) at the end of each register bank. The RFU consists of four 4-input, 32-bit MUXs, as can be seen in Figure 4.5. A 128-bit entry of a register bank is divided into four 32-bit values and forwarded to all four MUXs.
To enable intra-warp DMR the four MUXs in the RFU pair active threads with inactive
threads based on a priority. The priority configuration of the 4 MUXs within a SIMT
cluster is shown in Table 4.1. Each column indicates the priority ordering for each MUX.
As a first priority every MUX delivers the input data to its associated SIMT lane if that
SIMT lane’s active mask is set. MUX0 provides input data to SIMT lane 0, MUX1 to
SIMT lane 1 and so on. If a SIMT lane’s active mask is reset, then that SIMT lane is
idle and can be used for DMR. Hence, every idle SIMT lane looks for an active SIMT
lane whose computation can be redundantly executed on the idle SIMT lane. To find an
active SIMT lane which can be redundantly executed on an idle SIMT lane, each MUX
looks for the active SIMT lane according to the priority listed in the table. For instance,
if SIMT lane 0 is idle, then MUX0 looks at SIMT lane 1 to see if it is active as lane 1
is the 2nd priority for MUX0. If so, then the inputs from SIMT lane 1 are then simply
directed by MUX0 to run on SIMT lane 0. If SIMT lane 1 is also inactive then SIMT
lane 2 active mask bit is checked followed by SIMT lane 3 active mask to find an active
thread. As can be seen from Table 4.1, each MUX runs through a different priority
sequence to allow uniform pairing possibilities between active and idle SIMT lanes. In
this algorithm, if there is only one active lane, the lane is redundantly executed on the remaining three idle lanes, which results in more than dual modular redundancy. We simply allow such a scenario to occur rather than add additional hardware logic in the MUX to prevent it, since it does not lower the error coverage.
In Figure 4.5, the bold lines inside the RFU illustrate a simple example of intra-warp DMR when the active mask for an instruction is 4'b0011. As each bit of the active mask indicates the corresponding thread's activeness, 4'b0011 means that threads 0 and 1 are active and threads 2 and 3 are inactive for the instruction. Threads 2 and 3 will perform DMR for the execution of threads 0 and 1, according to the intra-warp DMR assignment in Table 4.1.
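The pairing performed by the RFU MUXs can be summarized by the following C++ sketch, which walks each MUX's priority row from Table 4.1 and selects the lane whose operands it forwards. The function name and data layout are illustrative only; the real RFU is a small combinational circuit, not software.

#include <array>
#include <optional>

constexpr int kLanes = 4;
// Priority order of candidate lanes for each MUX, as listed in Table 4.1.
constexpr int kPriority[kLanes][kLanes] = {
    {0, 1, 2, 3},   // MUX0
    {1, 0, 3, 2},   // MUX1
    {2, 3, 0, 1},   // MUX2
    {3, 2, 1, 0},   // MUX3
};

// For each lane of a SIMT cluster, return the lane whose operands its MUX
// forwards: an active lane keeps its own operands, while an idle lane picks
// the first active lane in its priority order and DMRs that lane's execution.
std::array<std::optional<int>, kLanes>
pair_lanes(const std::array<bool, kLanes>& active) {
    std::array<std::optional<int>, kLanes> source{};
    for (int mux = 0; mux < kLanes; ++mux) {
        for (int cand : kPriority[mux]) {
            if (active[cand]) { source[mux] = cand; break; }
        }
        // If no lane in the cluster is active, the lane simply stays idle.
    }
    return source;
}

// Example: with active mask 4'b0011 (threads 0 and 1 active), lanes 2 and 3
// end up re-executing threads 0 and 1 respectively, matching the text above.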
We implemented an RFU design and a 128-bit comparator using Synopsys Design Compiler v.Y-2006.06-SP4 [76]. The respective area overheads are 390 µm² and 622 µm², and the timing overheads are 0.08 ns and 0.068 ns. The timing overhead of the MUX is thus less than 0.06% compared to a typical cycle period (1.25 ns) of a GPU built in 40nm technology with an 800 MHz core clock [77].
4.4.2 Thread-Core mapping
Since the register forwarding is limited to within a SIMT cluster of just four SPs, in intra-
warp DMR the verification and monitored core are restricted to be within the same SIMT
cluster. This limitation minimizes wire delays and complex routing paths. With this
mapping restriction, however, some SIMT clusters might not be able to use intra-warp
DMR if all the SIMT lanes within a cluster are fully utilized, even when some SIMT
lanes across clusters are idle. Our empirical observations showed that many applications
are likely to have unbalanced active thread distribution within a warp.
Figure 4.6: ReplayQ and Replay Checker for Inter-Warp DMR (pipeline stages FETCH (1 cycle), DEC/SCHED (1–2 cycles), RF (3 cycles), and EXE (1–4 cycles) over the SP cores, LD/ST units, and SFU; the Replay Checker sits alongside the RF stage and manages a ReplayQ whose entries hold op, dest, src1, src2, and src3)
To improve the availability of idle SPs within a SIMT cluster, we modified the thread to
core affinity scheduling algorithm. There is only sparse documentation on how current
GPUs assign threads to SPs within a warp. It is believed that the threads are mapped to
cores in order: for example, thread 0 is always executed on core 0, thread 1 is mapped to
core 1 and so on. Our modified scheduling algorithm assigns threads to SIMT clusters
in a round-robin fashion. Thus thread 0 is assigned to cluster 0, thread 1 is assigned
cluster 1 and so on. We show later in our results section that this simple scheduler
change increased error detection opportunities by 9.6% compared to the default in order
mapping of threads to core.
4.4.3 ReplayQ
As discussed earlier, inter-warp DMR relies on re-executing an instruction at a later
time if the corresponding execution units are not free. Instead of stalling program exe-
cution, inter-warp DMR buffers unverified instructions into a ReplayQ whenever the
corresponding execution unit is not available. A Replay Checker engine is designed to
manage the ReplayQ. There is one Replay Checker and ReplayQ per SM.
Figure 4.6 shows how the Replay Checker works within the context of current GPU
pipeline. A GPU pipeline in the figure has the following stages: instruction fetch stage
(FETCH), decode & schedule stage (DEC/SCHED), register fetch stage (RF), and exe-
cution stage (EXE). The write back stage (WB) is omitted in the figure for simplicity.
The latency of each pipeline stage is also shown in the figure. The stages having multi-
ple cycle latency consist of multiple sub-stages. For example, RF comprises RF0, RF1, and RF2. These latencies reflect the pipeline latencies of the current GPUs that we modeled [8].
If the active mask of the instruction in the first RF stage is all active, the Replay Checker is activated. If the active mask has some idle slots, intra-warp DMR will verify the warp's execution; during intra-warp DMR execution the Replay Checker and ReplayQ play no role in managing the warp's redundant execution. Once the Replay Checker is active, it compares the instruction type of the warp in RF (②) with that of the warp in the DEC/SCHED stage (①). If the instruction types differ, the Replay Checker creates a DMR copy of the RF instruction to be co-executed with the instruction in DEC/SCHED (④). The DMR copy consists of the values of the input operands and the opcode. Note that even though an instruction takes several cycles to execute, the next instruction can be issued to the execution unit in the following cycle because the EXE stage itself is super-pipelined. If the RF and DEC/SCHED instruction types are the same, the instruction type in the RF stage (a two-bit value indicating an SP, LD/ST, or SFU instruction) is compared against all the queued entries in the ReplayQ (③). The instruction decoder has already marked each instruction, based on its execution resource demand, as an SP, LD/ST, or SFU instruction. In our experiments the maximum size of the ReplayQ is 10 entries; hence, ten 2-bit XORs are used for this comparison. If any instruction in the ReplayQ has a different type than the instruction in RF, the Replay Checker dequeues that instruction and pairs it with the instruction in the RF stage (④) for co-execution in the next cycle. When multiple ReplayQ instructions are available for co-execution, one instruction is picked at random. The instruction in RF is then enqueued into the ReplayQ. When an instruction in RF is enqueued into the ReplayQ, it simply implies that the instruction one cycle behind RF is going to use the same type of execution units as the instruction in RF. Hence, there will be no opportunity to verify the RF instruction in the cycle following its execution cycle, and that instruction needs to be buffered for future verification.
Also to distinguish a DMR execution from an original execution, a single dmr bit is
added. dmr is set by Replay Checker when creating a DMR copy. By using this value,
RFU can apply the lane shuffling only to the fully utilized DMR executions.
If there is no instruction in the ReplayQ whose instruction type differs from that of the instruction in the RF stage, the Replay Checker checks whether the ReplayQ is full. If the ReplayQ has empty slots, the RF instruction is enqueued into the ReplayQ (③). If the ReplayQ is full, a stall cycle is inserted into the pipeline immediately after the instruction in RF finishes the first EXE stage, and the instruction is then re-executed using the operand values that are still available in the pipeline. This eager re-execution reduces unnecessary register reads but adds one cycle of performance penalty. Note that this penalty applies only in the rare case that the ReplayQ contains no instruction of a different type than the RF stage instruction and the ReplayQ is full.
Whenever a new instruction scheduled into the pipeline is going to consume (RAW dependency) data produced by an unverified instruction that is still buffered in the ReplayQ, the Replay Checker stalls the pipeline and executes the verification of the source instruction before allowing the consumer instruction to execute. Algorithm 2 shows the pseudo code of the Inter-Warp DMR with the ReplayQ.

Algorithm 2 Inter-Warp DMR with ReplayQ
i_rf ← instruction in RF stage
i_dec ← instruction in DEC/SCHED stage
op_rf ← instruction type of i_rf
op_dec ← instruction type of i_dec
if op_rf ≠ op_dec then
    co-execute DMR of i_rf with i_dec's execution
else
    if ∃ i_rq ∈ ReplayQ such that instruction type of i_rq ≠ op_rf then
        dequeue i_rq from ReplayQ
        enqueue i_rf to ReplayQ
        co-execute DMR of i_rq with i_rf's execution
    else
        if ReplayQ is full then
            insert a stall cycle
            DMR i_rf one cycle after the original execution
        else
            enqueue i_rf to ReplayQ
        end if
    end if
end if
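As a concrete illustration of Algorithm 2, the sketch below expresses the same decision logic in C++ as it might appear in a cycle-level model. The type names, the fixed 10-entry capacity, and the pipeline hooks (co_execute_dmr, insert_stall, dmr_next_cycle) are assumptions made for this sketch, not the actual GPGPU-Sim implementation; where the algorithm picks a random matching ReplayQ entry, the sketch simply takes the first match.

#include <deque>
#include <cstddef>

// Instruction classes corresponding to the three execution unit types.
enum class UnitType { SP, LDST, SFU };

// A buffered instruction: in the real design each entry also keeps the
// source operand values and the original execution result (see Section 4.4.4).
struct Instr { UnitType type; };

struct ReplayChecker {
    std::deque<Instr> replayq;
    std::size_t capacity = 10;   // 10-entry ReplayQ used in the evaluation

    // Called once per issue cycle with the instructions in RF and DEC/SCHED.
    void step(const Instr& i_rf, const Instr& i_dec) {
        if (i_rf.type != i_dec.type) {
            // Different unit types: the RF instruction's unit is free next
            // cycle, so its DMR copy rides along with the DEC/SCHED instruction.
            co_execute_dmr(i_rf, i_dec);
        } else if (auto it = find_other_type(i_rf.type); it != replayq.end()) {
            // A buffered instruction of a different type can be verified now.
            Instr pending = *it;
            replayq.erase(it);
            replayq.push_back(i_rf);        // buffer i_rf for later verification
            co_execute_dmr(pending, i_rf);
        } else if (replayq.size() == capacity) {
            insert_stall();                 // rare case: eager re-execution
            dmr_next_cycle(i_rf);
        } else {
            replayq.push_back(i_rf);
        }
    }

    // First entry whose type differs from t (the text picks one at random).
    std::deque<Instr>::iterator find_other_type(UnitType t) {
        for (auto it = replayq.begin(); it != replayq.end(); ++it)
            if (it->type != t) return it;
        return replayq.end();
    }

    // Placeholder hooks standing in for the pipeline behavior.
    void co_execute_dmr(const Instr&, const Instr&) {}
    void insert_stall() {}
    void dmr_next_cycle(const Instr&) {}
};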
Figure 4.7: Two key factors that determine the effective ReplayQ size ((a) average instruction type switching distances for SP, SFU, and LD/ST instructions; (b) RAW dependency distances of the registers of warp0 thread 32, shown on a logarithmic scale per benchmark)
4.4.4 Effective size of ReplayQ
Since ReplayQ buffers instructions only when there is no available resource, it is critical
to quantify how often such a scenario occurs in GPUs. ReplayQ also stalls the pipeline
whenever there is RAW dependency on an unverified instruction.
Figure 4.7(a) shows the average cycle distance before an instruction type is switched to
another. In most of the applications, normally less than 6 instructions of the same type
are consecutively issued. gaussian has longer distances between different instruction
types but it is also bounded to a maximum of 20. Hence, the ReplayQ only needs to
buffer 20 instructions in the worst case, but an average size of 6 will suffice for most
applications.
Figure 4.7(b) shows the number of cycles between when a register is written to the time
when that register is read by another instruction. In this figure we only show the RAW
dependency distance for warp0 thread 32. But the data is quite similar for all warps and
all threads within each warp. The RAW dependency distance is at least 8 cycles and
almost half of the registers have greater than 100 cycles of distance and some have even
longer than 1000 cycles of distance. Hence, the RAW dependency related pipeline stalls
are likely to be just a few in Warped-DMR.
Each entry of the ReplayQ should maintain the opcode and the original execution result as well as the source register values. The original execution result is kept for verifying the original execution. Each SM has one ReplayQ which covers all the SIMT clusters. Each entry of the ReplayQ contains 32 lanes × 3 operands (each instruction can have up to 3 operands) × 4 bytes for the source register values, 32 lanes × 4 bytes for the original execution result, and 2–4 bytes for the opcode, for a total of 514–516 bytes. Therefore, the ReplayQ size with 10 entries is around 5KB. Thus the ReplayQ is only about 4% of a 128KB register file [8].
4.5 Enhancing Warped-DMR for 100% Error Detection
As discussed in Section 4.3.3, intra-warp DMR verifies active threads' execution in a warp using the available inactive threads in the same warp. Therefore, the error coverage might be lower than 100% when there is any warp that has more active threads than inactive threads. For systems that require strict 100% error coverage, we also propose an enhanced Warped-DMR. As inter-warp DMR guarantees 100% error coverage, enhanced Warped-DMR runs inter-warp DMR to verify warps that have an insufficient number of inactive threads, even though those warps are not fully utilized.
In enhanced Warped-DMR, inter-warp DMR is triggered to verify an underutilized warp whenever any SIMT cluster has more than two active threads. To check the active thread count, a four-input bit adder is added to each SIMT cluster. When a new active mask is given to a SIMT cluster, the four active mask bits are fed to the adder to check whether there are more than two active threads for the instruction's execution. If the adder output is greater than two, inter-warp DMR is triggered.
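A one-line software equivalent of that per-cluster check (the function name is illustrative; in hardware this is the small adder described above) might look like:

#include <bitset>

// True when a 4-lane SIMT cluster has more than two active threads, in which
// case enhanced Warped-DMR falls back to inter-warp DMR for the warp.
bool needs_interwarp_dmr(unsigned cluster_active_mask) {
    return std::bitset<4>(cluster_active_mask & 0xFu).count() > 2;
}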
While inter-warp DMR guarantees 100% error coverage, it might cause performance overhead when the ReplayQ size is insufficient. Recall that the pipeline must be stalled to verify an unverified instruction when the ReplayQ becomes full. However, according to our evaluation, the overall performance overhead increases only by 3% when inter-warp DMR is also used to verify underutilized warps' execution to achieve 100% error coverage. The details of this evaluation are described in Section 4.6.3.
4.6 Evaluation
4.6.1 Settings and workloads
We used GPGPU-Sim v3.2.1 [44] to evaluate the proposed Warped-DMR approach. The
simulation environment is described in Table 4.2 and the simulation parameters are set as listed in Table 4.3. The simulation parameters model our baseline GPU architecture as illustrated in Figure 1.4. A GPU chip has 30 SMs, and each SM comprises 32 SIMT lanes. Four SIMT lanes form a SIMT cluster, which consists of 4 register banks, 4 SPs, 4 LD/ST units, and 4 SFUs.

Table 4.2: Experimental Environment
Component   Description
OS          Ubuntu Linux kernel v2.6.38
CPU         Intel Core i7 (quad core) @ 2.67 GHz
Compiler    nvcc-2.3 / gcc-4.3.4

Table 4.3: Simulation Parameters
Parameter                         Value
Simulator Version                 GPGPU-Sim v3.2.1
Execution Model                   In-order
Execution Width                   32-wide SIMT
Warp Size                         32
# Threads/Core                    1024
Register Size                     64 KB
# Register Banks                  32
# Core(SP)s/Multiprocessor(SM)    32
# SMs                             30
Warp Scheduler                    gto
Table 4.4: Workloads
Category                          Benchmark         Parameter
Linear Algebra/Graph Processing   BFS               Input file: graph65536.txt, gridDim = 256, blockDim = 256
                                  Matrix Multiply   gridDim = 5x10, blockDim = 16x16
                                  Gaussian          Input file: Matrix4.txt
                                  vectorAdd         gridDim = 196, blockDim = 256
                                  dct8x8            gridDim = 64x64, blockDim = 8x8
                                  reduction         gridDim = 64, blockDim = 256
Pattern Recognition               Backprop          Input: 65536, gridDim = 1x4096, blockDim = 16x16
AI/Simulation                     HotSpot           Input files: temp_512, power_512
                                  MUM               Input files: NC_003997.20k.fna, NC_003997_q25bp.50k.fna
For the workloads, we used several applications from NVIDIA CUDA SDK [45], Par-
boil Benchmark Suite [46], and rodinia [47]. As mentioned earlier, our main target
applications are those needing strict accuracy such as scientific computing or finan-
cial applications. Hence, we excluded some applications that are inherently fault tol-
erant, such as graphics applications. We picked 3 categories of applications: linear
algebra/graph processing, pattern recognition, and AI/simulation. The applications that
are included in the 3 categories are listed in Table 4.4.
4.6.2 Error coverage and overhead
Figure 4.8(a) shows the percentage of executed instructions covered by Warped-DMR.
We compare three different implementations. The baseline implementation is the 4
SIMT lane cluster with no enhanced thread-core mapping. The second bar shows the
impact of increasing the cluster size to 8 SIMT lanes and allowing register forwarding
within a larger cluster size. The last bar shows the results using the enhanced thread-core
mapping described in Section 4.4.2. Warped-DMR with the enhanced thread mapping
provides an average of 93.87% error coverage, compared to 91.91% error coverage with
the more hardware-intensive 8 SIMT lane cluster. The remaining coverage gaps arise in
intra-warp DMR when the number of idle cores is smaller than the number of active
cores. For instance, BFS is covered almost exclusively by intra-warp DMR because all
of its warps are underutilized; the utilization of every warp in BFS is less than 50%, as
illustrated in Figure 4.1. Hence, every active thread's execution can be verified by
inactive threads without any ReplayQ involvement. Such applications also have negligible
(almost zero) performance overhead, as seen in Figure 4.8(b), while the error coverage
is 100%. The hotspot benchmark has the lowest error coverage, at 70%.
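To illustrate why coverage drops when idle lanes are scarce, the sketch below (our own simplification, not the simulator code) estimates the fraction of a warp's active threads that intra-warp DMR alone can verify, given only the warp's active mask.

```python
def intra_warp_coverage(active_mask):
    """Fraction of active threads in one warp that intra-warp DMR alone can
    verify, assuming each active lane must be shadowed by a distinct idle
    lane (a simplification of the cluster-level pairing in Section 4.4)."""
    active = sum(active_mask)
    idle = len(active_mask) - active
    if active == 0:
        return 1.0          # nothing to verify
    return min(idle, active) / active

# A BFS-like warp with <50% utilization: every active thread has a spare lane.
print(intra_warp_coverage([1] * 12 + [0] * 20))   # 1.0
# A warp with 24 of 32 lanes active: only 8 spare lanes, so 8/24 are covered.
print(intra_warp_coverage([1] * 24 + [0] * 8))    # 0.333...
```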
[Figure 4.8: Error coverage and overhead of Warped-DMR. (a) Error coverage (%) for the 4-lane cluster, the 8-lane cluster, and the enhanced (cross) thread-to-core mapping; the enhanced mapping averages 93.87%. (b) Kernel simulation cycles normalized to no error detection, for ReplayQ sizes of 1, 5, and 10 entries; the 10-entry average is 1.08.]
Applications that are well parallelized, like vectorAdd, MatrixMul, dct8x8, and reduction,
are mostly covered by inter-warp DMR as most of their warps are fully utilized. In such
applications, the error coverage is almost 100%, but the performance overhead is higher
than in the other applications, as shown in Figure 4.8(b). There are four bars per
benchmark in Figure 4.8(b). Each bar is normalized to the kernel execution cycles of
the base machine with zero error detection support. Using the data presented in
Section 4.4.4, we varied the ReplayQ size from 1 to 10 entries. As the ReplayQ size
increased to a maximum of just 10 entries, the average performance overhead reduced
to 8%. In some applications that are mostly covered by inter-warp DMR, such as
MatrixMul, the performance overhead with only one ReplayQ entry reaches 70%. However,
with a 10-entry ReplayQ, the overhead drops to 18%.

[Figure 4.9: Performance overhead when enhanced Warped-DMR is used with a 10-entry ReplayQ; error coverage is 100%. Execution cycles of enhanced Warped-DMR are normalized to no error detection; the average is 1.11.]
4.6.3 Enhanced Warped-DMR
We also measured the performance overhead of enhanced Warped-DMR. The enhanced
Warped-DMR guarantees 100% error coverage by using inter-warp DMR for the under-
utilized warps that have an insufficient number of inactive threads. Figure 4.9 shows
the total execution cycles taken by the benchmark applications when running enhanced
Warped-DMR with a 10-entry ReplayQ. Compared with the performance results in Fig-
ure 4.8(b), backprop and hotspot suffer longer execution times under the enhanced
Warped-DMR. These two applications could not achieve 100% error coverage under the
plain Warped-DMR, as shown in Figure 4.8(a), and the extra inter-warp DMR execution
under the enhanced scheme results in longer execution time. Interestingly, MUM's
performance did not degrade under the enhanced Warped-DMR even though its error
coverage rose to 100%. This is because 10 ReplayQ entries are sufficient to avoid
pipeline stalls during MUM's execution. Overall, the performance overhead increases by
3% under the enhanced Warped-DMR compared to the plain Warped-DMR.
4.6.4 Power consumption
To measure the energy overhead, we used GPUWattch [17]. As Warped-DMR dupli-
cates the workload for the execution units only, the energy overhead is measured at two
levels: the execution unit level and the entire device level. Figure 4.10 shows the energy
consumption measured when using Warped-DMR, normalized to the energy consumption
measured when error detection is not used. As shown in Figure 4.10, Warped-DMR
almost doubles the energy consumption at the execution unit level, as instructions must
be executed twice to be verified. However, as Warped-DMR does not stress the other
logic, such as the register file and memory, the energy overhead at the overall device
level is only 12%.

[Figure 4.10: Energy consumption at the execution unit level and the overall device level, normalized to no error detection; the device-level average is 1.12.]
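For intuition on how a near-2x execution-unit overhead maps to a much smaller device-level overhead, the back-of-the-envelope model below treats the execution units' share of total device energy as a free parameter; the share used in the example is back-derived from the measured 12% figure purely for illustration.

```python
def normalized_device_energy(exec_unit_share):
    """Device-level energy under Warped-DMR, normalized to no error detection,
    assuming only the execution-unit component doubles and everything else
    (register file, memories, interconnect) is unchanged."""
    return (1.0 - exec_unit_share) + 2.0 * exec_unit_share

# An execution-unit share of about 12% of device energy reproduces the
# ~1.12x device-level result reported in Figure 4.10.
print(normalized_device_energy(0.12))   # 1.12
```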
4.7 Related Work
In this section we describe the most relevant prior work on GPU reliability and DMR
execution. Dimitrov et al. [65] proposed three software approaches: R-Naive, R-Scatter,
and R-Thread. R-Naive simply invokes the memory APIs and the kernel function twice to create
a software-centric DMR execution within a GPU. R-Scatter tries to exploit underutilized
VLIW lanes for redundant execution by duplicating kernel code. R-Thread doubles
the thread block count for a kernel and uses the newly added thread blocks to do the
redundant execution. We compared our work with the R-Naive and R-Thread approaches
and showed that Warped-DMR significantly reduces the overhead of DMR execution.
Nathan and Sorin [78] proposed a mechanism that checks computations, control flow,
and data flow of a GPU program by inserting signature collector code. After the kernel
execution, the collected signatures are compared with statically generated signatures.
However, this approach only checks the computation after the kernel code is complete,
which can be much later than when the error was first encountered.
A more systematic approach was proposed by Yim et al. [67], which uses a guardian
process that intercepts the crash event and restarts the program using checkpoints. They
also instrumented the source code, duplicated non-loop code and inserted range check-
ing code for loops. This approach also relies on extensive software instrumentation and
large checkpoints to support redundant execution. Many of these reliability studies for
GPUs are software approaches. Software approaches are always more flexible compared
to hardware approaches. However, the error coverage can be limited by the granularity
of the programmer’s (or compiler’s) code insertion. Compared to those approaches, Warped-
DMR can check 96.43% of all instructions without any programmer’s effort.
A sampling DMR approach was proposed by Nomura et al. [79] in which DMR is
conducted only for a short period of time within each epoch rather than for the
entire execution time. Using this approach, the authors state that permanent errors can
be eventually detected even though transient errors might be missed. Warped-DMR
takes advantage of GPU-specific microarchitectural features to provide high coverage
for both transient and permanent faults.
Simultaneous and Redundantly Threaded (SRT) processor design was proposed by
Reinhardt and Mukherjee [80]. Instead of replicating hardware resources, they used
thread-level replication. The trailing thread redundantly executes the same program copy
that the leading thread executes. Hardware resources are shared between the trailing and
the leading threads. Mukherjee et al. [13] proposed a Chip-level Redundantly Threaded
(CRT) processor, which explicitly disables core-affinity to make sure that two threads
execute on different cores. This approach essentially exploits the performance advantage
of SRT’s loose synchronization as well as the high fault coverage of the lockstep execution
method [81]. Kumar and Aggarwal [82] proposed a method to reduce the performance
overhead of SRT, which prevents the trailing thread from redundantly fetching register
data. The key idea is to reuse the already fetched data for the trailing thread.
Compared to the hardware-based DMR or RMT approaches, Warped-DMR has some
domain-specific advantages. Warped-DMR checks every single instruction (although in less
than 4% of cases it checks only a subset of the inputs). This approach detects not only
permanent errors but also most transient errors. Also, unlike
[81], Warped-DMR does not use an entire core just for DMR. Instead, we utilize the
idle periods of cores for DMR.
Due to DMR’s high area overhead, some self-checking schemes have also been studied.
One of the most popular self-checking schemes is residue checking [83] [84].
Instead of duplicating entire execution units, residue checking adds residue operator
units, which require much less area than full execution units. An error in the original
operator unit is detected by comparing the residue of the original computation result
against the output of the residue operation executed on the residue operator unit.
Residue checking has a small area footprint, but it is only applicable to some simple
arithmetic operations (it cannot be used for exponent calculations [84]). Warped-DMR
can detect errors in any arithmetic operation supported on a GPU, including complex
operations implemented in an SFU.
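As a concrete illustration of the residue-checking idea (a generic textbook mod-3 scheme, not the specific designs in [83, 84]), a narrow residue unit can validate a wide multiplier as sketched below.

```python
def residue(x, m=3):
    return x % m

def checked_multiply(a, b, m=3):
    """Wide multiplication checked by a narrow mod-m residue unit.
    Valid because (a * b) mod m == ((a mod m) * (b mod m)) mod m."""
    product = a * b                                  # main (wide) operator
    check = (residue(a, m) * residue(b, m)) % m      # small residue operator
    if residue(product, m) != check:
        raise RuntimeError("fault detected in the multiplier")
    return product

print(checked_multiply(12345, 6789))   # 83810205, check passes
```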
Recently, several more studies have been conducted on reliable GPGPU design. Wad-
den et al. [85] explored compiler-level optimization to automatically convert a kernel to
redundant multithreading (RMT) and evaluated the performance and power overhead
in detail. Unlike Warped-DMR, which verifies instruction execution at the hardware
level, Wadden's approach needs compiler-level modification and optimization. Warped-
DMR has been extended by two studies [71, 72]. Warped-Shield [71] proposed a novel
hard error avoidance mechanism that uses a Warped-DMR-like scheme as an error
detector. They proposed thread and warp shuffling to reroute the instruction stream
that is supposed to be fed to a faulty core to an alternative healthy core. To detect the
defective core, they used a scheme similar to Warped-DMR. Warped-RE [72] proposed
an error correction mechanism for the GPU execution units. They used Warped-DMR
as an error detection scheme. When an error is detected, they run triple modular
redundancy (TMR) to correct the error. The proposed TMR basically has a group of
three cores run the same instruction with the same operand values and then takes the
majority result. To reduce the performance overhead, they leveraged the fact that many
threads use similar operand values. Instead of forcing all inactive threads to duplicate
the active threads' execution, they simply compare the computation results of active
threads that use the same operand values.
4.8 Chapter Summary
As GPUs play a critical role in high performance computing today, reliability should be
treated as a first class citizen alongside power and performance. In this chapter, we pro-
posed Warped-DMR, a hardware approach to detect computation errors in GPUs. We
presented the reasons for underutilization of resources in GPU applications and then
presented inter-warp and intra-warp DMR to exploit the idle resources for error detec-
tion. Intra-warp DMR checks the active threads' execution by using idle cores from
underutilized warps. For the fully utilized warps, inter-warp DMR verifies computation
by using temporal DMR whenever the corresponding execution unit becomes idle. A
simple ReplayQ microarchitecture design is used for holding instructions when the
corresponding execution unit is not idle for several cycles. To prevent an instruction
from being executed and verified on the same core, which may lead to hidden errors, we
designed register forwarding/lane shuffling logic. We presented a detailed state space
exploration and showed that Warped-DMR provides 96.43% error coverage with 16%
performance overhead.
Chapter 5
Conclusion
The graphics processing unit (GPU) is one of the most promising many-core architec-
tures for power efficient throughput computing. General purpose computing on GPUs
(GPGPU computing) is a new paradigm, which can process scientific applications that
require general purpose parallel computing capability at a massive scale as well as tradi-
tional graphics applications. The massive parallelism combined with programmability
made GPUs one of the most attractive choices in supercomputing centers. However,
despite the performance benefit, there are two critical hurdles that current GPUs must
overcome. As GPUs are now used for general purpose computing, reliability becomes a
critical concern. Especially because GPUs are forecast to incorporate an ever increasing
amount of execution resources to achieve higher throughput, even a small probability of
soft error in a single resource can lead to significantly higher failure probability at the
chip level. The second hurdle is that incorporating such massive resources also causes
power consumption to grow rapidly. In spite of some innovations introduced to tackle
power consumption, GPU’s power consumption is continuing to grow significantly.
In this dissertation, we present two mechanisms that either repurpose or even eliminate
part of the massive resources in GPUs to achieve power efficient and reliable design. As a
first step, this thesis makes the observation that not all the available resources are fully
utilized during an application’s run time. For instance, as shown in Chapter 4, nearly
32% of the execution lanes are idle during a typical benchmark execution on a GPU.
Also, as discussed in Chapter 2, the fraction of live registers among the compiler-reserved
registers rarely reaches 100% in various GPU applications. This dissertation provides
solutions to exploit the resource underutilization to improve power efficiency and/or
reliability in GPUs.
To prevent resource contention amongst thousands of threads, GPUs use a simple regis-
ter management method that basically assigns a separate set of registers to each warp.
However, this basic register management method causes several inefficiencies, such as
power overhead and imbalanced wearout. The motivational data shown in Chapter 2
indicates that not all registers are live at any given instance of time. We proposed a new
lifetime-aware register management method named register virtualization. In register
virtualization, the physical register assigned to an architected register is released
immediately after its last use. Then, the released register space is reassigned to another
warp's register. Compile-time register lifetime analysis is used to inform the hardware
when a register is dead for certain and hence can be released by one warp and assigned
to another. The enlarged spare register space is then reused for power efficiency and
reliability enhancement. Our evaluation showed that the new register management
method reduces the demand for register file size by up to 43% compared with optimally
compiled applications. The reduced live register space leads to an average of 47% static
power savings. The register assignment process can also be altered to improve
wear-leveling in register files.
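A minimal sketch of the renaming behavior that register virtualization implies is shown below; the table organization and method names are illustrative only, and the hardware details are covered in Chapter 2.

```python
class RegisterVirtualization:
    """Toy model of lifetime-aware renaming: (warp, architected register)
    pairs map to physical registers that return to a shared free pool as
    soon as the compiler-marked last read retires."""

    def __init__(self, num_phys_regs):
        self.free = list(range(num_phys_regs))
        self.mapping = {}                 # (warp_id, arch_reg) -> phys_reg

    def write(self, warp_id, arch_reg):
        key = (warp_id, arch_reg)
        if key not in self.mapping:       # allocate on first definition
            if not self.free:
                raise RuntimeError("out of physical registers")
            self.mapping[key] = self.free.pop()
        return self.mapping[key]

    def read(self, warp_id, arch_reg, last_use=False):
        key = (warp_id, arch_reg)
        phys = self.mapping[key]
        if last_use:                      # dead after this read: release early
            del self.mapping[key]
            self.free.append(phys)        # immediately reusable by any warp
        return phys
```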
Motivated by the significant register file underutilization, we proposed a more aggres-
sive approach that actually eliminates underutilized registers for better power efficiency.
The conventional belief is that a large register file is inevitable for accommodating more
parallel thread contexts and that technology scaling makes it feasible to incorporate an
ever increasing register file size. In Chapter 3, we demonstrated that the register file
size does not need to be increased to accommodate more thread contexts. By using
register virtualization, we proposed to shrink the architected register space to a smaller
physical register space. By under-provisioning the physical register file to be smaller
than the architected register file, we reduced dynamic and static power consumption. We
then developed a new register throttling mechanism, namely GPU-Shrink, to run
applications whose register usage demand exceeds the size of the under-provisioned
register file without any deadlock. Our evaluation showed that, by using GPU-Shrink and
register virtualization, applications successfully run with a half-sized register file with
negligible performance overhead and significant register file energy reduction.
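The following is a minimal sketch of the deadlock-avoidance check that GPU-Shrink adds to register allocation, under the simplifying assumption that a warp's worst-case register demand is known at allocation time; the function and parameter names are illustrative, not the actual scheduler interface.

```python
def can_allocate(warp_demand, free_regs, block_reservation, in_guaranteed_block):
    """Register-throttling check before a warp is allowed to allocate.

    Registers equal to one thread block's worst-case demand (block_reservation)
    are set aside so at least one block can always make forward progress;
    warps outside that block may only draw from the remaining pool."""
    if in_guaranteed_block:
        return warp_demand <= free_regs
    return warp_demand <= free_regs - block_reservation

# Example: 1024 free registers, 512 reserved for the guaranteed block.
print(can_allocate(600, 1024, 512, in_guaranteed_block=True))    # True
print(can_allocate(600, 1024, 512, in_guaranteed_block=False))   # False
```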
Chapter 2 and Chapter 3 explored power efficient and reliable register file design.
Besides the wearout issue that is addressed by register virtualization in Chapter 2,
conventional single error correction and double error detection (SECDED) codes have
been incorporated in most recently released GPUs to protect memory components from
soft error strikes. However, as observed by several studies, the vulnerability of
combinational logic is also increasing [86]. In Chapter 4, we proposed an error detection
method for execution units. We exploited unique architectural characteristics of
GPUs to propose a light-weight error detection method, called Warped Dual Modular
Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying
on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR
is light-weight because it exploits the underutilized parallelism in GPGPU computing
for error detection. Error detection spans both within a warp as well as between warps,
called intra-warp and inter-warp DMR, respectively. Intra-warp DMR simply uses the
inactive threads to verify the execution of active threads in the same warp by using dual
modular spatial redundancy. When all threads are active in a warp, we use dual mod-
ular temporal redundancy, which is called inter-warp DMR. A duplicated execution of
each fully utilized warp is scheduled for execution later whenever the associated exe-
cution unit becomes idle. Warped-DMR achieves 94% error coverage while incurring
a worst-case 8% performance overhead without extra execution units or programmer’s
effort.
Reference List
[1] “Legit reviews.” http://www.legitreviews.com/.
[2] M. Gebhart, S. W. Keckler, and W. J. Dally, “A Compile-time Managed Multi-level Reg-
ister File Hierarchy,” in Proceedings of IEEE/ACM International Symposium on Microar-
chitecture, pp. 465–476, 2011.
[3] NVIDIA, “NVIDIA CUDA SDK 3.0.” https://developer.nvidia.com/
cuda-toolkit-30-downloads.
[4] Xin Fu and Nilanjan Goswami and Tao Li, “Analyzing Soft-Error Vulnerability on GPGPU
Microarchitecture,” in Proceedings of the 2011 IEEE International Symposium on Work-
load Characterization, pp. 226–235, November 2011.
[5] A. Dixit and A. Wood, “The Impact of New Technology on Soft Error Rates,” in Proceed-
ings of The workshop on Silicon Errors in Logic - System Effects, 2011.
[6] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with
cuda,” Queue, vol. 6, no. 2, pp. 40–53, 2008.
[7] NVIDIA, “Fermi white paper v1.1.” http://www.nvidia.com/content/
PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_
Whitepaper.pdf.
[8] Mark Gebhart and Daniel R. Johnson and David Tarjan and Stephen W. Keckler and
William J. Dally and Erik Lindholm and Kevin Skadron, “Energy-efficient Mechanisms for
Managing Thread Context in Throughput Processors,” in Proceedings of the 38th annual
International Symposium on Computer Architecture, pp. 235–246, Jun 2011.
[9] N. Jayasena, M. Erez, J. H. Ahn, and W. J. Dally, “Stream Register Files with Indexed
Access,” in Proceedings of IEEE International Symposium On High Performance Com-
puter Architecture, pp. 60 – 72, 2004.
[10] H. Kim, R. Vuduc, S. Baghsorkhi, J. Choi, and W.-m. Hwu, Performance Analysis and
Tuning for General Purpose Graphics Processing Units. Morgan & Claypool Publishers,
1st ed., 2012.
[11] J. Lai and A. Seznec, “Performance Upper Bound Analysis and Optimization of SGEMM
on Fermi and Kepler GPUs,” in Proceedings of ACM/IEEE International Symposium on
Code Generation and Optimization, pp. 1–10, 2013.
[12] Rogers, Timothy G. and O’Connor, Mike and Aamodt, Tor M., “Cache-Conscious Wave-
front Scheduling,” in Proceedings of the 45th Annual IEEE/ACM International Symposium
on Microarchitecture, pp. 72–83, 2012.
[13] Shubhendu S. Mukherjee and Michael Kontz and Steven K. Reinhardt, “Detailed Design
and Evaluation of Redundant Multithreading Alternatives,” in Proceedings of the 29th
annual International Symposium on Computer Architecture, pp. 99–110, May 2002.
[14] Intel, “Haswell - The 4th Generation Intel Core Processor Platform.”
http://www.intel.com/content/www/us/en/intelligent-systems/
shark-bay/4th-generation-core-q87-chipset.html.
[15] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits. Prentice Hall,
2nd ed., 2003.
[16] NVIDIA, “NVIDIA GeForce GTX 680 white paper v1.0.”
http://www.geforce.com/Active/en_US/en_US/pdf/
GeForce-GTX-680-Whitepaper-FINAL.pdf.
[17] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and
V. J. Reddi, “Gpuwattch: Enabling energy optimizations in gpgpus,” in Proceedings of
ACM/IEEE International Symposium On Computer Architecture, pp. 487–498, 2013.
[18] J. E. Lindholm, M. Y. Siu, S. S. Moy, S. Liu, and J. R. Nickolls, “Simulating multiported
memories using lower port count memories,” in US Patent No. 7339592, 2008.
[19] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers, “The Case for Lifetime Reliability-
Aware Microprocessors,” in Proceedings of ACM/IEEE International Symposium On Com-
puter Architecture, pp. 276–, 2004.
[20] R. Balasubramanian and K. Sankaralingam, “Virtually-aged Sampling DMR: Unifying Cir-
cuit Failure Prediction and Circuit Failure Detection,” in Proceedings of IEEE/ACM Inter-
national Symposium on Microarchitecture, pp. 123–135, 2013.
[21] J. Shin, V. Zyuban, P. Bose, and T. M. Pinkston, “A Proactive Wearout Recovery Approach
for Exploiting Microarchitectural Redundancy to Extend Cache SRAM Lifetime,” in Pro-
ceedings of ACM/IEEE International Symposium On Computer Architecture, pp. 353–362,
2008.
[22] M. Namaki-Shoushtari, A. Rahimi, N. Dutt, P. Gupta, and R. K. Gupta, “ARGO: Aging-
aware GPGPU Register File Allocation,” in Proceedings of IEEE/ACM/IFIP International
Conference on Hardware/Software Codesign and System Synthesis, pp. 30:1–30:9, 2013.
[23] M. Abdel-Majeed and M. Annavaram, “Warped Register File: A Power Efficient Register
File for GPGPUs,” in Proceedings of IEEE International Symposium On High Performance
Computer Architecture, pp. 412–423, 2013.
[24] N. B. Lakshminarayana and H. Kim, “Spare Register Aware Prefetching for Graph Algo-
rithms on GPUs,” in Proceedings of IEEE International Symposium On High Performance
Computer Architecture, pp. 614–625, 2014.
[25] Daley, Robert C. and Dennis, Jack B., “Virtual Memory, Processes, and Sharing in MUL-
TICS,” Commun. ACM, vol. 11, pp. 306–312, May 1968.
[26] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, “The
Tera Computer System,” in Proceedings of ACM International Conference on Supercom-
puting, pp. 1–6, 1990.
[27] “asfermi: An assembler for the NVIDIA Fermi Instruction Set.” https://code.
google.com/p/asfermi/.
[28] NVIDIA, “CUDA Binary Utilities.” http://docs.nvidia.com/cuda/
cuda-binary-utilities/index.html.
[29] S. Li, K. Chen, J. H. Ahn, J. B. Brockman, and N. P. Jouppi, “CACTI-P: Architecture-level
Modeling for SRAM-based Structures with Advanced Leakage Reduction Techniques,” in
Proceedings of ACM/IEEE International Conference on Computer-Aided Design, pp. 694–
701, 2011.
[30] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kan-
demir, and V. Narayanan, “Leakage Current: Moore’s Law Meets Static Power,” Computer,
vol. 36, pp. 68–75, Dec. 2003.
[31] M. Anis and M. Elmasry, Multi-threshold CMOS Digital Circuits: Managing Leakage Power.
Springer, 1st ed., 2003.
[32] T. Yamashita, N. Yoshida, M. Sakamoto, T. Matsumoto, M. Kusunoki, H. Takahashi,
A. Wakahara, T. Ito, T. Shimizu, K. Kurita, K. Higeta, K. Mori, N. Tamba, N. Kato,
K. Miyamoto, R. Yamagata, H. Tanaka, and T. Hiyama, “A 450 MHz 64 b RISC processor
using multiple threshold voltage CMOS,” in Proceedings of IEEE International Solid-State
Circuits Conference, pp. 414–415, Feb 2000.
[33] L. Wei, Z. Chen, M. Johnson, K. Roy, and V. De, “Design and Optimization of Low Voltage
High Performance Dual Threshold CMOS Circuits,” in Proceedings of the 35th Annual
Design Automation Conference, pp. 489–494, 1998.
[34] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microar-
chitectural Techniques for Power Gating of Execution Units,” in Proceedings of the Inter-
national Symposium on Low Power Electronics and Design, pp. 32–37, 2004.
[35] S. Kim, S. V. Kosonocky, and D. R. Knebel, “Understanding and Minimizing Ground
Bounce During Mode Transition of Power Gating Structures,” in Proceedings of the Inter-
national Symposium on Low Power Electronics and Design, pp. 22–25, 2003.
[36] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner, “Theoretical and Practical Limits of
Dynamic Voltage Scaling,” in Proceedings of the 41st Annual Design Automation Confer-
ence, pp. 868–873, 2004.
[37] T. D. Burd and R. W. Brodersen, “Design Issues for Dynamic Voltage Scaling,” in Pro-
ceedings of the International Symposium on Low Power Electronics and Design, pp. 9–14,
2000.
[38] S. Borkar, “The Exascale Challenge.” http://cache-www.intel.com/cd/00/
00/46/43/464316_464316.pdf.
[39] “International Technology Roadmap for Semiconductors.” http://www.itrs.net/
Links/2013ITRS/2013Chapters/2013ExecutiveSummary.pdf.
[40] P. Packan, S. Akbar, M. Armstrong, D. Bergstrom, M. Brazier, H. Deshpande, K. Dev,
G. Ding, T. Ghani, O. Golonzka, W. Han, J. He, R. Heussner, R. James, J. Jopling,
C. Kenyon, S.-H. Lee, M. Liu, S. Lodha, B. Mattis, A. Murthy, L. Neiberg, J. Neirynck,
S. Pae, C. Parker, L. Pipes, J. Sebastian, J. Seiple, B. Sell, A. Sharma, S. Sivakumar,
B. Song, A. St.Amour, K. Tone, T. Troeger, C. Weber, K. Zhang, Y. Luo, and S. Natarajan,
“High performance 32nm logic technology featuring 2nd generation high-k + metal gate
transistors,” in Proceedings of IEEE International Electron Devices Meeting, pp. 1–4, Dec
2009.
[41] C. Auth, C. Allen, A. Blattner, D. Bergstrom, M. Brazier, M. Bost, M. Buehler, V. Chikar-
mane, T. Ghani, T. Glassman, R. Grover, W. Han, D. Hanken, M. Hattendorf, P. Hent-
ges, R. Heussner, J. Hicks, D. Ingerly, P. Jain, S. Jaloviar, R. James, D. Jones, J. Jopling,
S. Joshi, C. Kenyon, H. Liu, R. McFadden, B. Mcintyre, J. Neirynck, C. Parker, L. Pipes,
I. Post, S. Pradhan, M. Prince, S. Ramey, T. Reynolds, J. Roesler, J. Sandford, J. Seiple,
P. Smith, C. Thomas, D. Towner, T. Troeger, C. Weber, P. Yashar, K. Zawadzki, and K. Mis-
try, “A 22nm high performance and low-power CMOS technology featuring fully-depleted
tri-gate transistors, self-aligned contacts and high density MIM capacitors,” in Proceedings
of Symposium on VLSI Technology, pp. 131–132, June 2012.
[42] “Predictive Technology Model (PTM).” http://ptm.asu.edu/.
[43] J. Lim, N. B. Lakshminarayana, H. Kim, W. Song, S. Yalamanchili, and W. Sung, “Power
Modeling for GPU Architectures Using McPAT,” ACM Transactions on Design Automa-
tion of Electronic Systems, vol. 19, pp. 26:1–26:24, June 2014.
[44] Bakhoda, Ali and Yuan, George L. and Fung, Wilson W. L. and Wong, Henry and Aamodt,
Tor M., “Analyzing CUDA workloads using a detailed GPU simulator.”
http://www.ece.ubc.ca/~aamodt/gpgpu-sim/, 2009.
[45] NVIDIA, “NVIDIA CUDA SDK 2.3.” http://developer.nvidia.com/
cuda-toolkit-23-downloads.
[46] Stratton, John A. and Rodrigues, Christopher and Sung, I-Jui and Obeid, Nady and Chang,
Li-Wen and Anssari, Nasser and Liu, Geng Daniel and Hwu, Wen-mei W., “Parboil: A
Revised Benchmark Suite for Scientific and Commercial Throughput Computing.”
http://impact.crhc.illinois.edu/parboil.php.
[47] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, “Rodinia:
A Benchmark Suite for Heterogeneous Computing,” in Proceedings of IEEE International
Symposium on Workload Characterization, pp. 44–54, 2009.
[48] M. Moudgill, K. Pingali, and S. Vassiliadis, “Register Renaming and Dynamic Speculation:
An Alternative Approach,” in Proceedings of the 26th Annual International Symposium on
Microarchitecture, pp. 202–213, 1993.
[49] T. Monreal, V . Vinals, A. Gonzalez, and M. Valero, “Hardware schemes for early register
release,” in Proceedings of International Conference on Parallel Processing, pp. 5–13,
2002.
[50] O. Ergin, D. Balkan, D. Ponomarev, and K. Ghose, “Increasing processor performance
through early register release,”
[51] J. F. Martínez, J. Renau, M. C. Huang, M. Prvulovic, and J. Torrellas, “Cherry: Check-
pointed Early Resource Recycling in Out-of-order Microprocessors,” in Proceedings of the
35th Annual ACM/IEEE International Symposium on Microarchitecture, pp. 3–14, 2002.
[52] H. Akkary, R. Rajwar, and S. T. Srinivasan, “Checkpoint processing and recovery:
Towards scalable large instruction window processors,” in Proceedings of the 36th Annual
IEEE/ACM International Symposium on Microarchitecture, pp. 423–, 2003.
[53] NVIDIA, “NVIDIA Maxwell Architecture.” https://developer.nvidia.com/
maxwell-compute-architecture.
[54] M. M. Martin, A. Roth, and C. N. Fischer, “Exploiting Dead Value Information,” in Pro-
ceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture,
pp. 125–135, 1997.
[55] T. M. Jones, M. F. P. O’Boyle, J. Abella, A. Gonzalez, and O. Ergin, “Compiler Directed
Early Register Release,” in Proceedings of the 14th International Conference on Parallel
Architectures and Compilation Techniques, pp. 110–122, 2005.
[56] Lo, Jack L. and Parekh, Sujay S. and Eggers, Susan J. and Levy, Henry M. and Tullisen,
Dean M., “Software-Directed Register Deallocation for Simultaneous Multithreaded Pro-
cessors,” IEEE Transactions on Parallel Distributed Systems, vol. 10, pp. 922–933, Sept.
1999.
[57] S. Jourdan, R. Ronen, M. Bekerman, B. Shomar, and A. Yoaz, “A novel renaming scheme
to exploit value temporal locality through physical register reuse and unification,” in Pro-
ceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture,
pp. 216–225, 1998.
[58] Ergin, Oguz and Balkan, Deniz and Ghose, Kanad and Ponomarev, Dmitry, “Register Pack-
ing: Exploiting Narrow-Width Operands for Reducing Register File Pressure,” in Pro-
ceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture,
pp. 304–315, 2004.
[59] Lozano, Luis A. and Gao, Guang R., “Exploiting Short-lived Variables in Superscalar Pro-
cessors,” in Proceedings of the 28th Annual International Symposium on Microarchitec-
ture, pp. 292–302, 1995.
[60] Tarjan, David and Skadron, Kevin, “On Demand Register Allocation and Deallocation for
a Multithreaded Processor,” in US Patent No. 20110161616, 2011.
[61] S. Z. Gilani, N. S. Kim, and M. J. Schulte, “Power-efficient computing for compute-
intensive gpgpu applications,” in Proceedings of IEEE International Symposium on High
Performance Computer Architecture, pp. 330–341, 2013.
[62] Jog, Adwait and Kayiran, Onur and Mishra, Asit K. and Kandemir, Mahmut T. and Mutlu,
Onur and Iyer, Ravishankar and Das, Chita R., “Orchestrated Scheduling and Prefetching
for GPGPUs,” in Proceedings of the 40th Annual International Symposium on Computer
Architecture, pp. 332–343, 2013.
[63] Tanasic, Ivan and Gelado, Isaac and Cabezas, Javier and Ramirez, Alex and Navarro,
Nacho and Valero, Mateo, “Enabling Preemptive Multiprogramming on GPUs,” in Pro-
ceedings of the 41st Annual International Symposium on Computer Architecture, pp. 193–
204, 2014.
[64] Park, Jason Jong Kyu and Park, Yongjun and Mahlke, Scott, “Chimera: Collaborative Pre-
emption for Multitasking on a Shared GPU,” in Proceedings of the Twentieth International
Conference on Architectural Support for Programming Languages and Operating Systems,
pp. 593–606, 2015.
[65] M. Dimitrov and M. Mantor and H. Zhou, “Understanding Software Approaches for
GPGPU Reliability,” in Proceedings of 2nd Workshop on General Purpose Processing on
Graphics Processing Units, pp. 94–104, March 2009.
[66] Jeon, Hyeran and Loh, Gabriel and Annavaram, Murali, “Efficient RAS support for die-
stacked DRAM,” in IEEE International Test Conference, pp. 1–10, 2014.
[67] Keun Soo Yim and Cuong Pham and Mushfiq Saleheen and Zbigniew Kalbarczyk and
Ravishankar K. Iyer, “HAUBERK: Lightweight Silent Data Corruption Error Detector for
GPGPU,” in Proceedings of 25th IEEE International Parallel & Distributed Processing
Symposium, pp. 287–300, May 2011.
[68] Victor W Lee and Changkyu Kim and Jatin Chhugani and Michael Deisher and Daehyun
Kim and Anthony D. Nguyen and Nadathur Satish and Mikhail Smelyanskiy and Srinivas
Chennupaty and Per Hammarlund and Ronak Singhal and Pradeep Dubey, “Debunking
the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,”
in Proceedings of the 37th annual International Symposium on Computer Architecture,
pp. 451–460, Jun 2010.
[69] “ERCBench.” http://ercbench.ece.wisc.edu/.
[70] Lei Zhang and Yinhe Han and Qiang Xu and Xiaowei Li, “Defect Tolerance in Homo-
geneous Manycore Processors Using Core-Level Redundancy with Unified Topology,” in
Proceedings of the Conference on Design, Automation and Test in Europe, pp. 891–896,
March 2008.
[71] Dweik, Waleed and Majeed, Mohammad Abdel and Annavaram, Murali, “Warped-Shield:
Tolerating Hard Faults in GPGPUs,” in Proceedings of the 2014 44th Annual IEEE/IFIP
International Conference on Dependable Systems and Networks, pp. 431–442, 2014.
[72] Majeed, Mohammad Abdel and Dweik, Waleed and Jeon, Hyeran and Annavaram, Murali,
“Warped-RE: Low-Cost Error Detection and Correction in GPUs,” in Proceedings of the
2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Net-
works, 2015.
[73] Sunpyo Hong and Hyesoon Kim, “An Integrated GPU Power and Performance Model,”
in Proceedings of the 37th annual International Symposium on Computer Architecture,
pp. 280–289, Jun 2010.
[74] M. Abdel-Majeed, D. Wong, and M. Annavaram, “Warped gates: Gating aware scheduling
and power gating for gpgpus,” in Proceedings of the 46th Annual IEEE/ACM International
Symposium on Microarchitecture, pp. 111–122, 2013.
[75] Q. Xu and M. Annavaram, “Pats: Pattern aware scheduling and power gating for gpgpus,”
in Proceedings of the 23rd International Conference on Parallel Architectures and Compi-
lation, pp. 225–236, 2014.
[76] Synopsys, “Design Compiler User Guide.” http://acms.ucsd.edu/info/
documents/dc/dcug.pdf, 2010.
[77] NVIDIA, “GeForce 400 Series.” http://en.wikipedia.org/wiki/GeForce_
400_Series.
[78] Ralph Nathan and Daniel J. Sorin, “Argus-G: A Low-Cost Error Detection Scheme for
GPGPUs,” in Workshop on Resilient Architectures, December 2010.
[79] Shuou Nomura and Matthew D. Sinclair and Chen-Han Ho and Venkatraman Govin-
daraju and Marc de Kruijf, “Sampling + DMR: Practical and Low-overhead Permanent
Fault Detection,” in Proceedings of the 38th annual International Symposium on Computer
Architecture, pp. 201–212, Jun 2011.
[80] Steven K. Reinhardt and Shubhendu S. Mukherjee, “Transient Fault Detection via Simul-
taneous Multithreading,” in Proceedings of the 27th annual International Symposium on
Computer Architecture, pp. 25–36, 2000.
[81] Timothy J. Slegel and Robert M. Averill III and Mark A. Check and Bruce C. Giamei and
Barry W. Krumm and Christopher A. Krygowski and Wen H. Li and John S. Liptay and
John D. MacDougall, “IBM’s S/390 G5 microprocessor design,” in IEEE MICRO, pp. 12–
23, March 1999.
[82] Sumeet Kumar and Aneesh Aggarwal, “Reducing Resource Redundancy for Concurrent
Error Detection Techniques in High Performance Microprocessors,” in Proceedings of the
12th International Symposium on High Performance Computer Architecture, pp. 212–221,
February 2006.
[83] Norio Ohkubo and Tatsuya Kawashimo and Makoto Suzuki and Yuji Suzuki and Jun
Kikuchi and Masahiro Tokoro and Ryo Yamagata and Eiki Kamada and Takeo Yamashita
and Teruhisa Shimizu and Tohu Hashimoto and Toshiko Isobe, “A fault-detecting 400 MHz
floating-point unit for a massively-parallel computer,” in International Solid-State Circuits
Conference, pp. 368–369, February 1999.
[84] Daniel Lipetz and Eric Schwarz, “Self Checking in Current Floating-Point Units,” in Pro-
ceedings of the IEEE 20th Symposium on Computer Arithmetic, pp. 73–76, July 2011.
[85] J. Wadden, A. Lyashevsky, S. Gurumurthi, V. Sridharan, and K. Skadron, “Real-world
design and evaluation of compiler-managed gpu redundant multithreading,” in Proceedings
of the 41st Annual International Symposium on Computer Architecture, pp. 73–84, 2014.
[86] D. Tiwari, S. Gupta, J. Rogers, D. Maxwell, P. Rech, S. Vazhkudai, D. Oliveira, D. Londo,
N. DeBardeleben, P. Navaux, L. Carro, and A. Bland, “Understanding gpu errors on large-
scale hpc systems and the implications for system design and operation,” in Proceedings of
the 21st International Symposium on High Performance Computer Architecture, pp. 331–
342, 2015.