Cache Analysis and Techniques for
Optimizing Data Movement Across the
Cache Hierarchy for HPC Workloads
by
Aditya Madhusudan Deshpande
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2019
Copyright 2019 Aditya Madhusudan Deshpande
Dedication
To my family...
Acknowledgments
It is with pleasure that I thank all those who made this dissertation possible through their support, guidance, and encouragement.
First and foremost, I would like to express my deepest gratitude to my advisor Dr. Jeffrey Draper for his time, mentorship, guidance and encouragement throughout my
graduate studies. He supported and mentored me through every obstacle I faced. He
gave me the freedom to pursue a wide range of research interests. I will always be
grateful for having worked with him.
I am also grateful to my dissertation and qualifying committees, including Dr. Sandeep Gupta, Dr. Robert Lucas, Dr. Alice Parker, and Dr. Xuehai Qian, for providing their time, guidance and constructive feedback.
I want to extend my thanks to Simon Hammond, Arun Rodrigues, and Scott Hemmert for their mentorship during my internship at Sandia National Laboratories. I
thank Jim Ang for providing me an opportunity to intern at the Computer Science
Research Institute, Sandia National Laboratories and supporting my research at USC.
I want to thank all the Ph.D. students at our MARINA group. My special thanks
to Jeff Sondeen, T.J. Kwon and Woojin Choi for their help with the tools. I cherished many discussions with my office mates Lihang Zhao and Praveen Sharma on a wide
range of research topics. Conversations with them helped iron out many ideas. I
would also like to extend my thanks to the colleagues and staff at the Information Sciences Institute, graduate students in the Computer Engineering group, and teachers at USC who have contributed to my success in professional and academic endeavors.
I would also like to thank Melissa S. Smith, Diane Demetras and other academic
advisors in the Electrical Engineering department at USC for their help throughout
this journey.
I would also like to thank all my friends for their continuous encouragement and patience. I would especially like to thank Vivek Bhandwalkar, Prasanjeet Das, Santosh Mundhe, and Tejas Bansod for their companionship and for always being a phone call away.
I would also like to thank all my family members who continuously encouraged and motivated me. I would especially like to thank my uncle Govind and aunt Purnima Deshpande for providing a home away from home and support throughout my graduate studies. I thank my uncle Vivek Tatke for mentoring me throughout this journey.
Finally, and most importantly, I would like to thank my parents, Dr. Madhusudan Deshpande and Dr. Anjali Deshpande, and my sister Aditi Deshpande for their unconditional love, support, and encouragement through this journey. My journey tested their patience.
Abstract
High-Performance Computing is entrenched in our lives. It is used to model complex physical processes in the fields of science, engineering, and medicine, and it requires extremely large computing platforms. To continue new research over the next decade, the U.S. Department of Energy plans to build exascale systems, which require at least a 10x performance improvement while maintaining a flat energy profile. Building exascale systems necessitates that we design and develop high-performance and energy-efficient
hardware blocks. Traditional computing models rely on a bring data to core paradigm, which means that for any operation, data must be fetched from memory through the cache hierarchy into the processor's register file, operated upon, and then written back to memory through the cache hierarchy. Increasing cache sizes, deeper cache hierarchies, and growing application problem sizes lead to significant increases in the movement of data through the cache memory hierarchy. This increased data movement often causes performance bottlenecks and leads to poor performance and utilization of computing resources. Given the criticality of data movement across the memory hierarchy, my research focuses on optimizing the cache hierarchy from the perspective of minimizing excess and unnecessary data movement.
In this work, I focused on a trifecta of approaches to optimize data movement across the cache hierarchy. First, as application behavior plays an important part in dictating the flow of data through the cache hierarchy, we characterize applications. Specifically, we characterized PathFinder, a proxy application for the graph-analytics signature-search class of algorithms critical to the DOE HPC community. Our results show that inefficient utilization of the L2 cache (50% cache hit rate) results in increased data movement. Then we characterized applications on various locality metrics to quantify differences between several classes of applications. Graph kernels show 20% less spatial and temporal locality compared to other HPC applications, lower (50%) data intensiveness, and a 90% data turnover rate. Second, we developed a cache energy model to measure energy in caches when running applications on real-world systems. Using our model, we observed that leakage energy is the primary energy dissipation mechanism in L1 caches and accounts for up to 80% of cache energy. We quantified the implications of using heterogeneous cacheline sizes across different levels of the cache hierarchy and of cache set-associativity on data movement. Using different cacheline sizes across different levels of the cache hierarchy, we observed over 13% data movement savings and over 30% improvement in L2/L3 cache hit rates, and using a 2-way L1 cache results in over 18% savings over an 8-way L1 cache. Third, we present a cache guidance system framework that allows us to expose cache architecture artifacts to users to explicitly dictate the behavior of an individual cache. Using one guidance generation scheme, we demonstrated over a 30% reduction in cache misses in application phases.
Contents
Contents 7
List of Figures 10
List of Tables 13
1 Introduction 14
1.1 Problem and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Thesis Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.1 Application Characterization . . . . . . . . . . . . . . . . . . . 17
1.2.2 Architecture Characterization . . . . . . . . . . . . . . . . . . 19
1.2.3 Cache Guidance System . . . . . . . . . . . . . . . . . . . . . 22
1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.4 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2 Background 26
3 PathFinder Characterization Study 42
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 PathFinder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.1 Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Performance Modeling, Results . . . . . . . . . . . . . . . . . . . . . 48
3.4.1 Experiment Environment . . . . . . . . . . . . . . . . . . . . . 48
3.4.2 Performance Characterization . . . . . . . . . . . . . . . . . . 50
3.4.3 Cache Characteristics . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 Memory Access Pattern Modeling 63
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.2 Background and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Application Address-Space Modeling . . . . . . . . . . . . . . . . . . 75
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5 Cache Energy Study 87
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Energy Estimation Techniques . . . . . . . . . . . . . . . . . . . . . . 90
5.3 Energy Estimation Model . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6 Cache Architecture Parameter Characterization 107
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.3 Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.4.1 Cacheline size Sensitivity Study . . . . . . . . . . . . . . . . . 117
6.4.2 Cache Associativity Sensitivity Study . . . . . . . . . . . . . . 127
6.4.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7 Cache Guidance System 136
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
7.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.3 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
7.4 Modeling, Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . 146
7.4.2 Guidance Generation . . . . . . . . . . . . . . . . . . . . . . . 147
7.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8 Conclusion 155
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2 Looking Forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Bibliography 159
List of Figures
2.1 Steps involved in a memory operation . . . . . . . . . . . . . . . . . . 36
3.1 PathFinder graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Execution time for PathFinder . . . . . . . . . . . . . . . . . . . . . . 51
3.4 Problem size scaling for PathFinder . . . . . . . . . . . . . . . . . . . 52
3.5 Density scaling for PathFinder . . . . . . . . . . . . . . . . . . . . . . 53
3.6 Unique-label scaling for PathFinder . . . . . . . . . . . . . . . . . . . 55
3.7 Strong scaling for PathFinder . . . . . . . . . . . . . . . . . . . . . . 56
3.8 L1 cache Hit Rate for PathFinder . . . . . . . . . . . . . . . . . . . . 57
3.9 L2 cache Hit Rate for PathFinder . . . . . . . . . . . . . . . . . . . . 59
3.10 Combined private cache Hit Ratio for PathFinder . . . . . . . . . . . 60
4.1 Spatial Locality score for different HPC workloads . . . . . . . . . . 71
4.2 Temporal Locality score for different HPC workloads . . . . . . . . . 72
4.3 Data Intensiveness Locality score for different HPC workloads . . . . 73
4.4 Data turn-over score for different HPC workloads . . . . . . . . . . . 74
4.5 References and Evictions for HPCCG miniapp . . . . . . . . . . . . . 79
4.6 Address Count Histogram for HPCCG miniapp . . . . . . . . . . . . 80
4.7 Mantevo miniapps Address Counts Histogram . . . . . . . . . . . . . 81
4.8 GraphBIG Benchmark Kernel Address Counts Histogram . . . . . . . 82
5.1 Cache Energy Estimation Framework . . . . . . . . . . . . . . . . . . 94
5.2 PDGEMM L1 Data Cache Energy Distribution . . . . . . . . . . . . 97
5.3 PDGEMM L1 data cache total energy vs application runtime . . . . . 97
5.4 Dynamic Energy across various compiler optimizations for NAS Par-
allel Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5 Total L1 data cache energy across various compiler optimizations for
NAS Parallel Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . 99
5.6 L1 data cache energy distribution for IS benchmark with -O3 optimiza-
tion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.7 L1 data cache energy distribution for LU benchmark with -O3 opti-
mization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.8 Strong Scaling for HPCCG . . . . . . . . . . . . . . . . . . . . . . . . 103
5.9 Weak Scaling for HPCCG . . . . . . . . . . . . . . . . . . . . . . . . 104
6.1 Cache Hit Ratios for various cache hierarchy configurations . . . . . . 119
6.2 Main Memory Requests (a) Summary (b) All configurations . . . . . 121
6.3 Total Main Memory Traffic (a) Summary (b) All configurations . . . 122
6.4 Total Cache Data movement (a) Summary (b) All configurations . . . 125
6.5 Cache Hit Ratios for different L1 cache set-associativity configurations 127
6.6 Normalized L1 cache miss . . . . . . . . . . . . . . . . . . . . . . . . 128
7.1 Cache Design with guidance data-structures . . . . . . . . . . . . . . 143
7.2 Guidance Aware LRU Replacement Policy . . . . . . . . . . . . . . . 144
7.3 Percentage of total references for the top 5% most active addresses in a phase
for (a) miniapps (b) NPB benchmarks (c) GraphBIG kernels . . . . . 149
7.4 Guidance generation algorithms . . . . . . . . . . . . . . . . . . . . . 150
7.5 Total data movement change . . . . . . . . . . . . . . . . . . . . . . . 152
7.6 Cache miss improvements in an application phase . . . . . . . . . . . 153
List of Tables
3.1 Memory footprint (MB) for PathFinder . . . . . . . . . . . . . . . . . 61
4.1 Mantevo miniapps command-line arguments . . . . . . . . . . . . . . 76
4.2 Cache Performance for various benchmarks . . . . . . . . . . . . . . . 77
4.3 Average References per Evictions (ACRE) for various benchmarks . . 84
6.1 Cacheline size (B) for different cache configurations . . . . . . . . . . 118
6.2 Energy Costs for different cache models . . . . . . . . . . . . . . . . . 129
6.3 Total Energy costs for different models . . . . . . . . . . . . . . . . . 131
7.1 Weights for Guidance generation . . . . . . . . . . . . . . . . . . . . 151
Chapter 1
Introduction
High-Performance Computing (HPC) is our gateway to endeavors in the fields of science, engineering, medicine, and others. Modeling complex processes in these fields requires extremely large computing platforms. To continue facilitating new research over the next decade, the U.S. Department of Science and Department of Energy plan to build exascale systems, which require at least a 10x performance improvement while maintaining a flat energy profile compared to current state-of-the-art systems. This necessitates that we not only focus on improving performance but also on improving energy efficiency simultaneously across all the system components (CPU, hardware accelerators and memory systems). With nearly flat gains from semiconductor scaling in the last few years, architects and system designers cannot rely on transistor scaling alone for improving energy efficiency. With multiprocessor systems, the memory wall [HP03, BC11] problem has been exacerbated. Developing these next-generation machines under such circumstances provides both challenges and opportunities for computer architects. With ever-increasing core counts in a multiprocessor
chip, large amounts of data must be brought from memory to the cores for processing, and thus the movement of data through the memory hierarchy becomes
critical to system performance and must be optimized.
Traditional computing models rely on a bring data to core paradigm, which means that for any operation, data must be fetched from memory through the cache hierarchy into the processor's register file, operated upon, and then written back to memory through the cache hierarchy. Increasing cache sizes, deeper cache hierarchies, and growing application problem sizes lead to significant increases in the movement of data through the cache memory hierarchy. This increased data movement often causes performance bottlenecks and leads to poor performance and utilization of computing resources. Given the criticality of data movement across the memory hierarchy, my research focuses on optimizing the cache hierarchy from the perspective of minimizing excess and unnecessary data movement.
1.1 Problem and Motivation
The performance of the memory hierarchy has a strong correlation to overall system performance and has been identified as one of the top challenges in the pursuit of exascale computing [P. 08, R. 14]. The memory hierarchy consists of multiple levels of on-chip (private and shared) caches, off-chip memory, and various levels of secondary and tertiary storage. Advances in semiconductor technologies have led to multiprocessor systems and larger on-chip caches. With larger caches, every access becomes more expensive in terms of energy and latency and requires data to travel longer physical distances.
As wires do not scale as well as transistors, increased distances traveled translate to increased data movement costs (energy and latency). With more cores being incorporated on a chip, the complexity of managing data coherence and consistency increases and requires large numbers of control messages to be sent over on-chip interconnection networks for book-keeping purposes, a source of additional data movement.
A major source of inefficiency in data movement in current HPC systems is poor matching between application and cache characteristics. Typically, general-purpose processors are used for computing in the HPC community. Caches in these processors are architected to support a wide range of application profiles and have a fixed configuration, i.e., cache organization and operation cannot be modified by users. If an application's caching profile cannot be mapped to the available cache architecture, it results in cachelines being moved between various levels of the cache hierarchy inefficiently, often causing unnecessary and excess data movement and in turn consuming more energy. While hero programmers in the HPC community try to optimize their application algorithms by utilizing various compiler and system-software optimization knobs to squeeze all the performance they can from the memory system, they are still limited by the rigid nature of caches and the lack of hardware optimization knobs for performance tuning. Also, caches are designed to take advantage of spatial and temporal locality in memory access patterns; if applications have poor spatial and temporal locality, traditional caching approaches inherently cause unnecessary data movement. Given that energy consumption is directly proportional to data movement in the cache/memory hierarchy, we hypothesize that by focusing on optimizing and minimizing unnecessary data movement that occurs in
the cache/memory hierarchy, we can increase performance and efficiency of memory
subsystems and HPC systems overall.
1.2 Thesis Contributions
In this thesis, we focus on understanding and quantifying the implications of data movement in the memory hierarchy and exploring techniques that minimize unnecessary and excess data movement across the memory hierarchy to improve cache utilization and energy efficiency. We make the following contributions:
1.2.1 Application Characterization
We were among the first to characterize a class of graph-analytics applications crucial
to the DOE HPC community to understand data movement issues. Then we studied
application memory reference patterns and measured various locality metrics.
Graph Application Study
Graph processing is widely used in data analytics applications and is gaining attention in the Computational Scientific and Engineering (CSE) application community. These applications exhibit runtime profiles distinct from other CSE applications. Binary (executable) signature search represents an important class of graph applications. PathFinder [man, BCH+15], a proxy for real-world signature search applications, represents various properties of signature search and allows for evaluating performance characteristics of the algorithms used. PathFinder searches directed graphs to find characteristic signatures, which are specific sequences of labels associated with nodes in
the graph. The graph traversal finds the next label in a signature using a modified depth-first recursive search. We characterized PathFinder to understand its execution and cache characteristics.
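To make the traversal concrete, the following is a minimal sketch of the kind of label-sequence depth-first search PathFinder performs; the graph layout and single-label-per-node assumption are illustrative simplifications, not PathFinder's actual data structures.

    #include <cstdio>
    #include <vector>

    // Illustrative directed graph with one label per node.
    struct Graph {
        std::vector<int> label;                 // label[v] = label id of node v
        std::vector<std::vector<int>> adj;      // adj[v] = successors of v
    };

    // Depth-first search for the rest of a signature starting at node v.
    // matched = number of labels of `sig` already found along this path.
    static bool dfs(const Graph& g, int v, const std::vector<int>& sig,
                    size_t matched, std::vector<char>& onPath) {
        if (g.label[v] == sig[matched]) ++matched;   // this node supplies the next label
        if (matched == sig.size()) return true;      // whole signature found
        onPath[v] = 1;                               // avoid cycles on the current path
        for (int next : g.adj[v])
            if (!onPath[next] && dfs(g, next, sig, matched, onPath)) { onPath[v] = 0; return true; }
        onPath[v] = 0;
        return false;
    }

    int main() {
        Graph g;
        g.label = {1, 7, 2, 3};                      // node labels
        g.adj   = {{1, 2}, {2}, {3}, {}};            // edges 0->1, 0->2, 1->2, 2->3
        std::vector<int> sig = {1, 2, 3};            // signature: labels 1, then 2, then 3
        std::vector<char> onPath(g.label.size(), 0);
        bool found = false;
        for (int v = 0; v < static_cast<int>(g.label.size()) && !found; ++v)
            found = dfs(g, v, sig, 0, onPath);
        std::printf("signature %sfound\n", found ? "" : "not ");
        return 0;
    }

The pointer chasing visible in the recursive walk over adjacency lists is exactly the access pattern that the later cache characterization ties to poor L2 behavior.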
We observed that L1 and L3 cache hit ratios exceed 90%, whereas the L2 cache hit ratio was substantially lower (approximately 50%), which is a unique cache characteristic of the PathFinder application. Pointer chasing that occurs in the search for label matches in the graph leads to poor L2 cache behavior.
Memory Access Pattern Modeling
Applications in the HPC domain exhibit a wide range of behaviors and stress memory systems in different ways. To better understand and quantify how different applications stress memory systems, we analyzed memory reference patterns. In the first part of this work, we focus on characterizing and quantifying application characteristics in an architecture-neutral way to understand application behavior and its implications for memory hierarchy design. We analyzed applications on four different locality metrics, namely spatial locality (a measure of reuse of data near recently referenced data), temporal locality (a measure of reuse over time), data intensiveness (a measure of the unique data an application uses), and data turnover (a measure of the data changed between two phases of execution). We also analyzed how each of these metrics changes during application execution. In the second part, we model how the application address space is referenced and eviction activity from various levels of caches during application execution.
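To make the first two metrics concrete, the sketch below computes a simple spatial and temporal locality score from a raw address trace over a sliding window; the window size and block size are illustrative assumptions, not the formulation used in Chapter 4.

    #include <cstdint>
    #include <cstdio>
    #include <deque>
    #include <vector>

    struct LocalityScores { double spatial; double temporal; };

    // Fraction of references that reuse an exact address (temporal) or touch the
    // same cacheline-sized block as a recent reference (spatial).
    LocalityScores score(const std::vector<uint64_t>& trace,
                         size_t window = 32, uint64_t blockBytes = 64) {
        if (trace.empty()) return {0.0, 0.0};
        std::deque<uint64_t> recent;                 // last `window` addresses
        size_t spatialHits = 0, temporalHits = 0;
        for (uint64_t addr : trace) {
            bool sameAddr = false, nearAddr = false;
            for (uint64_t prev : recent) {
                if (prev == addr) sameAddr = true;                                 // exact reuse
                else if (prev / blockBytes == addr / blockBytes) nearAddr = true;  // nearby reuse
            }
            temporalHits += sameAddr;
            spatialHits  += nearAddr;
            recent.push_back(addr);
            if (recent.size() > window) recent.pop_front();
        }
        double n = static_cast<double>(trace.size());
        return { spatialHits / n, temporalHits / n };
    }

    int main() {
        // A streaming walk over an array: high spatial, low temporal locality.
        std::vector<uint64_t> trace;
        for (uint64_t i = 0; i < 1024; ++i) trace.push_back(0x1000 + 8 * i);
        LocalityScores s = score(trace);
        std::printf("spatial=%.2f temporal=%.2f\n", s.spatial, s.temporal);
        return 0;
    }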
Our results show that graph applications have 20% lower spatial and temporal locality compared to NPB [B+91] benchmarks and miniapp applications [BCD+12, man]. Also, graph applications exhibited a data-intensiveness score of 0.5, indicating that nearly half of the data fetched through cachelines was unused. They also exhibit larger data turnover compared to the NPB and miniapp benchmarks and applications. From the modeling of application address-space usage, address-space hot spots are readily identifiable, and known techniques for mitigation can then be applied. This modeling enables identification of address-space ranges for different priorities. Taken together, these metrics indicate critical architecture requirements for different classes of applications.
1.2.2 Architecture Characterization
In this part of the dissertation research, we developed a cache energy model that estimates the energy used in various levels of caches when running on real-world HPC systems. Then we study how different cache architecture parameters affect data movement in the cache hierarchy for a wide range of HPC workloads.
Cache Energy Study
Improving the energy efficiency of an HPC system requires a thorough understanding of cache energy. Our cache energy estimation model uses instrumentation to capture performance counters available in modern processors for individual cache accesses during application execution and combines them with cache-access energy information generated using cache architecture modeling tools to estimate total cache energy. Our model combines performance monitoring tools and computer architecture modeling to estimate energy in caches when running on large-scale real-world machines; it does not require any special instrumentation and is easily ported across different computing platforms.
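As an illustration of how the two data sources are combined, the sketch below multiplies measured access counts by per-access energies (as a cache modeling tool would supply) and adds a leakage term proportional to runtime; the structure follows the description above, but the field names and numeric values are placeholders, not the actual model parameters.

    #include <cstdio>

    // Hypothetical per-cache parameters, e.g., derived from a cache modeling tool.
    struct CacheEnergyParams {
        double readEnergyJ;    // dynamic energy per read access (J)
        double writeEnergyJ;   // dynamic energy per write access (J)
        double leakagePowerW;  // static (leakage) power while powered on (W)
    };

    // Counters collected during the run, e.g., via hardware performance counters.
    struct CacheCounters {
        unsigned long long reads;
        unsigned long long writes;
        double runtimeSec;
    };

    double totalCacheEnergy(const CacheEnergyParams& p, const CacheCounters& c) {
        double dynamicJ = c.reads * p.readEnergyJ + c.writes * p.writeEnergyJ;
        double leakageJ = p.leakagePowerW * c.runtimeSec;  // leakage accrues for the whole run
        return dynamicJ + leakageJ;
    }

    int main() {
        CacheEnergyParams l1d = {0.5e-9, 0.6e-9, 0.03};    // placeholder values
        CacheCounters run = {2000000000ULL, 500000000ULL, 120.0};
        double dynamicJ = run.reads * l1d.readEnergyJ + run.writes * l1d.writeEnergyJ;
        double totalJ = totalCacheEnergy(l1d, run);
        std::printf("dynamic = %.3f J, total = %.3f J, leakage share = %.1f%%\n",
                    dynamicJ, totalJ, 100.0 * (totalJ - dynamicJ) / totalJ);
        return 0;
    }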
Using this model, we estimate L1 cache energy for various real-world compute kernels and applications running on large-scale systems. We quantified the leakage energy contribution to the total cache energy for various parallel applications running in their native execution environment and studied the impact of compiler optimizations on the total cache energy. Based on our experiments, we observed that the leakage energy component of total cache energy varies between 40% and 80% for L1 data caches and between 60% and 85% for L1 instruction caches across a broad range of applications. Moreover, L2 and L3 caches are accessed less frequently than the L1 cache, indicating that the leakage energy component of total cache energy would be even higher for L2/L3 caches. Data sitting idle in the cache and not being utilized is detrimental to energy and performance and increases the leakage energy component costs. This result further underscores the need for malleable caches that can be tuned to better match individual applications.
Cache Architecture Parameter Characterization
Application characterization is an essential tool in understanding how applications interact with the underlying computing system. We are especially interested in the implications of cache parameters on application performance. Each cache configuration parameter not only impacts the behavior of that cache but also affects the overall cache hierarchy behavior during application execution. Cacheline or cache-block size is crucial to the efficiency with which all interactions occur inside the cache hierarchy. It is the fundamental unit of transfer between caches as well as main memory, and data in the caches are stored at this granularity. It also dictates cache tag size as well as the organization of the cache, its layout, and the interconnection network (NoC). Traditionally, the cacheline size is fixed first, and then caches are architected around it. Given this fundamental cache architecture parameter and its associated impact on the entire memory hierarchy, we analyze the effects of using different cacheline sizes for individual caches across the cache hierarchy for various HPC applications from the data movement standpoint. Another important cache architecture parameter is associativity. Associativity dictates the degree of freedom for placement of cacheline-sized blocks in each cache set. It affects the number of comparators needed in the cache. It also impacts cache replacement policies. By changing the associativity of a cache, one affects the reuse distribution and cache performance. A cache with more ways reduces cache contention among blocks in a cache set and allows data to remain in cache longer, potentially allowing for reuse and thereby minimizing data movement. On the other hand, such a cache consumes more energy per access, as it requires more tag comparisons for every cache access. Similarly, a cache with fewer ways has more cache sets while increasing contention among blocks in any given cache set. Such a cache consumes less energy per access, as it requires a smaller number of tag comparisons for every cache access. In effect, a set-associative cache with fewer ways trades off contention in each cache set against a larger number of sets. We study how a change in cache set-associativity affects the performance of the cache as well as its impact on the overall data movement across the cache hierarchy.
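As a first-order illustration of why cacheline size matters for data movement, the sketch below estimates bytes moved between two cache levels as the number of misses plus write-backs multiplied by the line size; the miss counts are assumed placeholders standing in for values that would come from simulation or performance counters.

    #include <cstdio>

    // First-order estimate: every miss moves one cacheline in, and every dirty
    // eviction moves one cacheline out.
    unsigned long long bytesMoved(unsigned long long misses,
                                  unsigned long long writebacks,
                                  unsigned long long lineBytes) {
        return (misses + writebacks) * lineBytes;
    }

    int main() {
        // Compare a 64B-line configuration against a 128B-line configuration.
        // The wider line reduces the miss count only as far as spatial locality
        // allows; with the assumed counts below, total traffic still grows.
        unsigned long long miss64 = 10000000ULL,  wb64 = 2000000ULL;
        unsigned long long miss128 = 6500000ULL, wb128 = 1300000ULL;  // assumed, not measured
        std::printf("64B lines : %llu MB moved\n", bytesMoved(miss64, wb64, 64) >> 20);
        std::printf("128B lines: %llu MB moved\n", bytesMoved(miss128, wb128, 128) >> 20);
        return 0;
    }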
Using a 2-way set-associative L1 cache, the cache hit rate decreased slightly, L2 misses and overall system data movement were unaffected, and cache energy decreased significantly (20-30%) compared to an 8-way L1 cache. In many application phases, using a 2-way or 8-way set-associative cache shows no difference in the cache hit rate or number of misses, and in such application phases it is beneficial to use a 2-way cache for energy savings. Similarly, using heterogeneous cacheline sizes for individual caches across different levels of the cache hierarchy resulted in an increased cache hit ratio, decreased main memory requests, and reduced overall system data movement compared to a fixed-cacheline-size cache hierarchy. Our results suggest that by varying the cacheline size and set-associativity of a cache, one can optimize cache performance and data movement. Different combinations of cacheline size and set-associativity produce optimum results for different classes of applications. At a minimum, future caches should provide options to specify the cacheline-size and associativity parameters of a cache before application execution to optimize cache performance and data movement and improve energy efficiency.
1.2.3 Cache Guidance System
Much of the inefficiency in data movement in current HPC systems stems from poor matching between applications and cache characteristics. Caches in general-purpose processors have a fixed configuration, i.e., the cache configuration and operation cannot be modified by users. If an application's caching profile cannot be effectively mapped to the available cache architecture, it results in cachelines being moved between various levels of the cache hierarchy inefficiently, often causing unnecessary data movement and thereby consuming more energy. While dynamically adaptive caches can help, they are not implemented in general-purpose processors because of their complex design. Thus, there exist no mechanisms for systems developers to pass their program-analysis insights down to the hardware level of individual caches. In this part of the dissertation research, we propose a cache guidance system, which utilizes a small user-controlled data structure that allows users to specify application guidance to individual caches. The cache uses that guidance to improve its caching decisions by biasing its replacement policy. This mechanism adds application-specific adaptability and reconfigurability to an otherwise rigid cache configuration without the need to implement complex dynamically adaptive caches.
Using one guidance generation scheme, we achieved up to a 2.5% reduction in total cache hierarchy data movement and over a 30% reduction in L1 cache misses in application phases for a wide range of HPC workloads. Our results demonstrate that by using effective guidance generation, unnecessary data movement can be reduced.
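A minimal sketch of the kind of replacement-policy biasing described above: a small user-populated table marks address ranges to prefer keeping resident, and the victim search skips such lines when a lower-priority victim exists. The table layout and interface are illustrative assumptions, not the actual guidance-system design of Chapter 7.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Illustrative guidance table: user-specified address ranges to prefer keeping.
    struct GuidanceRange { uint64_t lo, hi; };

    static bool isGuided(const std::vector<GuidanceRange>& g, uint64_t blockAddr) {
        for (const auto& r : g)
            if (blockAddr >= r.lo && blockAddr < r.hi) return true;
        return false;
    }

    struct Line { uint64_t blockAddr; unsigned lruAge; bool valid; };  // larger lruAge = older

    // Pick a victim in one cache set: prefer the oldest line NOT covered by
    // guidance; fall back to plain LRU if every resident line is guided.
    int pickVictim(const std::vector<Line>& set, const std::vector<GuidanceRange>& g) {
        int victim = -1, fallback = -1;
        unsigned bestAge = 0, bestFallbackAge = 0;
        for (int i = 0; i < static_cast<int>(set.size()); ++i) {
            if (!set[i].valid) return i;            // empty way wins immediately
            if (set[i].lruAge >= bestFallbackAge) { fallback = i; bestFallbackAge = set[i].lruAge; }
            if (!isGuided(g, set[i].blockAddr) && set[i].lruAge >= bestAge) {
                victim = i; bestAge = set[i].lruAge;
            }
        }
        return victim >= 0 ? victim : fallback;     // biased choice, LRU otherwise
    }

    int main() {
        std::vector<GuidanceRange> g = {{0x1000, 0x2000}};  // keep this hot range resident
        std::vector<Line> set = {{0x1040, 3, true}, {0x8000, 2, true},
                                 {0x9000, 1, true}, {0x1080, 0, true}};
        std::printf("victim way = %d\n", pickVictim(set, g));  // evicts way 1, not the guided way 0
        return 0;
    }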
1.3 Summary
To summarize, I focused on a trifecta of approaches to optimize data movement across the cache hierarchy. First, as application behavior plays an important part in dictating the flow of data through the cache hierarchy, we characterize application execution to understand cache utilization and data movement. Specifically, we characterized PathFinder, a proxy application for the graph-analytics signature-search class of algorithms critical to the DOE HPC community. Then we characterized applications on various locality metrics to quantify differences between various classes of applications.
Then we analyzed how the application address space is utilized during execution to identify address-space hotspots and address ranges that require different priorities. Second, we developed a cache energy model to measure energy in caches when running applications on real-world systems. Cache parameters affect individual cache performance as well as overall cache hierarchy performance and, in turn, data movement. Therefore, we characterize the implications of a heterogeneous-cacheline-size cache hierarchy and of cache set-associativity on data movement to motivate and quantify the need for better support for reconfigurable cache hierarchies in future systems. Third, we present a cache guidance system framework that allows us to expose cache architecture artifacts to users to explicitly dictate the behavior of an individual cache. The framework adds application-specific adaptability and reconfigurability to a fixed cache configuration without the need to implement complex dynamically adaptive caches. We demonstrate the use of this framework to minimize cache data movement.
1.4 Organization
The work is organized as follows. Chapter 2 provides a brief overview of caches, discusses some of the techniques to minimize cache energy, and presents one model to capture data movement costs in the memory hierarchy. In Chapter 3, I present the performance and cache characterization of PathFinder, a proxy for signature-search graph analytics applications. In Chapter 4, I present results from the locality modeling used to understand application behavior and from the application address-space use modeling. In Chapter 5, I present a cache energy model that estimates individual cache energy when running on real-world HPC systems. In Chapter 6, I present results from the architecture parameter characterization study, analyzing the impact of a heterogeneous-cacheline-size cache hierarchy and the sensitivity of cache set-associativity on cache performance and data movement. In Chapter 7, the cache guidance system is detailed. I conclude by discussing the next steps to optimize memory hierarchies for future systems in Chapter 8.
Chapter 2
Background
A CPU cache is a cache of the main memory [Smi82]. It is a small, fast buffer in which a system tries to keep the contents of a larger, slower memory that will be accessed soon [Hil87]. Caches operate on the principle of locality, a widely held rule of thumb that a program spends 90% of its execution time in only 10% of the code. Applications tend to reuse data and instruction code that was recently accessed. There exist two types of locality in caches: temporal locality (locality in time), which suggests that recently accessed locations are likely to be accessed again soon, and spatial locality (locality in space), which suggests that memory locations near recently accessed locations are likely to be accessed soon. An ideal scenario would be to have an infinite amount of cache. As this is impractical, the goal is to use small, fast caches to speed up access to memory and improve CPU performance. With an increasing gap between CPU and memory speeds, a hierarchy of caches is used to augment performance. Caches closer to the processor tend to be small and operate near CPU speeds, while those closer to memory have larger capacities and require
many clock cycles for access. A cache is defined by a set of parameters: capacity, block size, associativity, replacement policy (RP), and write policy [HP03]. Capacity indicates the size of the cache. Block or cacheline size indicates the width of an entry in the cache; it is the smallest unit of memory transfer between the various levels of caches and memory. Associativity dictates the degree of freedom for a cache block residing in a cache set. If a block can reside at any location in the cache, the cache is called fully associative. Alternately, if a block can reside in only one location, it is a direct-mapped cache [Hil88]. If a cache block can reside in a restricted set of locations, such caches are known as set-associative caches. For example, if a block can reside in one of n locations in a cache set, it is an n-way set-associative cache.
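For concreteness, the sketch below shows how these three parameters determine a cache's geometry and how an address is decomposed into block offset, set index, and tag; the specific sizes in the example are arbitrary.

    #include <cstdint>
    #include <cstdio>

    // Derive cache geometry from capacity, block size, and associativity, and
    // split an address into set index and tag (the offset is addr % blockBytes).
    struct CacheGeometry {
        uint64_t blockBytes, ways, sets;
        uint64_t setOf(uint64_t addr) const { return (addr / blockBytes) % sets; }
        uint64_t tagOf(uint64_t addr) const { return (addr / blockBytes) / sets; }
    };

    CacheGeometry makeGeometry(uint64_t capacityBytes, uint64_t blockBytes, uint64_t ways) {
        return { blockBytes, ways, capacityBytes / (blockBytes * ways) };
    }

    int main() {
        // Example: a 32KB, 8-way cache with 64B lines has 64 sets.
        CacheGeometry g = makeGeometry(32 * 1024, 64, 8);
        uint64_t addr = 0x7ffe1234;
        std::printf("sets=%llu set=%llu tag=0x%llx\n",
                    (unsigned long long)g.sets,
                    (unsigned long long)g.setOf(addr),
                    (unsigned long long)g.tagOf(addr));
        return 0;
    }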
Replacement policies determine which block in the cache is replaced when a miss occurs. Commonly used strategies to decide which block to replace are random, least-recently-used (LRU), and first-in-first-out (FIFO) [SG85]. In a random RP, a candidate block is randomly selected. As implementing random selection logic is expensive in hardware, typically a pseudo-random function is used. LRU schemes are widely used as they reduce the chances of replacing currently in-use information. FIFO-based policies are similar to LRU except that they choose the replacement candidate by selecting the oldest block in the cache. FIFO schemes are easier to implement in hardware compared to the use-based LRU schemes. Write policies determine how writes in the cache are propagated to the main memory. Two widely used policies are write-through, wherein information is written to both the cache block and the memory, and write-back, wherein information is written to the cache block only, and the memory gets updated only upon its replacement. Most modern processors use a write-back policy, as it also helps in minimizing on-chip communication and data movement in the cache hierarchy.
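The difference between use-based LRU and FIFO can be made concrete with a small sketch: both keep an ordering of the blocks in a set, but LRU reorders on every hit while FIFO orders only by insertion time. This is an illustrative model of the policies, not a hardware implementation.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <deque>

    // One cache set modeled as an ordered list of block addresses:
    // front = replacement candidate, back = most recently inserted/used.
    struct SetState {
        std::deque<uint64_t> order;
        size_t ways;
    };

    // LRU: a hit moves the block to the back. FIFO: a hit leaves the order untouched.
    // Returns true on hit; on a miss the front block is evicted if the set is full.
    bool access(SetState& s, uint64_t block, bool lru) {
        auto it = std::find(s.order.begin(), s.order.end(), block);
        if (it != s.order.end()) {
            if (lru) { s.order.erase(it); s.order.push_back(block); }
            return true;
        }
        if (s.order.size() == s.ways) s.order.pop_front();  // evict the candidate
        s.order.push_back(block);
        return false;
    }

    int main() {
        SetState lruSet{{}, 2}, fifoSet{{}, 2};
        uint64_t trace[] = {1, 2, 1, 3, 1};                 // block 1 is reused repeatedly
        int lruHits = 0, fifoHits = 0;
        for (uint64_t b : trace) { lruHits += access(lruSet, b, true);
                                   fifoHits += access(fifoSet, b, false); }
        std::printf("LRU hits=%d FIFO hits=%d\n", lruHits, fifoHits);  // LRU keeps block 1 longer
        return 0;
    }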
Cache misses occur when the data is not found in the cache. In the case of a cache miss, the data needs to be fetched either from other caches in the hierarchy or from the main memory. Cache misses can be sorted into four categories. Compulsory misses, also known as cold misses, are misses that would occur even on an infinite cache; they are associated with fetching data from main memory for its first use. Capacity misses occur when a block is replaced and then retrieved (fetched) for use again; they are the cache misses that occur in addition to compulsory misses in a fully associative cache. Conflict or collision misses occur due to the block placement strategy; they are the cache misses that occur in addition to the compulsory and capacity misses in set-associative or direct-mapped caches. Coherence misses occur due to invalidations when using an invalidation-based coherence protocol. The resolution of cache misses has a strong correlation to processor performance and its energy consumption. The majority of improvements in cache designs focus on either reducing the number of cache misses or improving their resolution.
With cache behavior having a significant impact on system performance, there have been many advancements to improve the performance of caches over the years. Based on Hennessey and Patterson's [HP03] approach of categorizing various techniques for improving the performance of caches, we divide these techniques into four categories, namely reducing miss penalty, reducing miss rate, reducing miss penalty or miss rate via parallelism, and reducing the time to hit in the cache. We discuss some of these advances to cache architectures as follows:
Reducing Miss Penalty
Computer architects always face the dilemma of deciding whether to increase cache speed to keep pace with CPU speeds or to make caches larger to minimize the widening gap between processor and main-memory speed. Multi-level caches try to balance both problems: the first-level cache can be small, keeping pace with processor speeds, whereas second and higher levels provide the additional capacity to minimize the number of accesses going to main memory, effectively reducing the miss penalty. Multilevel caches started with two-level cache hierarchies [DKM+12]; now most modern processors use a three-level cache hierarchy, and future architectures are expected to incorporate a four-level cache hierarchy [HMB+14, FFD+14]. Multilevel caches require extra hardware to reduce miss penalty; the critical-word-first and early-restart techniques do not require any additional logic. When a cache miss is returned from higher-level caches or main memory, the data from the requested address location is returned first, before the remaining words of the cache block are transferred, potentially allowing an early restart of CPU execution and thereby reducing miss penalty. Generally, critical-word-first and early-restart techniques benefit caches with large cache blocks or bandwidth-limited caches. Cache requests are either READ or WRITE. With the use of write buffers in the cache, WRITE requests are generally not on the processor's critical path. Therefore, by prioritizing READ requests, on whose resolution CPU execution depends, one can reduce the associated cache-miss penalty. Both write-through and write-back caches use a small cache called a write buffer to store updates before they are
written back to the main memory. The CPU posts its WRITE requests to this buffer and then continues with its operation; the write buffer processes these WRITE operations and writes the words to memory. If the write buffer is full, the CPU has to stall until a slot is freed in the buffer. If, while writing to the buffer, the incoming address of the new data word is checked against the block addresses of the entries in the write buffer, then multiple write entries can be combined into a single entry of the write buffer; this technique of merging writes of words in a single cache block is called write merging [Smi82]. It improves write-buffer utilization and reduces the total number of writes going to memory, leading to reduced overall cache traffic. Victim caches [Jou90] also help in reducing the miss penalty by using a small buffer to store recently replaced or discarded cache blocks in case they are needed again. A small cache of up to five entries has been shown to be effective at reducing misses in direct-mapped caches [Jou90].
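The victim-cache idea can be sketched as a tiny fully-associative buffer consulted on an L1 miss; the four-entry size follows the small-buffer spirit of [Jou90], while the rest of the structure is an illustrative assumption.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <deque>

    // A tiny fully-associative victim buffer holding recently evicted block addresses.
    struct VictimCache {
        std::deque<uint64_t> blocks;
        size_t capacity = 4;                       // small, in the spirit of the original proposal

        // On an L1 eviction, remember the discarded block.
        void insert(uint64_t block) {
            blocks.push_back(block);
            if (blocks.size() > capacity) blocks.pop_front();
        }
        // On an L1 miss, check the buffer; a hit avoids going to the next level.
        bool probe(uint64_t block) {
            auto it = std::find(blocks.begin(), blocks.end(), block);
            if (it == blocks.end()) return false;
            blocks.erase(it);                      // block moves back into L1
            return true;
        }
    };

    int main() {
        VictimCache vc;
        vc.insert(0x40);                           // L1 evicted block 0x40
        std::printf("miss to 0x40 served by victim cache: %s\n",
                    vc.probe(0x40) ? "yes" : "no");
        return 0;
    }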
Reducing Miss Rate
The simplest way to reduce the miss rate is to increase the cache block size. Though increasing the block size decreases compulsory misses on account of improved spatial locality, it increases the miss penalty (longer transfer time) and may increase conflict and capacity misses. This increase in miss penalty may outweigh the decrease in the miss rate [Smi87]. Another popular way to reduce capacity misses is to increase the cache size; increasing cache size leads to increased cache hit time and energy per access. Increasing the set-associativity of a cache is another technique to reduce the miss rate: it increases the number of locations where a cache block can reside, reducing conflict misses [HS89] and also leading to better utilization of cache capacity. As a rule of thumb, an eight-way set-associative cache is as effective as a fully-associative cache at a lower energy cost. By another rule of thumb, for cache sizes less than 128KB, a direct-mapped cache of size N has nearly the same miss rate as a two-way set-associative cache of size N/2 [DAS12]. Increasing set-associativity increases the energy used per access. Way prediction [NLJ96] can also be used; it uses extra bits to predict the way (the block within the set) of the next cache access. The prediction leads to a single tag comparison in the first cycle. A prediction hit results in decreased latency and energy savings, whereas a prediction miss leads to comparisons with the rest of the blocks in the cache set in subsequent clock cycles. Compiler optimizations can also reduce miss rates without any hardware changes. Code optimizations like rearranging code without affecting correctness, instruction reordering, code alignment with cache blocks, and others have been shown to reduce cache miss rate.
Reducing Miss Rate/Penalty via Parallelism
Most modern processors use an out-of-order execution model; a non-blocking or lockup-free cache [Kro81, FJ94] allows for continued execution by allowing multiple requests to be processed in parallel in addition to the handling of cache misses. This optimization reduces the effective miss penalty by making caches useful during a cache miss, allowing requests to be processed instead of ignoring them. It lowers the effective miss penalty by overlapping multiple misses, albeit increasing the complexity of the cache design. This technique does not reduce the miss penalty of an individual cache miss but reduces the overall miss penalty. Prefetching can also help in reducing the miss penalty. In hardware-based prefetching [CB95], instructions and data are directly prefetched into the cache by the hardware based on its prediction of which cache blocks will be used next. Instruction cache prefetching is frequently done in hardware outside of the cache: for every cache miss, the processor fetches multiple cache blocks, the currently requested one and the next consecutive block(s). The requested block is fetched into the I-cache while the additional blocks get prefetched into an instruction stream buffer [Jou90]. Many processors, like the UltraSPARC III, use such a prefetch scheme [HL99]. Many prefetching techniques are used to accurately predict which cache blocks to prefetch that will be used in the future [CB95]. An alternative to hardware prefetching is compiler-driven prefetching, in which the compiler explicitly prefetches values either into registers (register prefetch) or into caches (cache prefetch). Both hardware- and software-driven prefetching make use of available additional memory bandwidth by anticipating the future needs of the cache.
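A minimal sketch of the sequential (next-line) prefetching just described: on a demand miss to block B, block B+1 is also requested into a small stream buffer. The structure is illustrative and does not model any particular processor's prefetcher.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <deque>

    // On each demand miss, queue the next sequential block into a small stream buffer.
    struct StreamBuffer {
        std::deque<uint64_t> prefetched;
        size_t capacity = 4;

        void onDemandMiss(uint64_t block) {
            prefetched.push_back(block + 1);               // next-line prefetch
            if (prefetched.size() > capacity) prefetched.pop_front();
        }
        // A later miss first checks the stream buffer before going to the next level.
        bool hit(uint64_t block) {
            auto it = std::find(prefetched.begin(), prefetched.end(), block);
            if (it == prefetched.end()) return false;
            prefetched.erase(it);
            return true;
        }
    };

    int main() {
        StreamBuffer sb;
        sb.onDemandMiss(100);                              // miss on block 100 prefetches 101
        std::printf("block 101 found in stream buffer: %s\n", sb.hit(101) ? "yes" : "no");
        return 0;
    }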
Reducing the time to hit in cache
Cache access latency is critical as it affects processor clock rates. The majority of the time, cache access latency is on the processor's critical path. As cache size is directly proportional to access latency, a smaller cache has a smaller access time. Most processors use virtual addresses, while the memory systems rely on physical addressing. Every access to main memory requires address translation, which consumes time and energy and can adversely impact access time. Minimizing the address translation time, or moving it off the critical path, allows for reduced access time. Another technique to reduce hit latency is to use pipelined caches. For example, the Pentium 4 takes four clock cycles to access its instruction cache. The split increases the number of pipeline stages, leading to a greater penalty on a mispredicted branch and increased clock cycles between issuing a load and using its fetched data. Pipelined cache accesses increase the bandwidth of instructions rather than decreasing the actual latency of a hit. Trace caches also improve cache hit time by loading a dynamic sequence of instructions, including branches, into a cache block [RBS96].
Most of the above techniques apply to both single-core and multi-core processor caches. Advances in multi-core architectures have resulted in large Last-Level Caches (LLCs) being incorporated in processor chips. As these LLCs typically run into several megabytes, they are physically distributed across the chip to optimize chip area and are connected with an interconnection network fabric. From an architecture standpoint, these distributed caches are monolithic, whereas from a circuit perspective they are distributed. Typically, a cache bank is aligned with an individual core on the chip. Because of physical proximity, a core can access data from some cache banks faster than from the remaining banks (which require access and transfer of data over the interconnection network). This led to a shift from the classical uniform cache access (UCA) model to Non-Uniform Memory Access or Non-Uniform Cache Access (NUMA/NUCA) architectures, wherein a processor accesses certain cache banks faster than other banks in the same hierarchy for improved performance [KBK03].
Energy Reduction Techniques
With multicore architectures becoming the de-facto standard for computing, manufacturers are incorporating deep cache hierarchies. As the amount of on-chip cache increases, so does the area occupied by it, and caches may account for up to 50% of total power budgets [Seg01, SC99]. Energy dissipation in the cache can be broadly classified into two categories: dynamic and static. Dynamic energy is dissipated when the state changes; it is the energy required for charging and discharging capacitances. In caches, it is directly proportional to the number of cache accesses. The static or leakage energy is dissipated whenever the circuit is powered on, independent of the state of the circuit [NH06]. With advances in chip fabrication technology, leakage energy losses have increased [ITR]. Borkar [BC11] estimates a factor of 7.5 increase in leakage current and a five-fold increase in total leakage energy dissipation for each technology node. Semiconductor companies now employ high-K dielectric materials to replace SiO2 (silicon dioxide) in the transistor gate to reduce leakage energy [BCGM07]. Although advances in FinFET transistor technology have provided a significant one-time reduction in leakage energy losses in transistors, leakage losses in the cache hierarchy remain a concern, as memory components are always active and occupy a large share of chip transistors [ITR]. Since leakage energy losses are directly proportional to the number of active transistors, deactivating unused transistors reduces leakage losses. The following circuit-level techniques help in deactivating unused transistors to reduce leakage:
Gated-Vdd: In the gated-Vdd approach [PYF+00], cache leakage is reduced by disconnecting SRAM cells from the Vdd or ground rails using a transistor, whereby portions of caches are dynamically shut down to control leakage. This approach was used for dynamic cache tuning approaches, for example DRI-Cache [YPF+01] and cache decay [KHM01]. In [LKfT+03], both state-preserving and state-destroying leakage control techniques are used to exploit data duplication across various levels of cache to reduce leakage.
Drowsy Cache: In the gated-Vdd approach, the supply voltage to a portion of memory is either on or gated (off). This can potentially lead to state destruction in caches. The drowsy cache [FKM+02, KFBM02] provides a compromise between the on and gated states by reducing the supply voltage to a data-retention voltage, thereby avoiding state destruction. These techniques are also known as state-preserving techniques. In [MBD05], the authors combined a slumberous mode with modifications to cache replacement policies.
Changing the Threshold Voltage (VT): The threshold voltage of a transistor can also be used to lower leakage power. The speed of a transistor depends on the threshold voltage VT: decreasing VT increases speed and dynamic power dissipation, whereas increasing VT can significantly reduce leakage power, albeit at the expense of speed. Therefore, most techniques that modify VT try to use high-VT cells on non-critical paths to reduce leakage and minimize the impact on performance [HHA+03, KKMR05].
Modeling Data Movement
Data movement occurs mainly on account of data transfers between two nodes. The nodes can be a processor, a cache, or the memory controller for off-chip access. Let us consider a simple example of fetching data from memory. Consider a three-level cache hierarchy consisting of L1, L2, and L3 caches followed by main memory. We assume a coherence protocol exists and that the cache hierarchy is inclusive. Figure 2.1 shows the steps involved in a memory operation. Let us assume that we are performing a memory operation (Load/Store) for address A. We have the following scenarios:
Figure 2.1: Steps involved in a memory operation
Cache Hit for A in L1
In this scenario, we access the L1 cache. Before sending data back to the processor, we need to perform a coherence check to make sure that the cacheline containing address A is in the correct coherence state. Depending on the coherence protocol, this involves two or more messages passed between coherence controllers and waiting for the response to complete the coherence check and to keep the cacheline in the correct coherence state.
Cache Miss for A in L1, Cache Hit in L2
In this scenario, A is first checked in L1, and address A is not present in L1. An entry is updated in the Miss-Handling Status Register (MHSR) in L1. Then L2 is accessed, and A is found. A coherence check must be performed before A is sent to the L1 cache. Once the coherence check is done, A is sent to the L1 cache. The L1 cache finds a replacement candidate, say B. If B is in a modified state, it needs to be written back to L2 before A is added to L1 (an inclusive cache means any cacheline present in L1 is also present in L2). If B is not in a modified state, it is dropped. Depending on the coherence protocol, it may trigger a directory update before being dropped. Then, finally, A is written to the L1 cache, and the value is returned to the processor.
Cache Miss for A in L1 and L2, Cache Hit in L3
In this scenario, similar to the previous scenario, A is accessed in L1; since it is not present in L1, an entry is created in the L1 MHSR. Then L2 is accessed; again, as A is not found in L2, another entry gets created in the L2 MHSR, and then A is looked up in the L3 cache. The entry for the address exists in the L3 cache, the data for A is accessed in the L3 cache, and a coherence check is performed to get it into the correct coherence state, ready for transfer. Transferring from L3 back to L1 involves two operations: adding A to L2 (to maintain cache inclusivity) and adding A to L1. When writing to L2, L2's replacement candidate, say C, must first be invalidated in L1 (if cached in L1) and then written back to L3 (if C is in a modified state). If C is in a modified state in L1, it also requires a write-back from L1 to L2 before invalidating itself in L1. This requires coherence transactions. Invalidating C in L1 creates an empty position in L1. Now C may need to be written back to the L3 cache from L2 and may require additional coherence transactions. Now A can be added to L2, replacing C. A must also be added to L1. L1's replacement candidate, say B, requires a write-back to L2 if it is in a modified state. Note that even though there is an empty spot available in L1 on account of the invalidation of C, it may not get used by L1 to add A, as the cache-set mapping may be different or the add-A-to-L1 request may reach L1 before the invalidation-of-C request.
Cache Miss for A in L1, L2, and L3 caches, and main-memory access
In this scenario, similar to the previous scenario, A is accessed in L1, L2, and L3, and entries are created for A in the L1 MHSR, L2 MHSR, and L3 MHSR. It requires a main-memory access and potentially triggers a page walk. Once the data for A arrives from main memory into the L3 cache, L3 finds a replacement candidate, say D. Now D needs to be evicted from the L3 cache. This would trigger the invalidation of D in L1 and L2, which in turn may require multiple coherence transactions. Also, A needs to be added to the L2 and L1 caches; additional invalidations and write-backs may occur, causing additional data movement and leaving up to two empty slots in L1 and one empty slot in L2.
These various scenarios highlight the complexity of data movement for a single memory operation. With multiple memory operations occurring in parallel, the complexity increases and leads to significantly more hidden traffic than just cache accesses. These scenarios list the data transfers that occur on account of moving cacheline-sized data items between various levels of caches and of coherence transactions; the energy costs also include management of the MHSRs, tag checks to identify the presence of an address, coherence-state checks and accesses, and management of the interconnection network nodes.
The data movement costs for moving data between the processor and the memory
can be measured in terms of energy [KGKH13], latency and the actual amount of
data moved. Costs also involve managing the data through the on-chip cache hier-
archy, wherein cache access can potentially trigger multiple cache events resulting in
additional data movement across the memory hierarchy [Kog14].
Quantification
We model the amount of data moved across the cache hierarchy during application execution. Data movement costs for the different types of data operations during application execution can be broadly categorized into four categories (a small accounting sketch follows the list below):
TotalDataMoved = FirstBrought + Replacements + Moved + Coherence
1. FirstBrought: Amount of data brought into the cache hierarchy from memory for application execution.
2. Replacements: Amount of data moved through the cache hierarchy on account of cache replacements or other cache operations like evictions and write-backs. This is the biggest source of data movement in the cache hierarchy.
3. Moved: Amount of data moved through the cache hierarchy on account of con-
text switches and end of program execution.
4. Coherence: Amount of data moved on account of managing and maintaining
coherence states.
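The sketch below expresses this accounting directly: the four components are tallied separately, in cacheline-sized units, and summed. The example counts are placeholders, not measured values.

    #include <cstdio>

    // Tally of data moved through the cache hierarchy, in cacheline-sized units.
    struct DataMovement {
        unsigned long long firstBrought;   // lines first fetched from memory
        unsigned long long replacements;   // lines moved by replacements, evictions, write-backs
        unsigned long long moved;          // lines flushed on context switches / program exit
        unsigned long long coherence;      // lines moved to manage coherence state

        unsigned long long totalLines() const {
            return firstBrought + replacements + moved + coherence;
        }
        unsigned long long totalBytes(unsigned lineBytes) const {
            return totalLines() * lineBytes;
        }
    };

    int main() {
        DataMovement dm = {1000000ULL, 8000000ULL, 50000ULL, 300000ULL};  // placeholder counts
        std::printf("total data moved: %llu MB\n", dm.totalBytes(64) >> 20);
        return 0;
    }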
Factors affecting data movement costs
The FirstBrought data in the cache hierarchy is equal to the memory footprint of the application and is a constant cost. The data Moved is affected by the number of context switches during application execution. Depending on the runtime system and other programs running on the processor, an application may get context switched often, leading to increased Moved costs. In many cases, depending on the system security requirements, a context switch flushes all the cachelines from the cache hierarchy to the main memory. Such a system increases both FirstBrought and Moved costs. Replacements is the most dominant contributor to the overall data movement. It is affected by both the cache architecture configuration and the software stack in the system. Cache architecture parameters at every level of the cache hierarchy affect Replacements costs. For example, a cache with a wider cacheline tends to incur higher data movement costs, as any replacement requires a larger chunk of data to be transferred. Another factor that affects Replacements costs is the memory reference stream and the ordering of memory requests generated by the compiler, operating system, and runtime system. If the application address space is very tightly organized, this can cause cache contention and increase overall data movement. Prefetching also affects Replacements, as it can pollute caches, requiring evicted lines to be re-fetched from higher in the cache hierarchy. In shared caches, data from other processes can also affect Replacements, as it can limit the cache resources available to the application. While Moved costs could also be counted as Replacements, it is better to separate them out, as Moved costs affect all the cachelines in the cache hierarchy. Coherence is dependent on the coherence protocol used. In directory-based protocols, management of the directory also adds to the overall Coherence costs.
In chapter 5, we discuss techniques related to cache tuning and the measurement
of cache energy for applications executing on real machines. Prior work related to the
measurement of reconfigurable cache designs is detailed in chapters 6 and 7.
Decades of cache research have explored a wide range of cache architecture ideas
to improve the performance or energy efficiency of caches. With applications ex-
hibiting a wide range of cache and memory access profiles, having a fixed cache ar-
chitecture without any application-specific support limits the available optimization
opportunities for system designers. A fixed, rigid architecture cannot be expected
to support a wide range of application profiles efficiently. With data-communication
costs far exceeding data computation costs, it becomes imperative that caches not just
support processors, but actively be involved in and efficient at managing proces-
sor demands. They need to provide constructs and abstractions that allow processors
to better exploit available memory resources to improve the performance of their
applications. Caches need to support processors intelligently.
Chapter 3
PathFinder Characterization Study
To optimize data movement in systems, the first step is to analyze application
behavior on current systems. This chapter presents one such example.
3.1 Introduction
Graph processing is widely used in data analytics applications in a variety of fields and
is rapidly gaining attention in the computational, scientific and engineering (CSE)
application community. One such application of graphs is in binary (executable)
signature search. For example, most commercial anti-virus programs use binary sig-
nature matching to detect malware. Often, malware evades binary signature search
detection via obfuscation, removing the signature from the malicious program. A
potentially better solution for malware detection is to determine the pattern of ex-
ecution for an executable binary under examination. Typically, this is done using
dynamic execution tracing of the binary under examination. However, malware pro-
grams are now evading dynamic analysis by detecting virtualization and limiting
execution accordingly. An alternative to dynamic analysis is to use static analysis.
Static analysis provides a more complete set of control paths compared to dynamic
analysis. Additionally, it can be performed in the absence of an instrumented exe-
cution environment. The approach is to use the binary's control flow graph (CFG)
to determine patterns of system calls made by that binary. Ordered sequences
of system calls can define patterns that can in turn be used as signatures. These signa-
tures are based on the CFG characteristics between the system calls. For example,
CFG metrics such as the number of hops between calls or common paths between
calls would indicate a likely signature match.
Graphs generated from binaries have several interesting properties not seen in
other sorts of graphs. An application proxy, named PathFinder, represents these
properties, allowing examination of the performance characteristics of the algorithms
used in this work. In particular, our goal is to find signatures in the control flow, for
example, the first, the shortest, or some statistical measure of the total number of
signatures.
3.2 Related Work
Graphs provide a powerful means for describing relationships within physical sys-
tems across a broad set of domains. CSE application programs organize unstructured
grid points as undirected graphs for spatial decomposition across parallel processes
[DBH+06, KK98]. Social networks use graphs to determine relationships between
members, wherein the graphs are undirected since the relationship is presumed to go
both ways. Betweenness centrality measures the number of shortest paths through a
node [Fre77], where a high value conveys something about the importance of a node.
The graph may be directed or undirected. PathFinder is designed to investigate con-
trol flow in a program using a directed graph. So in addition to node connectivity,
the actual execution path is important; i.e., in PathFinder, we are concerned with
how nodes are connected in addition to whether the nodes are connected.
The Graph500 benchmark [Gra] operates on graphs such as those defined by so-
cial networks. Using a breadth-first search of large undirected graphs, an ordering
of computer performance is based on the number of traversed edges per second (TEPS).
The Scalable Synthetic Compact Applications (SSCA) benchmark #2 [BM05] represents
betweenness centrality, where the graph may be directed or undirected. PathFinder
is similar to dominator analysis [Pro59, DSVPDB07], but the underlying goals are
different enough to make such a comparison insufficient for our purposes. Dominator
analysis simplifies a call graph by identifying basic blocks that always precede other
blocks when evaluated from a global start node traversed to a global end node. How-
ever, PathFinder's analysis may start and terminate anywhere in the graph, with no
node a dominator of any other, and cyclic paths are viewed as normal and desirable.
The characteristics of the graphs being operated on are different in ways that
impact the runtime characteristics of the algorithms operating on them. Graphs in
these other areas typically have a small diameter with a large degree, thus lead-
ing to the notion of six degrees of separation. PathFinder graphs are typically
long and skinny, look very directional, and thus have large diameters with small de-
grees. The key characteristics of PathFinder graphs and the algorithm are detailed in
[DDRB15]. PathFinder is a proxy application, maintained as part of the Mantevo
project [BCH+15] (www.mantevo.org). Unlike a benchmark, where rules constrain
experimentation, proxy applications are designed for modification and experimenta-
tion.
The PathFinder algorithm is detailed in section 3.3. In section 3.4 we describe the
experiment environment and the input graph generation process, followed by performance
characterization and cache behavior analysis. Section 3.5 summarizes
the results presented in this work.
3.3 PathFinder
3.3.1 Graphs
A graph is defined as

G = {V, E}

for a set of vertices V and a set of edges E. For a directed graph, the edges have a
direction, e.g., an edge e_i -> e_j does not imply an edge e_j -> e_i. PathFinder searches
directed graphs to find characteristic signatures, where a signature is a specific sequence
of labels that are associated with nodes in the graph. A path is a list of nodes where each
node has an edge to the next node in the path. Paths that match a signature begin with
a node labeled with the first label in the signature, and subsequent nodes in the path are
associated with the remaining signature labels in order. However, many nodes may
lie on the path between labels that are subsequent in the signature; the portion of a
path between two subsequent labels in a signature is called a leg, and the set of nodes
that make up the path between those two labels is the path for that leg. The signature
path terminates with a node associated with the last label in the signature. The
combination of all legs defines the complete signature path.

Figure 3.1: PathFinder graph
An example graph, illustrated in Figure 3.1, contains paths for the signatures
(Red, Blue), (Red, Green, Blue), (Red, Green, Yellow) and (Red, Red, Blue),
but not (Red, Green, Blue, Yellow).
The graphs can be noisy (missing edges, nodes or entire sequences of nodes and
edges), nested, directed and cyclic. The exterior graphs may also have interior Control
Flow Graphs (CFGs). The graphs can have large differences in sizes, may not be complete,
and may have irregular metadata and signatures. PathFinder currently has two modes
of operation. Normal operation searches for signatures as described above. The
exhaustive mode determines whether or not a path exists between each pair of labels
in the graph.
3.3.2 Algorithm
PathFinder traverses the graph in a modified depth-first recursive search, comparing
all adjacent nodes before recursing down the edges. If a match is found, the algorithm
advances to the next label in the signature and recurses with the matched node as the
new start. If there are no more labels in the signature, the search is deemed a success,
and the recursion stops. If no matches are found among the directly-connected nodes,
the algorithm recurses along each edge with the current label still being the one being
compared against. Once all edges have been checked without finding a match, the
recursion is terminated as a failed search.
PathFinder maintains a list of nodes for each label represented in the graph.
Each signature search begins at a node labeled with the first label in a signature.
PathFinder checks adjacent nodes for the next label in the signature. If not found,
PathFinder then performs a recursive depth-first search until the next label is found.
This completes the discovery of a leg of the signature. Once a leg is found, a new
search is started for the next label in the signature. Loops are not allowed within
a leg, but they may exist within the complete signature path. PathFinder iterates
through all nodes with the first label until it finds a valid signature path or terminates,
proving that the graph does not contain the signature.
struct NodeStruct
{
char *label;
int id, labelIdx, nodeCount, edgeCount, entranceCount;
NodeType type;
// If interior node, points to outer node containing this one
Node *container;
// All nodes contained in node's subgraph
NodeList *interiorNodes;
// Nodes accessed from this node (includes entrance nodes)
EdgeList *edges;
};
Figure 3.2: Data structures
The description of the graph is defined in terms of a C structure, shown in Fig-
ure 3.2. The structs within it (NodeType, Node, NodeList, interiorNodes, and
EdgeList) are essentially linked lists.
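As a concrete illustration of the leg search described above, the sketch below shows a
simplified recursive search for the next signature label. The Node and EdgeList
definitions here are minimal stand-ins (assumptions), since the full declarations behind
Figure 3.2 are not reproduced in this document; the actual PathFinder code additionally
iterates over starting nodes and signature labels, and the visited flags would need to be
cleared before each new leg.

/* Simplified sketch of a single-leg search, per the algorithm described above.
 * Node/EdgeList are minimal stand-ins for the structures in Figure 3.2. */
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct Node Node;
typedef struct EdgeList EdgeList;

struct EdgeList {          /* singly linked list of outgoing edges */
    Node     *node;
    EdgeList *next;
};

struct Node {
    const char *label;     /* label associated with this node (may be NULL) */
    bool        visited;   /* per-leg visit flag; loops not allowed within a leg */
    EdgeList   *edges;     /* nodes reachable from this node */
};

/* Search for the next label of the signature starting at 'start'.
 * All adjacent nodes are checked for a match first; only then does the
 * search recurse down each edge with the same label still being compared. */
static bool find_leg(Node *start, const char *label)
{
    start->visited = true;

    for (EdgeList *e = start->edges; e != NULL; e = e->next)
        if (e->node->label && strcmp(e->node->label, label) == 0)
            return true;                      /* leg found: caller advances the signature */

    for (EdgeList *e = start->edges; e != NULL; e = e->next)
        if (!e->node->visited && find_leg(e->node, label))
            return true;

    return false;                             /* all edges checked: failed leg */
}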
3.4 Performance Modeling, Results
We begin by describing the setup environment and graph problem generation method-
ology used for this study, followed by performance characterization and the impact of
various parameters of the input graph on application performance. We conclude the
discussion by presenting the cache characteristics of the PathFinder miniapp.
3.4.1 Experiment Environment
In this section, we describe the environment used for performing characterization
studies for the PathFinder miniapp.
Machine Description
We conducted experiments on Edison, a Cray XC30 system located at the National
Energy Research Scientific Computing Center (NERSC). Each compute node consists
of dual-socket Intel Xeon E5-2670 Ivy Bridge 12-core processors clocked at 2.4 GHz.
Each core has a private 64KB L1 and 256KB L2 cache. All cores on a socket share
a 30MB L3 cache. Each compute node uses 64GB DDR3 memory operating at 1600
MHz. The computing environment was set up using module PrgEnv-gnu/5.2.25, which
includes GCC compiler version 4.9.1, with which we used the flags -static -fopenmp
-O3. Runtime profiling data was collected using the Cray Performance Analysis Tool
(CrayPat) found in module perftool version 6.2.1.
Graph Generation
The complexity of a graph problem is often determined by the number of edges
and nodes in the graph. Similarly, for our PathFinder miniapp, the complexity of the
signature search is determined by the number of nodes in the graph and the
characteristics of the signature searches, largely influenced by the number
of unique labels in the graph and the aggregate label count across all the nodes in the
graph. The number of nodes determines the size of the graph, whereas the number
of unique-labels and total-labels determine the signature search time. To study the
impact of the total-labels in the graph on performance, we define density as the
average number of labels per node in the graph. Therefore, the density of a graph
problem is given by

Density = (Total labels in graph) / (Number of nodes)
The PathFinder miniapp can operate in two different modes: signature-search
mode, wherein a given signature is searched for in the graph; or exhaustive-search
mode, wherein, for every label pair in the graph, a path is determined if it exists.
Execution of the PathFinder miniapp can be divided into two phases: build, in which
the program reads the input and builds a graph, an operation that is inherently serial
in nature; and search, wherein a signature search operation is conducted. Signature
searches performed in exhaustive-search mode are embarrassingly parallel: an individ-
ual path/leg of a signature can be treated as an individual search and be distributed
among the available threads for computation. We use the exhaustive-search mode of
operation for the results presented in this work. This mode also mimics real-world
scenarios wherein one is interested in finding the relationships between graph elements
rather than focusing on a single search.
3.4.2 Performance Characterization
We begin by presenting execution times for various graph sizes. Figure 3.3 shows
the average execution time for 1000-node, 2000-node and 4000-node graph problems.
There are 800, 1600, 3200 unique-labels and 2000, 4000, 8000 total-labels in the graph
for 1000, 2000 and 4000 nodes, respectively. For this experiment, we kept the ratio of
unique-labels to the number of nodes in the graph constant at 0.8 and the density, or
average number of labels per node, at 2. These values are representative of
real-world problems. However, in later sections, we discuss the impact of changes in
the ratio of unique-labels to the number of nodes and the density of labels in the graph.

Figure 3.3: Execution time for PathFinder
For each problem size, we generate ten different graph problems. Each of the ten
problem graphs is then executed and its performance characteristics captured. To
capture the impact of inherent parallelism in the signature-search operation in the
application, we use various numbers of threads for application execution. In Figure
3.3, the X-axis indicates the number of OpenMP threads used for execution, while
the Y-axis represents the average execution time in seconds on a logarithmic scale.
Each column indicates the average of the execution times over the ten graph problems.
The variation bars on each column indicate the spread of the execution times for the
ten different problem runs between the min and max execution times.
From Figure 3.3, we observe that, with the same input parameters, execution
time varies over 30% depending on the configuration of the graph. As problem size
increases, the increase in runtime is exponential compared to the increase in problem
size. With the increase in the problem size, every search traverses more nodes to
find a signature match, leading to this exponential increase in runtime. Similarly,
as the number of threads available for computation increases from 1 to 16, the average
execution time for all problem sizes decreases, indicating that the search operation ben-
efits from the parallelism provided by additional threads. Even with additional threads,
we observe average speedups of 1.7x, 2.7x, 4.4x and 7.3x for 2, 4, 8, 16 threads, respec-
tively, indicating that we achieve roughly half the ideal speedup per additional thread,
and also that the average speedup gain decreases as the number of threads available
for computation increases.

Figure 3.4: Problem size scaling for PathFinder
Figure 3.5: Density scaling for PathFinder
We now look into how various input parameters of the graph affect the perfor-
mance of the PathFinder miniapp. We first discuss the impact of the number of nodes
in the graph on execution time. Figure 3.4 shows how the execution time varies with
an increase in the graph size. The Y-axis shows the execution time on a log2 scale. We
present results for various numbers of threads used for application execution. Again,
the density was kept constant at 2 and the ratio of unique-labels to the number of
nodes in the graph at 0.8. From the plot, we observe that for various thread counts, as
the problem size grows, the increase in execution time is exponential, indicating that the
amount of computation increases exponentially with problem size, a trait of signature
search algorithms.
Figure 3.5 shows the impact of an increase in the density (average number of labels
per node) of the graph on performance (execution time). For this experiment, we
consider a 1000-node, 800-unique-labels graph, and we vary the density from 2 to 16. We
plot execution time results for various numbers of threads used for computation.
From Figure 3.5, we observe that in every thread execution scenario, the average
execution time decreases as the density of the graph increases. Considering that we
are running PathFinder in exhaustive-search mode, the goal is to find a path between
a pair of nodes containing the searched label-pair. Therefore, as density increases,
there is a higher probability of finding a node containing the searched label early in
the graph traversal. This expedites an individual label-pair search through the graph,
thereby decreasing execution time. Also, we observe that the decrease in runtime
with an increase in density is highest in two-thread execution; this can be attributed
to 2-thread execution being a good fit for this problem size, and we expect that as
problem size increases, higher numbers of threads will benefit execution.
Figure 3.6 plots the impact of an increase in the unique-labels in the graph on
performance. For this experiment, we consider 1000-node, 4000-total-labels graphs,
and we vary the number of unique-labels between 800 and 1600. We plot results for vari-
ous numbers of threads allowed to be used for computation. In this experiment, with
a constant number of nodes and total-labels in the graph and executing in exhaustive-
search mode, the memory footprint of the graph remains constant; only the search
space increases with the higher unique-label count. The total number of label-pair
searches increases from around 800^2 to around 1600^2 as the unique-labels increase from
800 to 1600. We observe that as the number of unique-labels increases, the runtime
increases. The increase in runtime is higher for runs with lower numbers of available
threads for computation. For 16-thread execution, runtime increased by about 4X, while
it increased by over 5X in single-thread execution. The increase in runtime can be
directly correlated with the increased number of searches in the application execution.

Figure 3.6: Unique-label scaling for PathFinder
The performance characterization results presented previously were for test-case-
sized graph problems. For real-world problems, much larger graphs are
expected. To that end, we now examine how the PathFinder miniapp behaves under
strong scaling. For this experiment, we consider 8000-node, 6400-unique-label, and
16000-total-label graph problems. For strong scaling, we analyzed the application
execution running between 1 and 16 threads. Figure 3.7 shows the strong scaling results
for the PathFinder miniapp. In addition to the actual execution results, we plot an
ideal scaling scenario wherein, for each number of threads used for computation, we
estimate the ideal execution time by dividing the single-thread execution time by the
number of threads used during that computation. From the graph, we observe
that the PathFinder miniapp scales well with increases in the computation resources.
As the number of threads used for execution increases, the execution time decreases
and tracks the ideal-time curve, albeit with a fixed offset. The fixed offset is attributed
to the serial nature of the graph build phase as well as the overhead of parallelization
for distributing work among increasing numbers of threads. This offset decreases as
the number of threads available for computation increases, indicating good strong
scaling characteristics.

Figure 3.7: Strong scaling for PathFinder
3.4.3 Cache Characteristics
The behavior and utilization of the memory hierarchy, especially the cache hierarchy,
has a strong impact on the performance of any application. Understanding these
characteristics is important to application developers, as it often indicates where im-
provements can be made to maximize the performance of applications. It becomes
especially imperative to understand this behavior for new graph processing
applications, as their cache utilization behavior is significantly different from that of other
CSE (Computational Scientific and Engineering) applications. Figure 3.8 presents
the L1 cache behavior of the PathFinder miniapp. The Y-axis in Figure 3.8 indi-
cates the percentage of L1 cache accesses which hit in the L1 cache, while the X-axis
represents various numbers of threads used for computation. We consider three differ-
ent problem sizes consisting of 1000-node, 2000-node and 4000-node graph problems
with 0.8*nodes unique-labels and 2*nodes total-labels in the problem graph. For
each problem size, we generate ten random graph problems, and the average cache
hit rate across all graph problems is presented here. For all three problem sizes, we
observe an L1 cache hit ratio between 89-92%. The majority of execution runs across
various numbers of threads have L1 cache hit rates over 90%. Although this indicates
significant L1 cache locality, these hit rates are lower than those observed in other
CSE applications, where L1 cache hit rates of around 95% or higher are common. As the
number of threads used for computation increases, for every problem size, we see a decrease
(albeit minor) in L1 cache hit rates. However, the L1 cache hit rate remains nearly
constant irrespective of problem size.

Figure 3.8: L1 cache Hit Rate for PathFinder
Figure 3.9 presents the L2 cache hit rate for the PathFinder miniapp. We use the
same problem sizes that were described above for gathering L1 cache statistics. As
seen from the graph, the L2 hit rate is around 60% for 1000-node, 47% for 2000-node
and 40% for 4000-node graph problems. As problem size increases, the L2 hit rate de-
creases. This behavior occurs because, when the problem size increases, a longer graph
traversal is required during every search operation. With graph traversals in a larger
graph spanning a larger memory footprint, the pointer chasing is spread out across
a larger address range, leading to poor spatial locality during consecutive memory
accesses. Scenarios such as a miss in the L1 cache causing a miss in the L2 cache are
more likely to occur, leading to node information being fetched from the L3/Last Level
Cache (LLC) or memory. This L2 cache behavior is unique to PathFinder and very
different compared to other CSE applications. With increases in the number of
threads used for computation, we observe slight increases in the L2 cache hit rate, which
can be attributed to a larger combined L2 cache being available across all the threads of the
program. Even with the increase in the number of threads leading to more L2 cache
being available to the program, the inherent nature of graph search means that no
individual L2 cache can hold the entire problem memory footprint, which also leads
to lower L2 cache hit ratios. Further cache studies are needed to determine which
cache architectures/organizations would lead to better L2 cache utilization.

Figure 3.9: L2 cache Hit Rate for PathFinder
In Figure 3.10 we show the combined hit rate for the private caches available to
each core. The Y-axis of Figure 3.10 shows the combined L1-L2 cache hit rate per-
centage, while the X-axis represents the number of threads used for computation. We
observe that the combined hit rate is over 94% for all problem sizes. The combined
hit rate is higher than the individual hit rates observed for the L1 or L2 cache, indicating
that the presence of the L2 cache improves overall hit rates. Given that L1 cache ac-
cesses dominate the aggregated L1 and L2 cache accesses, we observe that the combined
hit rate tracks the L1 cache hit rate more closely. Similar to the L1 cache behavior,
when either the problem size or the number of available threads for computation increases,
the combined cache hit rate decreases.

Figure 3.10: Combined private cache Hit Ratio for PathFinder
We also measured the L3 cache hit ratio and observed nearly a 100% hit rate for
various problem sizes. The L3 cache hit rate drops to around 98% for all problem sizes
when executing with 16 threads. The L3 cache, being the Last Level Cache (LLC), is
shared among all the available threads. As the number of working threads increases,
the average L3 cache size available to each thread decreases. Even though Ivy Bridge
processors have a large shared L3/LLC cache, with 16 threads and assuming uniform
sharing, each thread is limited to around 2.5MB of the L3 cache on the socket. This
leads to an increase in the number of L3 cache misses, thereby decreasing the observed
hit rate. We also measured the memory footprint for various problem sizes running
with various numbers of threads. The memory footprint is shown in Table 3.1.

Table 3.1: Memory footprint (MB) for PathFinder

    # of Threads    1000 nodes    2000 nodes    4000 nodes
    1               23            25            29
    2               40            42            45
    4               72            74            77
    8               136           139           142
    16              201           204           207

From Table 3.1 we observe that the memory footprint increases with the
number of threads used for computation. The memory footprint of the application
increased from about 23MB to about 201MB for 1000-node graphs. The memory footprint
remains nearly constant with the increase in the graph problem size.
3.5 Summary
We exercised PathFinder with different problem sizes and various numbers of threads.
We showed the impact of increases in the number of nodes, unique-labels and total-
labels in the graph on the performance of the application. We also demonstrated and
characterized the scaling properties of the PathFinder miniapp. With cache characteris-
tics and behavior having a strong impact on the performance of any application, we
showed the cache characteristics of the PathFinder miniapp. Both the L1 and L3 caches
have a cache hit ratio of over 90%, whereas the L2 cache hit ratio is substantially
lower (around 60% or below). Also, the L2 cache hit ratio decreases as the problem size
increases, suggesting poor utilization of the L2 cache in this graph-processing miniapp.
Chapter 4
Memory Access Pattern Modeling
Application characterization is an essential step in understanding application behav-
ior and cache utilization. In the previous chapter, we characterized a DOE graph
analytics signature search application. In this work, we focus on studying memory
access patterns across different HPC applications.
4.1 Introduction
Machine performance is increasingly dependent on the performance of the memory
subsystem. With a wide range of applications running on HPC systems, each applica-
tion stresses the memory system in different ways. One way to understand how (differently)
the memory system is stressed is to study the memory reference patterns of the ap-
plications. Quantifying application characteristics enables a better understanding of
architecture bottlenecks in the memory system; optimizing them would lead to better
support for applications. In this work, we focus on characterizing and quantifying
application characteristics in an architecture-neutral way to understand different ap-
plication behaviors and their implications for memory hierarchy designs.
The purpose of this work is twofold: (a) characterize different HPC workloads
using various locality metrics, and (b) quantitatively understand differences in various
classes of application workloads and their impact on the memory subsystem. Spatial
and temporal locality are widely used metrics to understand memory locality. In
this work, along with calculating spatial and temporal locality for the application
execution, we also study how they change over the course of the execution. In addition
to measuring spatial and temporal locality metrics for an application, we also focus
on two additional metrics, namely data intensiveness and the rate of change of data over
application phases. The key application characteristics studied here are:
1. Spatial Locality (reuse of data items near the recently used data item)
2. Temporal Locality (reuse of a data item over time)
3. Data Intensiveness (amount of unique data the application accesses)
4. Data Turn-over (amount of data that changed between two phases)
The remainder of the work is organized as follows. Section 4.2 discusses the background
for measuring the various locality metrics considered here. In section 4.3, we describe
the benchmarks and modeling methodology, and results are presented in section 4.4. In sec-
tion 4.5 we discuss application address-space modeling, and section 4.6 concludes the
work.
4.2 Background and Metrics
In this study, we are focused on understanding the memory reference patterns of different
applications. Traditionally, based on the memory reference pattern, one can calculate
spatial and temporal locality for a given application. While a single score for the
spatial and temporal locality of an application provides insight into how the access
pattern would impact the memory hierarchy, spatial and temporal locality vary signifi-
cantly during different phases of the application. In this work, instead of relying on
computing a single score, we generate a vector of scores for each metric, which is represented as
a histogram. This histogram can then be viewed as a signature of the application
and used by developers. The various metrics considered here are:
Spatial Locality
Spatial locality is the tendency of an application to access memory locations near
the recently referenced ones [BM84, HP03]. The idea here is to find how closely the
references are clustered around recently referenced address-locations. We use a
definition similar to [WMSS05, BW03] to define a spatial locality score based on the
average length of strides. Most of the memory references made by a modern pro-
cessor use interleaved addressing during application execution: application code may
have references of stride 1, while during dynamic execution there can be multiple
other memory references between two stride-1 references. In order to find the near-
est neighbor among previously referenced memory locations, we need to look at a
window of previous memory references. We use a look-back window that
contains previous memory references to determine the nearest neighbor of the current
memory reference location. Similar to [WMSS05], we use a look-back window of size
32. This also improves data collection performance. We then define the stride for the
given memory reference as the difference between the referenced address and its nearest
neighbor in the look-back window. We consider the absolute value of the stride. Using
the stride distribution, we can combine all the strides into a single score using the
following summation:

\[ \mathrm{SpatialScore} = \sum_{i=1}^{\infty} \frac{stride_i}{i} \]

where stride_i is the fraction of memory references whose stride is i. This gener-
ates a normalized score in the range [0,1] that is inversely proportional to the stride
length. A score of 1 indicates all the references are of stride 1 (consecutive accesses),
whereas a score of 0 indicates no strided references. The way spatial locality is defined,
we are only accounting for references near the recently referenced addresses, and
therefore not considering any self-addressing. The self-addressing will be captured in
the temporal locality measurement.
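The sketch below illustrates one way this stride-based score could be computed online
with a 32-entry look-back window. It is a minimal sketch under the assumptions that
addresses are already expressed in 8B-aligned units and that summing 1/stride per
reference and normalizing by the reference count is equivalent to the binned summation
above; the type and function names are illustrative, not the code used in this work.

/* Minimal sketch of the stride-based spatial locality score.
 * The tracker is assumed to be zero-initialized before use. */
#include <stdint.h>

#define WINDOW 32

typedef struct {
    uint64_t window[WINDOW];   /* circular buffer of recent (8B-unit) addresses */
    int      filled, head;
    double   score_sum;        /* running sum of 1/stride */
    uint64_t refs;             /* number of references scored */
} SpatialTracker;

static void spatial_observe(SpatialTracker *t, uint64_t addr)
{
    if (t->filled > 0) {
        uint64_t best = UINT64_MAX;
        for (int i = 0; i < t->filled; i++) {
            uint64_t prev = t->window[i];
            uint64_t d = (addr > prev) ? addr - prev : prev - addr;
            if (d != 0 && d < best)       /* d == 0 is self-addressing: ignored here */
                best = d;
        }
        if (best != UINT64_MAX) {
            t->score_sum += 1.0 / (double)best;
            t->refs++;
        }
    }
    t->window[t->head] = addr;            /* insert current reference into the window */
    t->head = (t->head + 1) % WINDOW;
    if (t->filled < WINDOW) t->filled++;
}

static double spatial_score(const SpatialTracker *t)
{
    return t->refs ? t->score_sum / (double)t->refs : 0.0;
}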
Temporal Locality
Temporal locality is the tendency of an application to reference the same memory
location that was recently referenced [HP03]. Reuse distance is often used as a metric
to quantify temporal locality [YBS05, DRR06, GCB+09]. Reuse distance is the num-
ber of unique memory addresses referenced between two accesses to the same memory
address. The reuse distance for every memory reference is captured, and then this
distribution is combined into a single score. In order to combine the distribution into
a single score, we need to employ a weighting function. Intuitively, the temporal
locality score should decrease as the reuse distance increases. Following [WMSS05], we use a
log scale where each memory reference is weighted by the ratio of its reuse distance to
the maximum reuse distance considered. We use the following equation to aggregate
the reuse distribution into a single value:
\[ \mathrm{TemporalScore} = \sum_{i=0}^{\log_2(MaxRd)} C_i \cdot \frac{\log_2(MaxRd) - \log_2(Rd_i)}{\log_2(MaxRd)} \]

where C_i is the number of references with a reuse distance of Rd_i, and MaxRd is the
largest reuse distance considered. The idea is to use a decreasing weight function as
the reuse distance increases.
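A minimal sketch of this reuse-distance weighting is shown below. It assumes an
LRU-stack formulation in which a reference's depth in the stack equals its reuse
distance, caps the distance at MaxRd = 4096 as in our experiments, and normalizes
the weighted sum by the number of references to keep the score in [0,1]; these
implementation choices and names are illustrative, not the exact code used in this work.

/* Minimal sketch of a reuse-distance-based temporal locality score. */
#include <stdint.h>
#include <math.h>
#include <string.h>

#define MAX_RD 4096

typedef struct {
    uint64_t stack[MAX_RD];   /* LRU stack: stack[0] is most recently used */
    int      depth;
    double   weight_sum;
    uint64_t refs;
} TemporalTracker;

static void temporal_observe(TemporalTracker *t, uint64_t addr)
{
    int pos = -1;
    for (int i = 0; i < t->depth; i++)
        if (t->stack[i] == addr) { pos = i; break; }

    if (pos >= 0) {
        /* Reuse distance = stack depth; immediate reuse (pos 0) gets full weight. */
        double rd = (pos == 0) ? 1.0 : (double)pos;
        t->weight_sum += (log2((double)MAX_RD) - log2(rd)) / log2((double)MAX_RD);
    }
    t->refs++;                /* references beyond MAX_RD contribute weight 0 */

    /* Move addr to the top of the LRU stack. */
    int limit = (pos >= 0) ? pos : ((t->depth < MAX_RD) ? t->depth : MAX_RD - 1);
    memmove(&t->stack[1], &t->stack[0], (size_t)limit * sizeof(uint64_t));
    t->stack[0] = addr;
    if (pos < 0 && t->depth < MAX_RD) t->depth++;
}

static double temporal_score(const TemporalTracker *t)
{
    return t->refs ? t->weight_sum / (double)t->refs : 0.0;
}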
Data Intensiveness
One of the critical parameters that often goes unnoticed during application perfor-
mance analysis is how effectively data is used once referenced. Data intensiveness is
defined as the amount of unique data accessed by the application over an interval.
It equals the memory footprint when the interval considered is the complete ap-
plication execution. In modern processors, data is brought in at cacheline granularity,
typically 64B wide, and not every location (byte) of the cacheline will be accessed in a
given phase; the data intensiveness value therefore indicates the effective utilization of
the data in an application phase. We follow the definition in [MK07] and calculate
data intensiveness by directly measuring the total number of unique bytes accessed
over a fixed number of memory references.
Data Turn-over
This metric measures the change in the data across two consecutive phases. A larger
value indicates that more memory hierarchy resources are required to support the
application. To measure the change, we keep track of the memory references in the
current phase and in the previous phase of measurement. The difference between these
two sets indicates the number of new cachelines referenced. The ratio of this difference
to the number of memory references gives the data turn-over rate.
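The following sketch combines the data intensiveness and data turn-over measurements
at 64B cacheline granularity using 500K-reference phases. The open-addressing hash
sets, the approximation of unique bytes by unique 8B words, and the interpretation of
"number of memory references" as the references within a phase are illustrative
assumptions rather than the exact implementation used in this work.

/* Minimal sketch of per-phase data intensiveness and data turn-over tracking. */
#include <stdint.h>
#include <string.h>

#define PHASE_REFS 500000
#define SET_SLOTS  (1u << 21)   /* open-addressing table; must exceed lines/phase */
#define LINE_SHIFT 6            /* 64B cachelines */

typedef struct {
    uint64_t line[SET_SLOTS];   /* cacheline key (line address + 1); 0 = empty  */
    uint8_t  mask[SET_SLOTS];   /* which 8B words of the line were touched      */
    uint64_t lines;             /* unique cachelines seen in this phase         */
    uint64_t words;             /* unique 8B words seen in this phase           */
} PhaseSet;

static uint64_t find_slot(const PhaseSet *s, uint64_t key)
{
    uint64_t h = (key * 0x9E3779B97F4A7C15ull) & (SET_SLOTS - 1);
    while (s->line[h] != 0 && s->line[h] != key)
        h = (h + 1) & (SET_SLOTS - 1);
    return h;
}

static PhaseSet prev, cur;
static uint64_t refs_in_phase, new_lines;

/* Called once per 8B-aligned memory reference. */
static void phase_observe(uint64_t addr)
{
    uint64_t key = (addr >> LINE_SHIFT) + 1;
    uint8_t  bit = (uint8_t)(1u << ((addr >> 3) & 7));

    uint64_t h = find_slot(&cur, key);
    if (cur.line[h] == 0) {                     /* new cacheline this phase */
        cur.line[h] = key;
        cur.lines++;
        if (prev.line[find_slot(&prev, key)] != key)
            new_lines++;                        /* also new w.r.t. previous phase */
    }
    if (!(cur.mask[h] & bit)) { cur.mask[h] |= bit; cur.words++; }

    if (++refs_in_phase == PHASE_REFS) {
        /* Data intensiveness: unique bytes over bytes of unique cachelines. */
        double intensiveness = (double)(cur.words * 8) /
                               (double)(cur.lines << LINE_SHIFT);
        /* Data turn-over: new cachelines relative to references in the phase. */
        double turn_over = (double)new_lines / (double)refs_in_phase;
        (void)intensiveness; (void)turn_over;   /* record these per-phase scores */

        prev = cur;                             /* roll the phase window */
        memset(&cur, 0, sizeof(cur));
        refs_in_phase = 0; new_lines = 0;
    }
}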
4.3 Modeling
To account for a wide range of cache and memory profiles of HPC workloads, we
considered ten benchmarks from GraphBIG, a graph application suite [NXT+15]; six
benchmarks from the NAS Parallel Benchmarks (NPB) [B+91]; four miniapps from the
Sandia Mantevo suite (CoMD, HPCCG, miniFE, miniGhost) [man]; the RandomAccess
benchmark from the HPC Challenge Benchmark Suite (HPCC) [LBD+06]; and a
synthetic Stride benchmark.
The GraphBIG suite represents different graph operations. Graph operations are
broadly classified into graph traversal, graph update, and graph analytics. Breadth-
First-Search (BFS) and Depth-First-Search (DFS) represent graph traversal, a fun-
damental operation of graph computing. Graph Construction (GCons) constructs a
directed graph with a given number of edges and vertices, while Graph Update (GUp)
deletes a given list of vertices and related edges from the graph, and Topology Morph-
ing (TMorp) generates an undirected graph from a directed-acyclic-graph (DAG). The
graph analytics group focuses on topological analysis and graph path/flow. Short-
est Path (SPath) is for graph path/flow analysis. K-core decomposition (KCore),
Triangle Counting (TC) and Connected Components (CComp) represent different
topological graph analytics, and Degree Centrality (DC) represents social analysis, a
subset of graph analytics, and uses a different algorithm compared to shortest-path.
These ten kernels represent a variety of graph workloads.
The NPB benchmarks are derived from computational fluid dynamics (CFD) applica-
tions and are widely used in the HPC community. We consider one pseudo-application,
BT (Block Tri-diagonal solver), and five computation kernels, namely, Conjugate
Gradient (CG), which has irregular memory access and communication patterns, Em-
barrassingly Parallel (EP), discrete fast Fourier Transform (FT), Multi-Grid (MG)
(memory intensive) and Integer Sorting (IS). Mantevo's CoMD is a proxy for molec-
ular dynamics simulations with short-ranged Lennard-Jones interactions. HPCCG is
a linear solver using the conjugate gradient method. It is characterized by unstruc-
tured data and irregular data communication patterns. MiniFE is also a linear solver,
but it includes all the computational phases; the difference between HPCCG and
MiniFE is that HPCCG generates a synthetic linear system focusing on the sparse
iterative solver. MiniGhost executes the halo exchange pattern of structured and
block-structured finite elements.
RandomAccess [LBD+06] exhibits an extreme memory access pattern and typ-
ically exhibits very low spatial and temporal locality. The Stride benchmark is a
synthetic code in which a large memory region is allocated and then accessed multiple
times such that one 8B data item is accessed every 64B. This synthetic benchmark will
have zero spatial and temporal locality up to a 64B cacheline size, and beyond that its
spatial locality will increase with cacheline size.
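A minimal sketch of such a stride benchmark is shown below; the array size and the
number of sweeps are arbitrary illustrative choices, not the exact parameters used here.

/* Minimal stride benchmark sketch: sweep a large array repeatedly, touching
 * one 8B element out of every 64B cacheline. */
#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    const size_t n = 64 * 1024 * 1024;          /* 512MB of 8B elements */
    const size_t stride = 64 / sizeof(double);  /* one access per 64B line */
    double *a = calloc(n, sizeof(double));
    if (!a) return 1;

    double sum = 0.0;
    for (int sweep = 0; sweep < 8; sweep++)
        for (size_t i = 0; i < n; i += stride)
            sum += a[i];                        /* one 8B load per cacheline */

    printf("%f\n", sum);                        /* keep the loads live */
    free(a);
    return 0;
}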
For the experiments, we used the SST simulator [RHB+11]. We used Ariel, a PIN-
based processor model from SST, running single-threaded programs of the above applica-
tions. The simulation assumes the processor is running at a 2GHz clock frequency,
and Ariel was configured to issue 2 memory operations per cycle with a 256-entry issue
queue. Each of the locality metrics is modeled as a component in SST. For all
the measurements, each memory reference was aligned to an 8B boundary. If any
reference was for more than 8B of data, then that reference was split into multiple re-
quests. For the spatial locality measurement, we evaluated look-back window sizes of
32 and 1024. We did not observe any significant difference in the score; a window
size of 32 improves performance, and we used a look-back window of size 32 for
the experiments. For temporal locality, we evaluated maximum reuse distances of
4096 and 65536. Using a maximum reuse distance of 65536 significantly increased the
execution time, so we decided to use 4096, which provides sufficient resolution to mea-
sure reuse accurately. For both the data intensiveness and data turn-over calculations we
used a 64B cacheline. For data turn-over, instead of tracking individual references, we
tracked cachelines, improving the runtime requirements. To measure and track how
each metric changes over application execution, we used 500K memory references as
a phase. For NPB, we used the Class A problem size. For the GraphBIG kernels, the input
graph was a 1000-node LDBC synthetic graph [PBE13], and for the miniapps, we used
default command-line arguments. For all benchmarks, the simulation was limited to
4 billion memory references, but in many cases the experiment completed before reaching
the limit.
4.4 Results
Figure 4.1: Spatial Locality score for different HPC workloads
Figure 4.1 shows the spatial locality score for the different benchmarks. The spatial lo-
cality score is between 0 and 1. The average value represents the mean across all the
phases of the application. We observe that the GraphBIG kernels have low spatial locality
scores. The scores across different phases vary between 0.15 and 0.47, with an average
score of 0.35. The NPB benchmarks have a wide distribution of scores across different
phases. For example, the FT and MG benchmarks have an average score of about 0.9, which is
intuitively expected as these tend to be compute-bound codes, whereas BT and CG
have an average score of about 0.5. The IS benchmark has the smallest average score, about 0.25,
and EP has the smallest variation. Miniapps also exhibit wide variation in the scores:
some phases have very high spatial locality (close to 1) while others have very low spatial
locality (around 0.1). The averages range between 0.4-0.6 for the miniapps. The RandomAccess and
Stride benchmarks have low spatial locality scores, as expected.
Figure 4.2: Temporal Locality score for different HPC workloads
Figure 4.2 shows the temporal locality scores for the various applications. As expected,
the GraphBIG kernels show poor (low) temporal locality. These kernels
have a tight range between 0.15 and 0.3, with an average score of around 0.25.
Some phases in TC execution show much higher temporal locality compared to
the rest of the benchmarks. The NPB benchmarks also exhibit a wide range of behavior.
BT shows a narrow band of scores, while the FT, IS and MG scores vary significantly across
different phases. As the problem size increases, the array sizes grow and working
sets get bigger, leading to a decrease in temporal locality. Miniapps also exhibit
a wide range of behavior, with CoMD and HPCCG having poor temporal locality
compared to miniFE and miniGhost. The RandomAccess and Stride benchmarks have
low temporal locality scores, as expected.
Figure 4.3: Data Intensiveness score for different HPC workloads
Figure 4.3 shows the range and average data intensiveness scores for the various bench-
marks. Note that data intensiveness is a measure of the amount of unique memory
referenced in a given phase of an application. Ideally, we want the application to
use all the data from the cacheline-sized memory references that were accessed, and
therefore the score should be 1; in real designs, we want this score to be as high as
possible. We observe a wide range of behavior across different classes of applications.
In the GraphBIG kernels, we observe an average score of around 0.45, while the maximum
value in any given phase is 0.8. This indicates the applications did not reference (uti-
lize) nearly half of the data that was accessed in terms of cachelines in the application
phases; i.e., only about 45% of the data is consumed, and the rest of the data is untouched.
In the case of the NPB benchmarks, the average is above 90%, i.e., the majority of the
data items in the cachelines were accessed or referenced by the application. Similarly,
in the case of the miniapps, we observe very high utilization. The RandomAccess and Stride
benchmarks have very low data intensiveness scores, as they typically access only one
data item per cacheline. Another factor to note is that the interval we considered
here is 500K memory references; a smaller value may potentially decrease the data
intensiveness score in a phase, as a data item may not be re-referenced within a smaller
phase duration.

Figure 4.4: Data turn-over score for different HPC workloads
Figure 4.4 shows the data turn-over scores across different phases of the applica-
tions. The data turn-over score is defined as the amount of new data that is referenced
between two consecutive phases. Ideally, we want this score to be as small as possible,
indicating data reuse. A low score also indicates that the data did not change across
phases, thereby reducing the memory bandwidth requirements of the application.
If an application has a very high turn-over score, it would indicate that the appli-
cation incurs many more main memory references and has potentially higher
main memory bandwidth requirements. The GraphBIG kernels show an average turn-over
score of about 0.9, i.e., 90% of the data changes across phases. In many NPB benchmarks, the
lowest score was 0, indicating that no new cachelines were referenced in those phases. The average
score was around 0.75. Among the miniapps, miniGhost has an average score of around 0.97,
indicating that every 500K references, nearly all the data items are replaced. Stride
and RandomAccess have an average score of about 1, as expected.
To summarize the results, the GraphBIG kernels have a different profile compared to
the miniapps and NPB benchmarks. The GraphBIG kernels show lower spatial locality,
temporal locality, and data intensiveness than NPB and the miniapps. Many NPB benchmarks
have a profile similar to the miniapp benchmarks. The NPB benchmarks average
over 0.9 for data intensiveness, indicating that nearly all the data inside the cachelines is
referenced in the application phases.
4.5 Application Address-Space Modeling
In the previous section, we evaluated applications in an architecture-neutral way to
understand the differences between different classes of applications. While analyzing
the application memory reference pattern is helpful, we also need to understand how the
application address-space is utilized during execution. Given that capturing informa-
tion about the data being referenced or evicted from a cache provides insights into its
utilization, we begin by tracking and monitoring each cache reference and eviction
across all the levels of the cache hierarchy during application execution. This approach is
similar to [DD15], wherein an address histogram is generated for the memory addresses
referenced by the processor. In our approach, instead of tracking memory addresses
referenced by the processor, we count the number of times a memory address location
was referenced and evicted in an individual level of the cache hierarchy. Combining
this information across all the memory addresses referenced during application exe-
cution provides information on how the application address-space is being referenced
or evicted from a particular level of the cache hierarchy. Even though the processor
issues a large number of memory requests of various sizes during normal application
execution, such requests typically result in cacheline-sized data movements in the
memory hierarchy, as a cacheline is the smallest granularity of data transfer. There-
fore, we track the movement of data at cacheline granularity. A high reference
count indicates that the memory address location is accessed often in the cache.
Similarly, a high eviction count indicates that the memory address location is get-
ting evicted multiple times from the cache and may potentially point to issues with
cache thrashing. We then analyze the reference and eviction information collected
across all the memory locations for a particular level in the cache hierarchy using our
proposed average cache references per eviction (ACRE) metric, a measure of cache
utilization. ACRE records the average number of times cachelines are referenced be-
fore they are evicted from a particular level of the cache during the application execution.
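The sketch below illustrates per-cacheline reference/eviction tracking and the ACRE
computation for a single cache level. In our experiments this logic lives inside the SST
memHierarchy component; the flat counter arrays and function names here are
illustrative stand-ins sized for a modest address space.

/* Minimal sketch of per-cacheline reference/eviction counting and ACRE. */
#include <stdint.h>

#define LINE_SHIFT 6                     /* 64B cachelines */
#define MAX_LINES  (1u << 22)            /* covers a 256MB address space */

static uint64_t refs[MAX_LINES];         /* references per cacheline */
static uint64_t evicts[MAX_LINES];       /* evictions per cacheline  */

/* Called by the (hypothetical) cache model on every access and eviction. */
static void on_reference(uint64_t addr) { refs[(addr >> LINE_SHIFT) & (MAX_LINES - 1)]++; }
static void on_eviction(uint64_t addr)  { evicts[(addr >> LINE_SHIFT) & (MAX_LINES - 1)]++; }

/* ACRE: average number of references per eviction at this cache level.
 * Aggregated over the whole address space, it reduces to total references
 * over total evictions (roughly, hits over misses). */
static double acre(void)
{
    uint64_t total_refs = 0, total_evicts = 0;
    for (uint64_t i = 0; i < MAX_LINES; i++) {
        total_refs   += refs[i];
        total_evicts += evicts[i];
    }
    return total_evicts ? (double)total_refs / (double)total_evicts : 0.0;
}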
Table 4.2: Cache Performance for various benchmarks

                        References (M)            Hit Ratio (%)
    Benchmark           L1       L2      L3       L1      L2      L3
    Mantevo
      HPCCG             3,871    270     270      93.02   0.00    0.00
      miniAMR           736      24      22       96.80   5.51    21.9
      miniGhost         6,077    334     333      94.51   0.08    0.10
      CoMD              14,679   70      54       99.52   23.29   14.77
    GraphBIG Benchmark Kernels
      BC                113B     10B     9B       90.90   13.72   10.99
      BFS               709      13      12       98.14   6.12    3.41
      CComp             709      13      12       98.13   6.15    3.56
      DC                708      13      12       98.12   6.09    3.66
      DFS               705      13      12       98.13   6.11    3.45
      GUp               703      13      12       98.21   5.05    2.73
      KCore             706      13      12       98.23   5.06    2.70
      SPath             708      14      13       98.03   5.87    3.32
      TMorp             3,368    131     120      96.11   8.14    3.14
      TC                762      15      14       98.10   5.66    3.48
We analyze the reference and eviction information for many benchmarks from
GraphBIG [NXT+15], a graph computing benchmark suite, and various miniapps
from the Mantevo suite [man]. For the GraphBIG benchmarks, we considered a syn-
thetic LDBC graph of 10K nodes [PBE13] as the input graph. The command-line
arguments for the Mantevo miniapps are listed in Table 4.1.

Table 4.1: Mantevo miniapps command-line arguments

    Benchmark      Command-line Arguments
    HPCCG          64 64 64
    miniAMR        (default values)
    CoMD           (default values)
    miniGhost      --nx 100 --ny 100 --nz 100

For the experiments, we used the SST simulator [RHB+11, Lab]; we modified the SST
simulator to allow the above tracking and monitoring to be incorporated into the
memHierarchy component of SST. We used the Ariel processor model from SST running
a single thread. The memory hierarchy consists of L1, L2, L3, and main memory. The
L1/L2/L3 cache sizes are set to 64KB/256KB/512KB, respectively. The access latency for
the L1/L2/L3 cache is 4/10/20 cycles, respectively. All the caches are modeled as 8-way
set-associative caches. The main memory has a flat access latency of 100 cycles.
Table 4.2 shows the number of cache references (in millions) and the cache hit ra-
tios for the various benchmarks under consideration. As seen from the results, all the
benchmarks exhibit L1 hit ratios of over 90%, indicating good cache reuse. For
the GraphBIG benchmarks operating on the LDBC 10K graph, only the Betweenness
Centrality (BC) benchmark kernel has an L1 cache hit ratio of less than 96%. All
the graph computing kernels have low L2 and L3 cache hit ratios.
Even for the BC kernel, which has roughly 1000x more cache references than the other
kernels, L2/L3 cache hit rates are still around 10%, indicating very poor spatial locality
and leading to increased data movement across the cache hierarchy. For the Mantevo
miniapps, the L3 cache hit ratio is larger than the L2 cache hit ratio. Most performance
characterization efforts result in generating and gathering cache hit/miss statistics.
The cache hit/miss statistics summarize how the caches are utilized without provid-
ing any insight into how cache architecture modifications might help. To better understand
the interactions between the application and the processor cache hierarchy, we present
results from the modeling described above.
Figure 4.5 shows the references and evictions observed at the L1, L2 and L3 caches
across the application address-space for the HPCCG miniapp. The application address-
space is plotted on the X-axis, whereas the Y-axis indicates the number of times a par-
ticular memory address location (at a cacheline granularity of 64 bytes) is referenced or
evicted. As seen from Figure 4.5, the number of memory address references and
evictions are nearly identical for the L2/L3 caches, while the L1 cache results indicate that
memory address locations are being referenced more than they are evicted, implying
cache reuse. Across all the levels of the caches, we observed that the starting memory
address locations show orders of magnitude more activity compared to other address
ranges (the compiler kept control information in the starting address ranges). The
memory address locations towards the end of the application address space also showed
increased activity. Similar trends were also observed for the other benchmarks analyzed.

Figure 4.5: References and Evictions for HPCCG miniapp
While the address distribution graph provides valuable information about how the
application address-space is being accessed or evicted from various levels of the cache
hierarchy, Figure 4.6 shows an alternate view of the references and evictions across
the application address space for each level of the cache. We call this alternate view
an address counts histogram. The X-axis of an address counts histogram indicates the
number of times a memory address location is referenced or evicted, while the Y-axis
indicates the number of memory address locations in the application address space
being referenced or evicted that many times. If good cache reuse is being achieved, one
should ideally observe references skewed towards the right side of the address counts
histogram, indicating that the memory address locations are accessed often, while the
evictions should be skewed towards the left side of the address counts histogram,
suggesting that memory address locations are not brought back into the cache again
once evicted.

Figure 4.6: Address Count Histogram for HPCCG miniapp
For example, in Figure 4.6, around 75K memory address locations are evicted
only once and around 75K memory address locations are referenced between 4-8
times. From the histogram for the L1 cache, we can observe that nearly 1.4M address
locations are evicted between 128-256 times, while nearly 1.5M address locations are
referenced at least 512 times during application execution. Given that a large number
of address locations are evicted so many times, this indicates some cache thrashing is
occurring in the L1 cache during the application execution. The number of mem-
ory address locations that are evicted and referenced in the L2/L3 cache is nearly
identical for the HPCCG application, indicating that once a memory location is ref-
erenced, it is evicted without further reuse, which correlates with the poor L2/L3 cache
hit ratios.
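The address counts histogram itself can be built directly from the per-line counters
described earlier by placing each cacheline into a power-of-two bucket, as in the
following sketch; the bucket layout shown is an assumption that matches the 1, 2-4,
4-8, ... style bins discussed above.

/* Minimal sketch of building an address counts histogram: bucket b holds
 * cachelines whose reference (or eviction) count lies in [2^b, 2^(b+1)). */
#include <stdint.h>

#define BUCKETS 32

/* counts[i] is the reference (or eviction) count of cacheline i, as
 * accumulated by the per-line tracking sketch above. */
static void build_histogram(const uint64_t *counts, uint64_t nlines,
                            uint64_t histogram[BUCKETS])
{
    for (int b = 0; b < BUCKETS; b++)
        histogram[b] = 0;

    for (uint64_t i = 0; i < nlines; i++) {
        uint64_t c = counts[i];
        if (c == 0)
            continue;                 /* line never touched at this level */
        int bucket = 0;
        while ((c >>= 1) != 0)        /* floor(log2(count)) */
            bucket++;
        histogram[bucket]++;
    }
}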
Figure 4.7: Mantevo miniapps Address Counts Histogram
In Figure 4.7, we present address counts histograms for the various Mantevo
miniapps. Each of the Mantevo miniapps (CoMD, HPCCG, miniAMR, miniGhost)
has a unique profile at each level of cache. For example, from the L1 cache histogram
for CoMD, we observe that around 105K memory address locations are evicted
only once, whereas around 95K memory address locations are referenced 4-8 times
and 150K locations are referenced over 512 times. The L2 and L3 caches have sim-
ilar reference and eviction count distributions. In HPCCG, both the L2 and L3 caches show
around 1.4M memory address locations being referenced and evicted between 128-256
times, indicating that the data does not fit into the cache and an L1 cache miss usually
results in an L2/L3 cache miss. For the miniAMR miniapp, the L1 cache address
counts histogram is closer to an ideal curve, wherein evictions are mostly skewed to-
wards the left side of the histogram while references are skewed right, indicating that
cachelines are being reused multiple times. The L2/L3 cache profile indicates that
most of the address locations are referenced and evicted around 4 times, a fairly low
number. In miniGhost, we observe that a large majority of address locations are
evicted between 256 and 512 times across all levels of the cache hierarchy, potentially
because of either capacity or contention.

Figure 4.8: GraphBIG Benchmark Kernel Address Counts Histogram
Figure 4.8 shows the address counts histograms for the L1, L2 and L3 caches for the
different kernels from the GraphBIG suite. As seen from the histograms, all the
benchmarks show an identical number of address locations that are referenced and
evicted in the L2/L3 caches, i.e., a miss in the L1 cache most likely causes a cache miss in the L2
and L3 caches and requires a main memory access. This also indicates that the data is
referenced only once before it is evicted from the L2 and L3 caches. Very few memory
locations are either referenced or evicted more than 32 times, an artifact of the graph
applications, wherein, to find relationships, various connected nodes in a graph must
be traversed. The L1 cache address counts histograms follow the expected curve,
with evictions being skewed left (locations evicted few times) while the references
are generally skewed right (locations referenced many times). A key difference be-
tween the address counts histograms observed for the Mantevo miniapps and the GraphBIG
kernels is that in the Mantevo miniapps a majority of the memory address locations are
referenced and evicted over 256 times, while among the GraphBIG kernels only the
Betweenness Centrality (BC) kernel has around 20% of memory address locations
referenced/evicted more than 256 times, and the other GraphBIG kernels have an even
smaller share of such locations.
Table 4.3: Average References per Eviction (ACRE) for various benchmarks

                              ACRE
    Benchmark           L1        L2      L3
    Mantevo miniapps
      CoMD              209.90    1.30    1.17
      HPCCG             14.32     1.00    1.00
      miniAMR           31.27     1.06    1.28
      miniGhost         18.21     1.00    1.00
    GraphBIG Benchmark Kernels
      BC                10.98     1.16    1.13
      BFS               53.72     1.07    1.04
      CComp             53.60     1.07    1.04
      DC                53.27     1.07    1.04
      DFS               53.54     1.07    1.04
      GUp               55.81     1.05    1.03
      KCore             56.54     1.05    1.03
      SPath             50.74     1.06    1.04
      TMorp             25.70     1.09    1.03
      TC                52.49     1.06    1.04
In Table 4.3 we present results for the average number of cache references per cache eviction (ACRE), a metric which can potentially provide insights into cache utilization. A high ACRE value indicates that memory address locations are being reused often while cached, suggesting that caches are effective in minimizing data movement and improving performance for the analyzed application; whereas a low ACRE value indicates that the cache is not being used effectively, and a different cache architecture could potentially minimize data movement and improve performance. When the ACRE value is aggregated over the application address-space, it turns out to be the ratio of cache hits to cache misses. The distribution of ACRE scores over the application address-space is more valuable than a single score for a given application. To simplify comparison of ACRE scores across different applications, we present a single score for each application for each level of cache in the cache hierarchy. From Table 4.3, we observe that the L1 cache ACRE value can vary significantly across applications. For example, CoMD exhibits an ACRE value of 210 while the ACRE value for miniGhost is 18. Most of the GraphBIG benchmarks have L1 ACRE values of 50-56, except BC, which has an L1 ACRE value of 11. The ACRE values for the L2 and L3 caches were observed to be around 1, indicating little cache reuse. We will be performing more experiments with larger dataset sizes and cache configurations to verify whether this low ACRE value is an artifact of the problem sizes tested here or a property of the memory architecture. These ACRE values highlight the differences in how different applications utilize caches, given that all of these benchmarks have very similar L1 cache hit ratios of over 95%. The low ACRE values for the L2 and L3 caches also correlate with the poor L2 and L3 cache hit ratios.
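A minimal sketch of how the ACRE metric can be computed from per-address reference and eviction counts is shown below; the counts are hypothetical placeholders standing in for profiler output, not values taken from the benchmarks above.

#include <stdio.h>

/* Sketch of the ACRE metric: references divided by evictions, per address
 * and aggregated over the address space. All counts are placeholders. */
int main(void) {
    long long refs[]   = { 520, 48, 3, 1024 };  /* hypothetical per-address references */
    long long evicts[] = {   2, 12, 3,    4 };  /* hypothetical per-address evictions  */
    int n = 4;
    long long total_refs = 0, total_evicts = 0;
    for (int i = 0; i < n; i++) {
        printf("address %d: ACRE = %.2f\n", i, (double)refs[i] / evicts[i]);
        total_refs   += refs[i];
        total_evicts += evicts[i];
    }
    /* Aggregated over the address space, this reduces to the hit/miss-style
     * ratio described in the text. */
    printf("aggregate ACRE = %.2f\n", (double)total_refs / total_evicts);
    return 0;
}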
To summarize, we presented how the application address space is referenced and evicted from various levels of the cache hierarchy, along with an alternative view: an address counts histogram, which bins memory address locations based on the number of times they are referenced and evicted in the L1, L2 and L3 caches for various HPC and graph computing applications. Then we presented the ACRE metric and reported the ACRE values for the L1, L2 and L3 caches observed during application execution. Analyzing the address counts histogram and the ACRE values for the caches provides insights into how effectively a particular level of cache is being utilized during application execution, which provides more insight than cache hit/miss ratios and potentially allows us to develop improved memory architectures to better support these scientific HPC and emerging graph computing applications.
4.6 Summary
In this work, we evaluated different benchmarks and applications for different locality metrics. In addition to measuring spatial locality and temporal locality, we focused on two additional metrics: data intensiveness, which measures the amount of unique data accessed by an application, and data turnover, a measure of the rate of change of the memory reference pattern over application phases. We observed that the memory access pattern of the GraphBIG kernels is different from that of the NPB benchmarks and miniapps. GraphBIG kernels show lower spatial and temporal locality and lower data intensiveness. They also exhibit lower variance across different application phases as compared to the NPB kernels.
Next, we presented how the application address-space is referenced and evicted from various levels of the memory hierarchy during application execution for a wide variety of HPC applications, ranging from scientific computation benchmarks to graph processing kernels. To provide more insight into the utilization of memory hierarchy resources during application execution, we showed an alternate view of the application address space reference and eviction information, called an address counts histogram, which bins memory address locations based on the number of times they are referenced and evicted from a particular level of cache. From the modeling of the application address space usage, address-space hot spots are readily identifiable, and known mitigation techniques can then be applied. This modeling enables identification of address-space ranges for different priorities. These metrics aggregately indicate critical architecture requirements for different classes of applications.
Chapter 5
Cache Energy Study
The previous chapter presented the performance characterization of High-Performance Computing (HPC) applications. Extending that work further, in this chapter we study and analyze the energy implications of running applications on HPC systems. A model to measure cache energy for applications running on real machines is described. Using the model, the dynamic and leakage energy components of the total cache energy are estimated and quantified for a wide range of HPC applications. Results are presented for L1 cache energy for a wide range of HPC applications operating in both serial and parallel modes.
5.1 Introduction
Current large-scale high-performance systems are energy constrained. The current generation of high-performance machines achieves an energy efficiency of around 2 PFlops/MW [Top]. With a 25x performance improvement needed over current state-of-the-art systems to reach exascale computing levels, current energy efficiencies would make such systems impossible to operate. Energy efficiencies in these machines need to improve by at least an order of magnitude to make their deployment feasible, as described in an exascale system study report [P. 08].
Energy used in these large systems can be broadly divided into two components, namely dynamic and static energy, where most of the static energy can be attributed to leakage. Historically, the total energy was dominated by the dynamic energy component, whereas the static energy component was negligible. With advances in chip fabrication technology, transistor thresholds have decreased, resulting in increased leakage energy costs [ITR]. The ratio of leakage energy to total energy has increased, leading to higher energy budgets. As a rule of thumb, based on the ITRS roadmap, designers now account for the leakage energy component to be around 30% of total chip energy budgets.
Prior research [P. 08] has identified the memory subsystem as one of the most critical blocks from an energy standpoint. The memory hierarchy consists of on-chip caches, off-chip main memory, disks, and other secondary storage. With a complicated memory hierarchy present in the system, most of the prior work has developed a coarse understanding of memory energy requirements from a system perspective; very little information is available on the energy implications from an application's perspective or on the interactions between various levels of the memory hierarchy during application execution. In general, on-chip caches, being smaller and faster compared with other types of memories, have been extensively used to reduce the ever-increasing gap between the speed of processors and memory, widely known as the memory wall problem. Most modern processors utilize a deep hierarchy of on-chip caches (e.g., separate L1 data and instruction caches, unified L2 caches and a shared L3 cache across all cores in a chip). With a focus on performance, computer architects are continuously finding ways to increase the capacity of on-chip caches. As the amount of cache increases, on-chip caches occupy a large percentage of chip area and accordingly may account for up to 50% [MMC00, Seg01, PYF+00] of total power budgets. Such trends also reinforce the need for more detailed memory system energy studies.
With on-chip caches occupying a large portion of chip area, being a primary level of the memory hierarchy, and with leakage expected to worsen in the transistors used to build on-chip caches, on-chip caches make a strong candidate for an energy study from an application perspective. To study how much energy is used in on-chip caches during application execution, we measure the energy in various levels of the on-chip cache hierarchy and then separate the measured cache energy into dynamic and leakage energy components to understand the severity of the leakage energy in on-chip caches. We evaluate a variety of parallel benchmarks and real-world compute kernels to study the severity of the leakage problem in on-chip caches. We profile application execution and measure various hardware performance counters (HWPC) in the system. Using memory access information from hardware performance counters, we develop a model to estimate energy usage for various levels of caches, which in turn can be used to compute the total memory subsystem energy. Our work makes the following contributions:
We present a model to estimate the energy consumed in on-chip caches when running parallel applications on real-world processors.
We measure cache energy and then separate the energy into its dynamic and leakage components.
We quantify the leakage energy component in the total memory system energy.
We present data for various compiler optimization options for application executions and their cache energy implications.
The model presented here exposes software developers to the energy implications of their applications, thereby potentially enabling them to tune their codes towards a lower energy profile.
In the following section, we present a review of energy estimation techniques in high-performance systems. In section 5.3, we discuss our energy estimation model, followed by a description of the benchmarks and the real-world compute kernels considered for this study. In section 5.4, we present analyses of the benchmark and application executions. The last section concludes the analysis of the data presented in this chapter by highlighting key research contributions.
5.2 Energy Estimation Techniques
In HPC systems with hundreds of thousands of cores at their disposal, limiting energy usage is critical not only for thermal and reliability issues but also simply for containing operating costs. With energy being a barrier to exascale computing, more research is needed to improve the energy efficiency of the various components in the system. The first step in reducing energy consumption is to identify how and where energy dissipation occurs. Effective energy measurement could enable designers to identify energy sinks and replace them with more energy-efficient mechanisms. The two major contributors to computational energy are the processor and the memory subsystem, which consists of the cache hierarchy and main memory. Most prior research has focused on reducing the energy consumed in the processor. Dynamic Voltage and Frequency Scaling (DVFS) techniques, in addition to the use of dark silicon [HFFA11] resources, have reduced core energy. An earlier study [P. 08] projects that in future technologies, getting data to the processor core will be more expensive in terms of energy than the actual computation on that data, and therefore more research is needed to develop techniques for reducing the energy expended in the memory subsystem.
Energy measurement in high-performance systems enables designers to measure the energy consumed in the system and then correlate the measured energy to various components in the system. Energy measurement or estimation techniques can be divided into two categories, namely invasive and non-invasive techniques. Invasive techniques require additional hardware instrumentation, in the form of sensors or power meters, in the system where power/energy is to be measured; information is then gathered and analyzed either through additional data-acquisition hardware or through software APIs. PowerPack [GFS+10] provides an interface for measuring power from external power sources (sensors and digital meters) across the system and also provides API routines for data gathering on remote machines. IBM Power Executive [Pop] allows for monitoring power and energy on IBM blade servers. Both PowerPack and IBM Power Executive require an additional external machine from which the monitoring and data gathering are controlled. The Linux Energy Attribution and Accounting Platform (LEA2P) acquires data on a system using a data-acquisition board [Ryf]. PowerScope uses digital multimeters for offline analysis using statistical sampling. In [IM03], external power meter measurements were combined with hardware performance counters to generate power readings for a modeled CPU. Each of these approaches requires some hardware intervention, and in some cases proprietary hardware; they cannot be easily replicated across other systems and are limited to specific machines.
Non-invasive measurement techniques require no additional hardware instrumentation on the machine where energy is to be measured. These techniques rely on one or more of the following: (a) onboard sensors already present, (b) analytical models tuned for specific machines, and (c) hardware performance counters.
In newer processors, manufacturers have incorporated performance counters that allow for the measurement of a wide variety of events, e.g., cache accesses, types of instructions executed, and network and I/O activities. With the increased availability of these counters and their ability to accurately measure a large variety of CPU events, some of these counters are useful for energy estimation. The Performance API, or PAPI, framework [BD+00] provides low-level cross-platform access to hardware performance counters, allowing measurement of a diverse set of CPU events with a standard interface. Component PAPI [TJYD] extends this scheme to support reading of performance data from different components in the system, enabling energy estimation for such components. Once an application is instrumented with PAPI, it can be used with minimal changes to measure a variety of CPU events and, in turn, be used for energy estimation. Intel Energy Checker SDK [Int10] provides similar functionality
to PAPI but is limited to Intel architectures. Many prior research efforts have used hardware performance counters to model energy and power consumption in various components of the system [IM03, KCK+01, SBM09, GMG+10, TMW94, BJ12, RJ98, WJK+12, HKSW03, JM01, Bel00].
As measuring energy and power consumption for application execution is becoming complex and challenging due to the various power modes in processor cores, processor manufacturers have started providing APIs to access the various available on-chip sensors to enable power profiling. Intel Running Average Power Limit (RAPL) [Int], AMD Application Power Management [AMD], and the NVIDIA Management Library [Nvi] allow access to various status registers which contain power and energy information. This information is generated using proprietary models based on temperature, power profile, leakage models and hardware performance counters.
Many of the above energy estimation models provide only numbers for overall processor energy; they do not provide a breakdown of energy into dynamic and leakage components. For measuring system energy, such models are sufficient, but to gain more insight into exactly where energy is consumed, more is needed. In our case, since we aim to differentiate individual cache energy consumption into dynamic and static energy components, these models prove inadequate. Also, many of the above approaches are tuned for a specific processor model and cannot be directly used with another processor without substantial update and verification of the model for the new processor. Furthermore, many of these techniques rely on modeling processors in simulators; while modeling them in a simulator provides an effective means of measuring and capturing events, not every aspect of the processor can be captured in the simulator, and therefore variances are introduced. By running applications in their native execution environment, we try to minimize some of these variances between modeling and real systems.

Figure 5.1: Cache Energy Estimation Framework
5.3 Energy Estimation Model
Figure 5.1 shows our cache energy estimation model. In order to estimate the energy consumed in various levels of caches during application execution, we first compile the application and instrument the application executable with the PAPI framework. The PAPI framework allows users to select which PAPI performance counters are to be profiled for a particular execution. The instrumented code is then executed multiple times in order to capture and measure various HWPC values. At the end of execution, a report containing information about the profiled HWPC is generated. CACTI [M+] is widely used in computer architecture research for evaluating the energy and performance implications of various advancements, and its energy models have been validated for processor caches. The CACTI simulator is configured to model the caches present in the processors of the system used for the studies. We combine the recorded HWPC information with the cache energy information from CACTI. This energy estimation model essentially multiplies the cache energy per access from CACTI with the number of accesses to a particular level of cache as recorded from instrumented code execution. This approach derives fast and accurate cache energy values for various levels of caches with minimal intervention and substantially lower overheads compared to other schemes. It also provides a way to measure the dynamic and leakage energy components of the total cache energy.
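As an illustration of this flow, the sketch below counts L1 data cache accesses with PAPI and combines them with assumed CACTI-style energy parameters. The per-access dynamic energy and leakage power constants are hypothetical placeholders, not the calibrated values used in this study, and the PAPI_L1_DCA preset is assumed to be available on the target processor.

#include <papi.h>
#include <stdio.h>

#define E_DYN_NJ   0.10   /* assumed dynamic energy per L1 access (nJ), stand-in for a CACTI value */
#define P_LEAK_MW  20.0   /* assumed L1 leakage power (mW), stand-in for a CACTI value             */

int main(void) {
    int evset = PAPI_NULL;
    long long accesses[1];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_L1_DCA);          /* L1 data cache accesses preset */

    long long t0 = PAPI_get_real_usec();
    PAPI_start(evset);

    /* Region of interest: a stand-in kernel (replace with the real application code). */
    volatile double acc = 0.0;
    for (long i = 0; i < 10000000L; i++) acc += (double)i * 0.5;

    PAPI_stop(evset, accesses);
    double secs = (PAPI_get_real_usec() - t0) * 1e-6;

    double e_dyn  = accesses[0] * E_DYN_NJ * 1e-9;   /* accesses x energy per access */
    double e_leak = (P_LEAK_MW * 1e-3) * secs;       /* leakage power x runtime       */
    printf("L1 dynamic: %.6f J  leakage: %.6f J  total: %.6f J\n",
           e_dyn, e_leak, e_dyn + e_leak);
    return 0;
}

The same pattern extends to the L2/L3 caches by adding the corresponding access counters and per-level energy parameters from CACTI.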
Machine Description
We conducted the experiments on Hopper, a Cray XE6 machine located at the National Energy Research Scientific Computing Center (NERSC) facility. Each compute node of Hopper consists of two 12-core AMD `MagnyCours' 2.1 GHz processors. Each core of a 12-core AMD `MagnyCours' processor has a 64 KB private L1 data and instruction cache. Each core also has a unified L2 cache of 512 KB. A 6 MB L3 cache is shared between six cores. We used Intel compilers for compiling codes. We instrumented application code with the PAPI framework (Version 5.1.0.2) for measuring HWPC using Cray's Performance Analysis Tool (Version 6.1.0) [Cra]. We modeled cache energy using CACTI 6.5.
5.4 Results
We begin our discussion by first presenting data for a simple parallel code.
We analyzed PDGEMM, a Parallel implementation of the Double-precision GEneral Matrix Multiplication routine from the Scalable Linear Algebra PACKage (ScaLAPACK) [L.S97, B+96]. This routine is an implementation of the BLAS (Basic Linear Algebra Subprograms) for distributed memory machines. In order to measure cache energy for distributed execution of the application, we ran the matrix multiply operations for multiple sizes using 4, 16, and 64 MPI nodes. We kept the block size parameter constant at 1/8th of the size of the matrix. The matrix multiply code was compiled using an Intel compiler with different optimization configurations and then instrumented with CrayPat to measure accesses to various levels of caches. We compiled the code with three different optimization configurations using the -O1, -O2, and -O3 optimization flags. Four different matrix sizes of 1024, 2048, 4096 and 8192 were considered for this set of experiments. Figure 5.2 shows the distribution of energy consumption in the L1 data cache: the percentage of dynamic energy and leakage energy of the total L1 data cache energy is shown on the primary (left) Y-axis, and the measured total L1 data cache energy (in Joules) on the secondary (right) Y-axis. Figure 5.3 plots the total L1 data cache energy and execution runtime for various configurations.
Figure 5.2: PDGEMM L1 Data Cache Energy Distribution

Figure 5.3: PDGEMM L1 data cache total energy vs application runtime

From Figure 5.2, we observe that the leakage energy component varies between 25%-70% of the total L1 cache energy for the -O1, -O2, and -O3 compiler optimizations. For all compiler optimization configurations, we observe that for a given matrix size, as the number of nodes increases, total energy increases. This suggests that as the
overheads of parallelization increase, there is an increase in cache activity, resulting in an increase in total cache energy. From Figure 5.3, we observe that the runtime of the application execution decreases as the number of nodes increases, but the reduction in the runtime is not large enough to reduce the total energy used for computation. We also observe that the dynamic energy as a percentage of total L1 cache energy increases when the number of nodes in the system increases; this can be attributed to increased cache accesses for inter-process communication. We also calculated L1 data cache energy when the code was compiled using GNU compilers and observed that for the same optimization flags, the Intel compilers resulted in lower L1 data cache energy; therefore, all subsequent sets of experiments were conducted using code compiled with the Intel compilers.
Studying leakage in the context of MPI implementations of applications is important as they represent a very large class of real-world parallel scientific applications. We evaluated a subset of the NAS Parallel Benchmarks version 3.3 [B+91] MPI implementations to study energy distribution. Figure 5.4 shows the dynamic energy for the L1 data cache for various benchmarks compiled with three different compiler optimization options (-O1, -O2, -O3). The values in Figure 5.4 indicate percentages, and the remaining percentages indicate the leakage energy component of total L1 cache energy. We consider the CG, EP, FT, IS, LU, and MG benchmarks from the NPB suite. For each benchmark, we study class A, B, and C problem sizes. For class A and B problem sizes, we vary the number of MPI nodes used for computation from 2 to 32 in powers of 2. For the class C problem size, we vary the MPI nodes used for computation from 4 to 64 in powers of 2. We did not evaluate 64-MPI-node implementations for the class A and B evaluations as we expect significant MPI communication and synchronization overheads for those cases. Figure 5.5 shows the total energy for each benchmark run. Every block in Figures 5.4 and 5.5 is a collection of 15 points representing three different compiler optimization options; each compiler optimization option is run with five different MPI node configurations based on the problem class as described above.
Figure 5.4: Dynamic Energy across various compiler optimizations for NAS Parallel Benchmarks

Figure 5.5: Total L1 data cache energy across various compiler optimizations for NAS Parallel Benchmarks

As seen from Figure 5.4, the dynamic energy accounts for 18% to 61% of total L1 cache energy, which indicates that leakage energy accounts for 39-82%, significantly higher than the overall chip leakage target of 30% in most modern-day chips.
As we observe from Figures 5.4 and 5.5, for the EP benchmark, all compiler optimizations have very similar dynamic energy profiles and total cache energy. The leakage energy is around 77% of total cache energy, and irrespective of the number of MPI nodes, the total cache energy remains nearly constant. This indicates the parallelizability of the EP benchmark, which, as the name suggests, is indeed Embarrassingly Parallel. From the above graph, we observe that the higher the compiler optimization level, the lower the total energy used in the memory hierarchy. Even though compiler optimization improves energy utilization, for a distributed-system environment more work is needed to improve the energy profile, as the majority of the benchmarks show an increase in total cache energy with an increase in the number of nodes used for computation.

Figure 5.6: L1 data cache energy distribution for IS benchmark with -O3 optimization
Figure 5.6 shows the L1 data cache energy for the IS benchmark with -O3 optimization. From the graph, we observe that for each problem size, as the number of computation nodes increases, total energy increases. The increase in total energy is dominated by the increase in the dynamic energy component. The performance of the IS benchmark, which performs integer sorting, is dependent on integer computation speed and random memory access. These random memory accesses, along with distributed processing overheads, cause a substantial increase in memory activity, resulting in a large number of additional cache accesses and leading to an increase in the dynamic energy component. The cache misses also affect application runtime, which in turn increases the leakage energy losses.
Figure 5.7 shows the L1 data cache energy for the LU benchmark with -O3 optimization. We observe that the dynamic energy component varies between 29-49% for all of the problem sizes. The total L1 data cache energy initially increases as the number of nodes increases. This is due to the MPI overheads not being compensated by the advantages of parallelism. As the number of nodes used for computation further increases, the advantages of parallelism outweigh the MPI communication overheads, and the lowest energy point is reached around 16 and 32 MPI nodes for the class B and C problem sizes, respectively. Given that class A problem sizes are very small, the lowest energy point is at 4 MPI nodes, and any further increase in the number of nodes increases energy. For the class C problem size, as the number of MPI nodes increases from 32 to 64, the MPI communication overheads become too large to be compensated by the additional number of cores, leading to additional cache accesses, and the total cache energy increases. If the number of MPI nodes were increased further, we would observe a significant increase in the total cache energy as well as the runtime. This suggests that for a fixed data-set size, there exists a point (number of nodes used for computation) where the total energy is minimum, and our energy estimation tool can help users determine such optimum operating points for their applications.

Figure 5.7: L1 data cache energy distribution for LU benchmark with -O3 optimization

We also performed experiments to understand how the dynamic and leakage energy contributions to total cache energy vary with application scaling. We considered HPCCG, a miniapp from the Mantevo project [Her, man]. HPCCG is a linear solver using the conjugate gradient method. It is characterized by unstructured data and irregular data communication patterns. We compiled the application using the -O1, -O2, and -O3 compiler optimizations. For the strong-scaling experiments, we kept the global problem size constant at 64 x 64 x 1024. For the weak-scaling experiments, the local problem size was kept constant at 64 x 64 x 64. In both cases, the number of nodes used for computation was varied from 2 to 64 in powers of 2. Figure 5.8 shows the strong-scaling energy distribution for the L1 data cache, while Figure 5.9 shows the weak-scaling energy distribution for the L1 data cache. In both Figures 5.8 and 5.9, the left-side Y-axis indicates the percentage of total L1 cache energy for dynamic and leakage energy. The right-side Y-axis shows the total energy (J) and runtime (seconds) for the application execution.

Figure 5.8: Strong Scaling for HPCCG
For the strong-scaling experiments, as the global problem size remains constant irrespective of the number of nodes used for computation, the total energy consumed should remain constant while the runtime should decrease in proportion to the number of nodes used for computation. We observe from Figure 5.8 that as the number of nodes increases there is an increase in total energy, but after 8 nodes, the total energy used in the L1 cache remains nearly constant while the runtime decreases. Similarly, in the weak-scaling experiments, with the local problem size remaining constant, the runtime is expected to remain constant and the total energy is expected to increase in proportion to the number of nodes used for computation.

Figure 5.9: Weak Scaling for HPCCG

From Figure 5.9, we observe that the runtime remains nearly constant as the number of nodes increases, whereas the total energy increases at a proportionally higher rate than the increase in the number of nodes. This proportionally higher rate of increase in the total energy can be attributed to the overheads of MPI synchronization and communication. The HPCCG code is very scalable, and given that additional memory resources aid the irregular memory access pattern, an increase in total cache energy is a concern. Another trend observed from the scaling experiments with HPCCG is that as the number of nodes increases, the percentage of leakage increases. In these cases as well, the leakage energy accounts for 65%-70% of total cache energy.
To summarize the insights gained from the various experiments conducted to quantify the dominance of leakage energy in on-chip caches, we observe that leakage energy is the primary source of energy dissipation in the L1 caches, and it dominates the dynamic energy component. We observe that the leakage component is significantly more than the 30% of total cache energy typically budgeted by designers. Another observation is that even with the various -O1, -O2, and -O3 compiler optimizations, there is not a significant difference between the L1 cache energy profiles obtained across the benchmarks and applications. This suggests that there exist limited opportunities with current compilers for tuning in a way that reduces energy in L1 caches. More work needs to be done in this area to ascertain which knobs can be applied to reduce energy in caches. We also observe that, based on the applications profiled, each of the compiler optimizations fails to yield implementations that efficiently parallelize code; as the number of nodes increases, the MPI synchronization overheads become dominant.
5.5 Summary
With L1 being the smallest and most often accessed of all on-chip caches, the leakage energy of the L1 data and instruction caches accounting for such large percentages of total cache energy is an alarming sign. With the L2 and L3 caches being orders of magnitude larger than the L1 cache in size and less often accessed, their energy consumption is likely to be even more skewed towards leakage energy. Most of the cache leakage-energy reduction techniques proposed in prior research have an associated performance penalty, limiting their use to higher levels of caches where the performance penalty can be hidden in the latency of the cache. Thus, leakage energy in L1 data/instruction caches is likely to remain a concern for system designers.
In this chapter, we presented a model for fast estimation of cache energy in the various levels of on-chip processor caches. Our scheme for energy estimation allows us to estimate dynamic and leakage energy for various levels of caches when applications are running on real machines in their native execution environments. Our framework provides a lightweight model for use by software developers to estimate the energy consumed in the memory subsystem, potentially providing the faster feedback needed for energy tuning.
Using this scheme, we quantified the leakage energy in L1 data caches for various parallel benchmarks and real-world application kernels. Based on our experiments, we observed that the leakage energy component of total cache energy varies from 40% to 80% for L1 data caches across a broad range of applications. We also showed that even compiler tuning options do not have a significant impact on leakage energy reduction. The significance of these percentages emphasizes the importance of addressing leakage as part of the exploration of architectural and circuit-level ideas to overcome the energy barriers to the next generation of extreme-scale computing.
Chapter 6
Cache Architecture Parameter
Characterization
In the previous chapters, we discussed modeling and understanding the memory reference patterns and locality characteristics of applications. In this chapter, we study how cache architecture parameters affect data movement and memory hierarchy behavior. Specifically, we study the sensitivity of system performance to cacheline size and the sensitivity of cache hierarchy energy and data movement to cache associativity.
6.1 Introduction
Application characterization is an essential tool for understanding how applications interact with the underlying computing systems/platforms. It is widely used by computer architects to evaluate the impact of new architecture features. Similarly, it is also used by application designers to study how algorithmic changes affect system performance. HPC system designers characterize applications to understand the bottlenecks of current systems, which in turn aids and guides new architecture innovations [BBC+13, P. 08]. This continuous application characterization - architecture innovation cycle helps improve system performance by optimizing architecture requirements for current and emerging applications.
The cache hierarchy is a critical system element that has significant performance and energy implications [P. 08]. The memory hierarchy consists of register files, multiple levels of on-chip caches, off-chip memory and various levels of secondary and tertiary storage. The performance of an individual cache is a function of its configuration. The main architectural configuration parameters of a cache are capacity, associativity, cacheline or cache-block size, and access latency. Each cache configuration parameter not only impacts the behavior of that cache but also affects the overall cache hierarchy behavior. Some parameters have a larger impact than others. For example, with advances in chip-manufacturing technologies, larger on-chip caches are possible. The access latency of a cache is proportional to its capacity, i.e., a larger cache will have a larger access latency. Similarly, energy per access is also proportional to capacity. Therefore, increasing cache capacity not only impacts its access latency; it also affects system energy, and the resulting tradeoffs must be considered. With a deep cache hierarchy and multiple options per configuration parameter, there exist several viable options for architects to study. Each of these options must be evaluated to identify the optimum performance within set energy and performance budgets. Characterizing architecture configurations requires an understanding not only of these architecture parameters but also of the applications.
The structure of an application significantly impacts cache performance and data movement costs in terms of delay and energy. Each application has a unique cache and memory profile. Scientific applications that are compute bound tend to have higher data locality and a different cache utilization profile compared to newer applications in cloud computing, data analytics and graph processing. These newer applications have large datasets with poor spatial and temporal locality. The goal is to identify architecture configurations that are best suited for HPC workloads. Therefore, one must characterize cache architecture parameters for a wide range of High-Performance Computing (HPC) applications.
The cacheline, or cache-block, size is crucial to the efficiency with which all interactions occur inside the cache hierarchy. It is the fundamental unit of transfer between caches as well as main memory, and data in the caches are stored at this granularity. It also dictates the cache tag size. The cacheline size further dictates the organization of the cache, its layout, and the interconnection network (NoC) inside the processor. Traditionally, processor designers fix the cacheline granularity and then design the cache hierarchy around it. The cacheline size is selected based on a mix of application profiling, which may or may not include HPC workloads, and cache tag overheads. Given this fundamental cache architecture parameter and its associated implications for an individual cache as well as the memory hierarchy, we analyze the effects of changing the cacheline size for various High-Performance Computing applications. We study the use of heterogeneous cacheline sizes across different levels of the cache hierarchy and its implications on cache hit rates and on the data movement between various levels of caches as well as between cache and memory.
Similarly, associativity dictates the degree of freedom for placement of cacheline-sized blocks in each cache set. It affects tag size and determines the number of comparators required for comparing tags and selecting data. By changing the associativity of a cache, one can affect the reuse distribution and thus cache performance. Given the tradeoffs in performance, energy and data movement for an individual cache as well as for the memory hierarchy, we analyze the effect of changing cache associativity for various High-Performance Computing applications. We study how changing the cache set-associativity affects cache hit rates and the data movement between various levels of caches as well as between cache and memory.
The overall chapter is organized as follows. In section 6.2 we discuss the implications of cacheline size and cache associativity for cache performance and discuss related work on reconfigurable or adaptive caches. Section 6.3 details the applications considered for the study and the modeling framework used. Section 6.4 presents the results of the study, and section 6.5 concludes the discussion.
6.2 Background
Cache performance can be defined as a function of the cache architecture parameters, namely capacity, block size, associativity, and replacement policy, and of the incoming application memory reference stream. Cacheline size and associativity are two sides of the cache architecture ABC triangle, namely associativity (A), cacheline or cache-block size (B) and capacity (C) [HP03].
Cacheline size dictates interactions between various levels of caches and cache-memory transfers, and it remains transparent from the application view. From the application standpoint, the whole address space is byte-aligned, whereas inside the cache hierarchy all addresses are aligned to the cacheline-size granularity. All incoming memory requests from the processor are first translated to cacheline boundaries and then processed through the cache hierarchy. Typically, processors use a fixed cacheline size. Applications, especially in the HPC space, exhibit a wide range of spatial and temporal locality, which affects cache utilization. If an application has low spatial locality, consecutive words in each cache block will remain untouched and unused before the block gets evicted from the cache. Even for applications that have high spatial locality, there is no guarantee that all the words in a cache block will be consumed, as other cache organization artifacts may cause these blocks to be evicted before their reuse. This leads to situations where data is brought into the cache and remains unutilized, leading to unnecessary movement of cachelines, causing excess data movement and wasted energy. Reports show that on-chip interconnection networks can contribute up to 30% of total chip power [Dav13], so minimizing data movement energy is essential. A transfer of a cacheline between two levels of caches requires many times the energy of a cache access, a trend which is expected to become even more skewed with rising wire costs, as wires do not scale well in deep-submicron technologies [ITR, LAS+09, HKC10].
Deciding on one fixed cacheline size is complex [KZS+12]. Small cachelines tend to fetch data that will mostly be used. Therefore, they have less unused data, require less time to transfer, have lower data movement overheads, and have lower overall interconnection network overheads. On the other hand, they increase static energy losses in caches (cache tags) as they require more cache-tag entries for a given cache capacity. They fetch smaller-sized data elements from memory and can potentially increase main memory traffic depending on cache utilization. They can also increase spatial prefetching costs in high-spatial-locality applications. Larger or wider cachelines, on the other hand, minimize cache-tag overheads for a given cache capacity. Depending on the application's spatial locality profile, wider cachelines will have more unused words, leading to excess data movement across the hierarchy, and they require wider interconnection network channels to support them and may also increase overall bandwidth requirements. Caches with wider cachelines will perform better if applications have high spatial locality, as nearby (spatial) data will already be present in the cacheline. To summarize, a cache with small (narrow) cachelines tends to have small data movement or data-transfer costs in terms of delay and energy and slightly increased cache access costs compared to caches with wider cachelines. Various works have tried to balance and map cachelines based on applications [KZS+12, WMS09]. Some have tried to filter out unused words [QSP07], others have tried to select cachelines based on the application [VTG+99], but none of them have looked at HPC applications.
Cache associativity dictates the degree of freedom for placement of cacheline-sized blocks in each cache set. It affects tag size and dictates the number of comparators required for comparing tags and selecting data. By changing the associativity of a cache, one can affect the reuse distribution and thus cache performance. A cache with a larger number of ways will reduce cache contention among blocks in a cache set and allow data to remain in the cache longer, potentially allowing for reuse, thereby minimizing data movement. On the other hand, such a cache consumes more energy per access as it performs more tag comparisons for every cache access. Similarly, a cache with a smaller number of ways will have more cache sets while increasing contention among blocks in any given cache set. It consumes less energy per access as it requires fewer tag comparisons. In a way, it trades off contention in one large cache set by distributing it over multiple smaller cache sets. From the energy standpoint, it is always preferable to use a cache with fewer ways compared to one with more ways.
Cache tuning or reconfiguration has been studied from the cache architecture standpoint, since caches account for a large portion of chip area and energy budgets [BABD00, DS02, GRLC08, RAJ00, MCZ14]. These hardware-based techniques rely on monitoring applications to predict the best cache configuration. The monitoring mechanism consumes energy and has performance implications [ZGR13]; it has not been implemented in modern processors. Also, the speculative nature of processor architectures complicates the design of reconfigurable caches. Dynamic cache tuning or reconfiguration has also been proposed to minimize the energy consumed in caches. For example, in [ZVN03] the authors present a technique called way-concatenation to reconfigure the cache architecture in embedded systems. Approaches like V-way [QTP05] vary cache associativity in response to application demands. Set-balancing caches [RFD09] shift cachelines from high-activity sets to low-activity sets to improve cache utilization without reconfiguring caches. Selective cache-ways [Alb99] disables a subset of ways in set-associative caches during periods of low cache activity. In adaptive caches [SSL06], the authors combine two different replacement policies to provide an aggregate policy that performs within a constant factor of the better replacement policy to improve cache utilization. In [TTK14] the authors propose an adaptive runtime system scheme to shut down parts of the cache based on application characteristics. While many of these proposed architectures have been evaluated for general-purpose computing or embedded-systems workloads, only one has focused explicitly on HPC workloads. For HPC workloads, which have different cache and memory characteristics compared to general-purpose workloads [MK07], more evaluation is needed to determine which reconfigurable cache architectures are better suited. Given that HPC system designers are adept at tuning their applications for a given processor architecture, reconfigurable cache architectures would provide additional knobs that HPC system designers can tweak to tune their applications for HPC systems.
6.3 Modeling
To account for the wide range of cache and memory profiles of HPC workloads, we consider ten benchmarks from GraphBIG, a graph application suite [NXT+15], six benchmarks from the NAS Parallel Benchmarks (NPB) [B+91], four from the Sandia Mantevo miniapps suite (CoMD, HPCCG, miniFE, miniGhost) [man, BCH+15], the RandomAccess benchmark from the HPC Challenge Benchmark Suite (HPCC) [LBD+06], the STREAM benchmark [McC95] and the Stride benchmark.
The GraphBIG suite represents different graph operations. Graph operations can be broadly classified into graph traversal, graph update, and graph analytics.
Breadth-First-Search (BFS) and Depth-First-Search (DFS) represent graph traversal, a fundamental operation of graph computing. Graph Construction (GCons) constructs a directed graph with a given number of edges and vertices, while Graph Update (GUp) deletes a given list of vertices and related edges from the graph, and Topology Morphing (TMorp) generates an undirected graph from a directed acyclic graph (DAG). The graph analytics group focuses on topological analysis and graph path/flow. Shortest Path (SPath) is for graph path/flow analysis. K-core decomposition (KCore), Triangle Counting (TC) and Connected Components (CComp) represent different topological graph analytics, and Degree Centrality (DC) represents social analysis, a subset of graph analytics, and uses a different algorithm compared to shortest-path. These ten kernels represent a variety of graph workloads.
The NPB benchmarks are derived from computational fluid dynamics (CFD) applications and are widely used in the HPC community. We consider one pseudo-application, BT (Block Tri-diagonal solver), and five computation kernels, namely Conjugate Gradient (CG), which has irregular memory access and communication patterns, Embarrassingly Parallel (EP), discrete fast Fourier Transform (FT), Multi-Grid (MG), which is memory intensive, and Integer Sorting (IS). Mantevo's CoMD is a proxy for molecular dynamics simulations with short-ranged Lennard-Jones interactions. HPCCG is a linear solver using the conjugate gradient method. It is characterized by unstructured data and irregular data communication patterns. MiniFE is also a linear solver that includes all the computational phases. The difference between HPCCG and MiniFE is that HPCCG generates a synthetic linear system focusing on the sparse iterative solver. MiniGhost executes the halo exchange pattern of structured and block-structured finite elements. HPCC RandomAccess and STREAM exhibit extremes of memory access patterns and typically exhibit very low spatial and temporal locality. The Stride benchmark is a synthetic code in which a large memory region is allocated and then accessed multiple times over such that one 8B data element is accessed in every 64B. This synthetic benchmark will have zero spatial and temporal locality up to a 64B cacheline size, and after that its spatial locality will increase with cacheline size.
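A minimal sketch of the Stride microbenchmark described above follows; the 1 GiB buffer size and the sweep count are illustrative assumptions, while the access pattern (one 8 B word touched in every 64 B) follows the description.

#include <stdint.h>
#include <stdlib.h>

#define BUF_BYTES  (1ULL << 30)              /* assumed 1 GiB working set            */
#define STRIDE     (64 / sizeof(uint64_t))   /* touch one 8 B word per 64 B          */
#define SWEEPS     16                        /* assumed number of passes             */

int main(void) {
    size_t n = BUF_BYTES / sizeof(uint64_t);
    uint64_t *buf = calloc(n, sizeof(uint64_t));
    if (!buf) return 1;
    volatile uint64_t sink = 0;              /* keep reads from being optimized away */
    for (int s = 0; s < SWEEPS; s++)
        for (size_t i = 0; i < n; i += STRIDE)
            sink += buf[i];                  /* 8 B used out of every 64 B fetched   */
    free(buf);
    return (int)(sink & 1);
}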
For the experiments, we used the SST simulator [RHB+11, Lab]. We used Ariel, a PIN-based processor model from SST, running single-threaded programs of the above applications. The simulation assumes the processor is running at a 2 GHz clock frequency, and Ariel was configured to issue 2 memory operations per cycle with a 256-entry issue queue. The memory hierarchy model consists of L1/L2 and L3 caches followed by main memory. The L1/L2/L3 cache sizes were set to 64KB, 256KB, and 512KB, respectively. All caches were modeled as 8-way set-associative caches. The access latency was 1/10/20 cycles for the L1, L2, and L3 cache, respectively. The main memory was assumed to have a flat access latency of 100 cycles for a cacheline. While accurately modeling access latencies of caches/memory provides an ability to compare program execution time, this work is focused on estimating data movement costs in terms of cache events. For NPB we used the Class A problem size. For the GraphBIG kernels, the input graph was a 1000-node LDBC synthetic graph [PBE13]. For the miniapps, we used default command-line arguments. For all benchmarks, the simulation was limited to 4 billion memory references, but in many cases the experiment completed before the limit. To study how varying cacheline size affects the performance of HPC workloads, we conducted sensitivity studies using 16B, 32B, 64B, 128B, and 256B cacheline sizes.
Most current processors use 64B cachelines. A decade ago, a 32B cacheline was the preferred choice for processor architectures [DKM+12]. For the cacheline study, we modified the cacheline size while keeping all other parameters constant for both Ariel and the cache/memory components. To study how different set-associativities affect the performance of HPC workloads, we conducted sensitivity studies with 2-way, 4-way, and 8-way set-associative L1 caches. We varied the L1 cache associativity parameter while keeping all other L1 cache parameters (size, cacheline size) constant.
6.4 Results
We first discuss results for the use of a heterogeneous-cacheline-size cache hierarchy and then discuss results for the cache-associativity sensitivity study.
6.4.1 Cacheline size Sensitivity Study
For each of the results shown below, each benchmark group consists of 12 different L1, L2, and L3 cacheline size combinations. All other cache parameters were kept constant. The combinations considered are listed in Table 6.1.
Table 6.1: Cacheline sizes (B) for different cache configurations

Configuration    L1    L2    L3
config1          16    16    16
config2          32    32    32
config3          64    64    64
config4         128   128   128
config5         256   256   256
config6          16    32    64
config7          16    64   128
config8          16    32   128
config9          16    64   256
config10         32    64   128
config11         32    64   256
config12         64   128   256

Figure 6.1: Cache Hit Ratios for various cache hierarchy configurations

Figure 6.1 shows the cache hit rates for the L1, L2, and L3 caches. For each benchmark, there are 12 points on the graph: the first five are fixed-cacheline-size cache configurations and the remaining seven are heterogeneous-cacheline-size cache hierarchy configurations. From Figure 6.1, we observe that the cache hit rate increases as the cacheline size increases for all the applications. We observe a larger increase in the L1 cache hit rate when the L1 cacheline size is increased from 16B to 32B than when the cacheline size is changed from 128B to 256B. The L2 and L3 cache hit rates also show significant increases as cacheline sizes increase. The aggregate data shows that with a larger cacheline size there are more opportunities for cache reuse, which leads to better cache utilization and an increased cache hit rate. The GraphBIG kernels have poor L2/L3 cache hit rates; their access patterns and limited spatial locality limit efficient use of larger cachelines. We observe different behavior for the NPB benchmarks compared to the GraphBIG kernels. Additionally, each NPB benchmark shows a different behavior for the heterogeneous-cacheline-size cache configurations. For the BT benchmark, as the L1 cacheline size increases, the L1 hit rate also increases. Compared to the fixed-cacheline-size cache hierarchy configurations, config7 and config9 show significantly higher L2 and L3 cache hit rates. A 16B/64B/256B L1, L2, and L3 cacheline size configuration shows the best cache hit rate. All heterogeneous cacheline configurations show improved cache hit ratios as compared to the fixed-cacheline-size cache configurations. In the case of the STREAM and Stride benchmarks, which were configured for a 64B cacheline size, larger cacheline sizes result in increased locality and lead to some cache
reuse. In all the miniapps, we observed that using a larger cacheline increased the L1 cache hit rate, indicating better reuse with a larger cacheline size. With smaller cachelines, reuse did not occur due to cache organization artifacts, and cachelines were evicted before potential reuse, leading to re-fetching of lines from lower-level caches. A fixed cacheline size across all levels of the cache hierarchy results in poor L2/L3 cache utilization. Using heterogeneous cacheline sizes for the L1 and L2/L3 caches results in better L2/L3 utilization. A cache configuration with increasing cacheline sizes shows the best hit rates for all the benchmarks. Using heterogeneous cacheline sizes across the different levels of the cache hierarchy, the L2 and L3 cache hit rates improve by over 20%. Using heterogeneous cacheline sizes across different levels of cache, one can effectively improve cache utilization for the lower-level (L2/L3) caches. To summarize, from the cache hit ratios observed across different applications, the L1 cache hit rate increases as the cacheline size is increased. Instead of using a fixed cacheline size across all levels of the cache hierarchy, heterogeneous (potentially increasing) cacheline sizes across the cache hierarchy allow us to increase L2/L3 cache utilization significantly.
While the cache hit rate provides one quantifiable metric of evaluation, we must examine other metrics to understand other implications of cacheline size changes. We now focus on the main memory traffic that is generated across the different cache configurations.
Figure 6.2: Main Memory Requests (a) Summary (b) All configurations

In Figure 6.2 we plot the normalized main memory traffic for all the benchmarks. In the first part of Figure 6.2 we show the distribution of memory requests, and in the second part we show the traffic generated across all the configurations. All the traffic is normalized to the 64B fixed-cacheline configuration. We observe that as the cacheline size increases, the total number of main memory requests decreases. This trend of decreasing main memory requests with increasing cacheline size is expected, as each request fetches/updates a larger amount of data to/from memory. In an ideal scenario, we expect the number of main memory requests to halve (reduce by a factor of 2) for a doubling of cacheline size. If caches are not efficiently used, then for every 2x increase in cacheline size, we should observe a smaller decrease in the number of requests. As observed from Figure 6.2, the fixed 16B cache configuration generates the highest memory traffic; it varies between 2-4.5x for the benchmarks considered. In an ideal scenario, one would expect the fixed 256B cacheline configuration to result in the lowest number of main memory requests. In most of the applications, we
observed that config9 and config11 report lower main memory requests than the fixed 64B cache configuration. In the GraphBIG kernels, we observed a decrease in the number of main memory requests, albeit a smaller one than observed for the miniapps and NPB benchmarks. We observed a 44% reduction in main memory requests in the best configuration. This indicates that using a larger cacheline size for graph kernels may not effectively reduce main memory traffic. In the NPB benchmarks, we observe that the best cache configurations result in a ~78% reduction in the main memory requests compared to the baseline of the fixed 64B cache configuration. The results indicate that using
a heterogeneous-cacheline-size cache hierarchy configuration reduces main memory requests. Similar behavior is observed for the miniapps. HPCC RandomAccess, which inherently has very poor cache utilization, shows no significant decrease in the number of main memory requests. The STREAM and Stride benchmarks were configured with a 64B cacheline size; we observe that their main memory requests decrease only for the 128B and 256B cacheline configurations.

Figure 6.3: Total Main Memory Traffic (a) Summary (b) All configurations
While the number of memory transactions is reduced, we should also examine
the total amount of data that was transferred to the main memory, as the energy consumption is proportional to the amount of data moved. Figure 6.3 presents the normalized total data transferred to memory across all the cache configurations. In the first part, we summarize the distribution across all configurations, while in the second part we show the actual distribution for all the cache configurations. All the values are normalized to the fixed 64B cacheline cache configuration. In an ideal scenario, we want the data transferred between the cache hierarchy and memory to remain constant irrespective of cacheline size or cache configuration. Given that, when using a smaller cacheline size, one fetches only the data that is required from memory, and smaller data transfers do not allow for much data reuse, we achieve the lowest amount of data transfer from memory when using the smallest possible cacheline size for the caches. In most benchmarks, we observed that a fixed 16B cacheline across all levels of caches results in the smallest amount of data transferred to memory. In the CG, FT, HPCCG and CoMD benchmarks, we observed that heterogeneous-cacheline-size cache hierarchy configurations achieve the lowest data transfer to memory. A wider L3 cacheline size means that a larger chunk of data is transferred to/from the main memory. If all the data elements are not utilized during execution, this results in wasted data transfer and increases total data movement to memory. We observed that for the GraphBIG kernels, configurations with an L3 cacheline size of 128B and 256B resulted in, on average, 0.5x and 1.5x extra data transfer to the main memory. Using a 16B cacheline across all levels of the caches results in, on average, a 47% reduction in the data transferred to the main memory as compared to the baseline 64B fixed configuration. While a 16B cacheline size is good for the amount of data transferred, it results in an increased number of memory operations, so overall energy increases.
In the NPB benchmarks, across different configurations, we observed 0.3-1.3x more data transferred to the main memory as compared to the baseline. Only the IS benchmark with a 256B fixed cacheline size resulted in 4.9x data transfer to main memory. Using heterogeneous cacheline sizes across different levels of caches, HPCCG and CoMD show significantly less data transfer as compared to the baseline configuration. In the case of miniFE, all the different cacheline configurations resulted in nearly similar data transfer to the main memory. For the miniAMR and miniGhost benchmarks, the fixed-cacheline configurations performed better on the data-transfer metric.
Figure 6.4: Total Cache Data Movement (a) Summary (b) All configurations

Next, we look at the total data movement that occurs in the cache hierarchy. We define cache data movement as the aggregate of all the data transfers that occur between the various levels of caches. We compute this by adding the cache misses and cache writebacks for each cache across the cache hierarchy. We need to account for cache writebacks because any modified cacheline that gets evicted requires a writeback to the next-level cache or to main memory. Total cache data movement is normalized to the 64B cacheline results. Figure 6.4 shows the total cache data movement across the different configurations. In the first part of Figure 6.4 we summarize results across all cache hierarchy configurations, while the second part shows the actual distribution. We observe that for GraphBIG kernels, as the cacheline size increases, data movement across the cache hierarchy increases correspondingly. Even though cache misses decrease, each miss requires a larger data transfer from the lower-level cache, and therefore the total amount of data moved increases. Using heterogeneous cacheline sizes across the different levels of caches allows a trade-off between total traffic and cache misses. We observe a wide range of behaviors in the total cache data movement. Recall that for the CG benchmark we observed a decrease in main memory traffic as the cacheline size increased. Here, we observe an increase in the total cache traffic as the cacheline size increases from 16B to 256B. For 256B cachelines, we observed a 3x increase in the total cache data movement compared to the 64B case. By increasing the cacheline size, we shifted some of the off-chip main memory traffic to remain within the on-chip caches. Given that off-chip main memory access is orders of magnitude more expensive than on-chip cache accesses, this trade-off is beneficial in terms of energy and latency. For the FT benchmark, we observe the lowest cache data movement in config6 and config8. Heterogeneous cacheline size cache hierarchy configurations config6 to config8 result in lower cache data movement compared to the fixed cacheline configurations. For the miniapps, heterogeneous cacheline size configurations are within 20% of the total cache data movement of the baseline design.
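To make the metric concrete, the short sketch below aggregates per-level misses and writebacks into a single byte count, following the definition above. It is a minimal sketch: the counter names and the example values are illustrative placeholders, not output from our simulations.

def cache_data_movement_bytes(stats, line_sizes):
    # stats[level]      -> {"misses": ..., "writebacks": ...}  (hypothetical counters)
    # line_sizes[level] -> cacheline size in bytes at that level
    total = 0
    for level, c in stats.items():
        # every miss and every writeback moves one cacheline of this level's size
        total += (c["misses"] + c["writebacks"]) * line_sizes[level]
    return total

# Illustrative counts for one heterogeneous configuration (L1=16B, L2=64B, L3=256B lines).
# In the study, each configuration's total is then normalized to the fixed 64B-line result.
stats = {"L1": {"misses": 1_200_000, "writebacks": 300_000},
         "L2": {"misses": 400_000, "writebacks": 90_000},
         "L3": {"misses": 150_000, "writebacks": 40_000}}
print(cache_data_movement_bytes(stats, {"L1": 16, "L2": 64, "L3": 256}))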
To summarize, by increasing the cacheline size, we observed that most of the applications show increased L1, L2 and L3 cache hit rates. While the cache hit rate increased, many applications exhibited increased data movement between the various levels of caches as well as between the caches and memory. The GraphBIG application suite exhibits a different profile compared to the ones observed for the NPB benchmarks and miniapps. For applications in the NPB suite, a larger cacheline size resulted in a reduction of main memory traffic. In contrast, GraphBIG applications benefit from a smaller cacheline size from the data-movement standpoint. Using a smaller cacheline reduces the cache traffic (data movement) and increases the main memory requests, while using a larger cacheline size reduces the main memory traffic and causes significantly more data transfers across the cache hierarchy. Heterogeneous cacheline size cache configurations with a smaller L1 cacheline size and a wider (larger) L3 cacheline size would provide the benefits of reducing cache hierarchy data movement while minimizing off-chip main memory traffic. Our results from the study indicate such a scheme would also improve the L2/L3 cache hit ratio while optimizing both on-chip cache hierarchy data movement and off-chip main memory data movement.
Figure 6.5: Cache Hit Ratios for different L1 cache set-associativity configurations
6.4.2 Cache Associativity Sensitivity Study
Figure 6.5 shows the cache hit rates for the L1, L2 and L3 caches for the various benchmarks. L1 cache hit rates are represented using bars (light blue), while L2 (blue) and L3 (orange) are represented as lines. Each benchmark has three data points, for 2-, 4-, and 8-way L1 set-associative caches respectively. From Figure 6.5 we observe that for GraphBIG kernels, the L1 cache hit rate does not change with a change in the L1 cache set-associativity, whereas the L2 cache hit rate decreases with an increase in L1 set-associativity. In the case of the NPB kernels, for the BT benchmark the L2 cache hit rate increases by 10% with an increase in L1 set-associativity, whereas for the CG and IS benchmarks we observe over 10% and 20% decreases, respectively, when using an 8-way set-associative cache. The L1 cache hit rate remains constant across all configurations. In the case of the HPCCG miniapp, as the L1 set-associativity increased from 2-way to 8-way, we observed that the L1 cache hit rate increased by 10%, whereas the L2 cache hit rate decreased significantly, from 65% for a 2-way cache to 25% for a 4-way cache and 0% for an 8-way cache. To summarize the effect on the cache hit rate across different applications, we observed that the L1 cache hit rate did not show any significant change (except for HPCCG) when the L1 cache set-associativity changed from 2-way to 8-way. Changing the L1 set-associativity did cause the L2 cache hit rate to change: for some applications the hit rate increased, while for others we saw a significant (>10%) drop.

Figure 6.6: Normalized L1 cache misses
Figure 6.6 shows the normalized L1 cache misses for the different benchmarks. All L1 cache misses are normalized to the 8-way set-associative cache. As seen from Figure 6.6, for GraphBIG kernels, using a 2-way L1 cache increases L1 misses by 5%. For the BT and FT benchmarks we observe reductions of 12% and 5% respectively, indicating that a 2-way L1 cache configuration is best suited for these applications. A 2-way L1 set-associative cache results in a nearly 6x increase in L1 cache misses for the HPCCG benchmark, indicating that a 2-way cache is not well suited for HPCCG. We observed that for all the miniapps, the 2-way cache configuration resulted in over 2.5x more cache misses as compared to the 8-way cache, indicating that an 8-way cache is most suitable for the miniapps. Benchmarks with poor spatial and temporal locality (Stride, STREAM and HPCC RandomAccess) show no difference in L1 misses across the different cache configurations.

We also studied the effect on the L2 cache misses and observed that changing the L1 cache set-associativity had a minimal (negligible) effect on the number of L2 and L3 cache misses. Similarly, we observed a nominal increase in the overall data movement across the memory hierarchy when using a 2-way cache configuration. The overall data movement was proportional to the L1 cache misses, i.e., in cache configurations where the number of L1 cache misses increased, overall data movement increased, while in configurations where the number of L1 cache misses decreased, overall data movement decreased.

Table 6.2: Energy costs for different cache models

Model     2-way Cache   4-way Cache   8-way Cache
Model 1   0.50          0.75          1.0
Model 2   0.70          0.85          1.0
Model 3   0.80          0.90          1.0
Energy Modeling
In set-associative caches, the energy per access is directly proportional to the number of ways in a cache set. More cache ways require more comparisons for every cache access and hence require more energy. Table 6.2 shows the different energy costs per access considered for the various levels of caches.

We normalize energy costs to the 8-way set-associative cache. The L2 and L3 costs were kept constant at 2.0x and 4.0x of the 8-way L1 cache cost. We also analyzed the effects of changing the L2 and L3 costs and did not find any significant change in the overall system energy.
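As an illustration of how these per-access costs translate into the totals reported next, the sketch below computes a normalized energy figure from per-level access counts and the relative costs of Table 6.2, with L2 and L3 fixed at 2.0x and 4.0x of the 8-way L1 cost. The access counts and function names are hypothetical placeholders, not measurements.

# Sketch of the normalized energy model: energy = sum(accesses * cost per access),
# with per-access costs expressed relative to an 8-way L1 access (Table 6.2).
L1_COST = {"model1": {2: 0.50, 4: 0.75, 8: 1.0},
           "model2": {2: 0.70, 4: 0.85, 8: 1.0},
           "model3": {2: 0.80, 4: 0.90, 8: 1.0}}
L2_COST, L3_COST = 2.0, 4.0   # kept constant relative to the 8-way L1 cost

def cache_energy(accesses, model, l1_ways):
    # accesses: {"L1": ..., "L2": ..., "L3": ...} -- illustrative counts only
    return (accesses["L1"] * L1_COST[model][l1_ways]
            + accesses["L2"] * L2_COST
            + accesses["L3"] * L3_COST)

# Hypothetical counts; the 2-way run sees slightly more L1 misses, hence more L2/L3 accesses.
acc_8way = {"L1": 1_500_000_000, "L2": 90_000_000, "L3": 30_000_000}
acc_2way = {"L1": 1_500_000_000, "L2": 95_000_000, "L3": 31_000_000}
for model in ("model1", "model2", "model3"):
    norm = cache_energy(acc_2way, model, 2) / cache_energy(acc_8way, model, 8)
    print(model, round(norm, 2))   # values below 1.0 mean the 2-way L1 saves energy overall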
Table 6.3 shows the normalized cache energy. The energy for each model is normalized to the 8-way L1 cache configuration. From the results, we can observe that the L1 cache energy dominates the overall energy. In most benchmarks, using a 2-way associative cache saves energy per access, and even though it results in slightly increased cache traffic, it leads to overall energy savings. Only in the case of the miniapps, especially HPCCG, where a 2-way L1 cache results in over 6x more cache misses, does the overall energy exceed that of the 8-way cache. A 4-way cache also offers energy savings for the miniapps as compared to an 8-way cache configuration. Similarly, as the access energy gap between the 2-way, 4-way, and 8-way caches shrinks, the overall energy savings shrink as well.
Table 6.3: Total Energy costs for different models

Benchmark    2-way L1 Cache               4-way L1 Cache
             Model1  Model2  Model3       Model1  Model2  Model3
BFS          0.54    0.72    0.82         0.77    0.86    0.91
CComp        0.54    0.72    0.82         0.77    0.86    0.91
DC           0.54    0.72    0.82         0.77    0.86    0.91
DFS          0.54    0.72    0.82         0.77    0.86    0.91
GCons        0.56    0.74    0.82         0.78    0.87    0.91
GUp          0.55    0.73    0.82         0.77    0.86    0.91
KCore        0.53    0.72    0.81         0.77    0.86    0.91
SPath        0.54    0.72    0.82         0.77    0.86    0.91
TMorp        0.56    0.73    0.82         0.78    0.87    0.91
TC           0.53    0.72    0.81         0.76    0.86    0.91
bt.A         0.52    0.71    0.81         0.76    0.86    0.90
cg.A         0.80    0.89    0.93         0.91    0.96    0.98
ep.A         0.52    0.71    0.81         0.76    0.86    0.90
ft.A         0.62    0.77    0.84         0.81    0.89    0.92
is.A         0.59    0.76    0.85         0.79    0.88    0.92
mg.A         0.61    0.77    0.85         0.81    0.88    0.92
STREAM       0.71    0.83    0.88         0.86    0.91    0.94
stride       0.93    0.96    0.97         0.96    0.98    0.99
HPCC RA      0.83    0.90    0.93         0.91    0.95    0.97
HPCCG        1.51    1.77    1.90         0.99    1.07    1.11
CoMD         0.65    0.90    1.02         0.79    0.89    0.94
miniFE       1.18    1.48    1.63         0.87    0.96    1.00
miniGhost    1.05    1.29    1.42         0.88    0.97    1.02

To summarize, in this section we compared how changing the L1 cache set-associativity affects the performance of different applications and benchmarks. By changing the L1 set-associativity while keeping the cache size constant, we characterized the effect of L1 cache set-associativity on application performance. With a 2-way set-associative cache, we observed that the L1 cache hit ratio decreases slightly as compared to an 8-way set-associative cache. While the cache hit ratio decreased slightly, it resulted in up to 5% additional L1 cache misses. This did not affect overall L2 misses or system data movement. The advantage of using a 2-way cache as compared to an 8-way cache is in the potential energy savings. Even though the number of L1 cache misses increased when using a 2-way cache, it was not enough to increase the overall cache energy budget. With a lower energy cost per access, and L1 accesses dominating overall cache energy, even a 20% savings in per-access energy cost with a 2-way set-associative cache leads to around 20% total energy savings without significantly affecting system performance.
6.4.3 Challenges
With a fixed cacheline size across the cache hierarchy, every transfer between the different caches happens at the same address granularity. The additional metadata (valid, coherence, replacement) for managing caching decisions remains constant and deterministic. In a system with heterogeneous cacheline sizes across the different caches in the cache hierarchy, one would have to re-map incoming addresses to the correct cacheline size granularity (address calculation). Also, if the system supports multiple cacheline sizes, it must be designed to meet worst-case scenarios. Mapping and address calculations become easier if the cacheline size increases as we go deeper into the hierarchy, i.e., if the L2 cacheline size is a multiple of the L1 cacheline size, and similarly for the L3 and L2 cacheline sizes. Another problem that arises with different cacheline sizes is in maintaining coherence. A practical option would be to perform coherence at the smallest cacheline size granularity. Lower-level cache directories (L2 and L3) would need additional bits in their controllers to identify which of the cacheline chunks are present in the upper-level caches. Similarly, changing cache set-associativity during application execution would become complicated, as the virtual-to-physical address translation is dependent on the cache set-associativity. One potential solution is to have multiple or larger tags. On the other hand, if the cache set-associativity is set once at runtime (rather than changed during execution), then there would be no need for multiple or wider tags. Ideally, the cache can be designed in a modular way to support a range of cache-way configurations. While these details might complicate the design of the cache hierarchy, the potential advantages in performance and energy savings outweigh the extra design costs.
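As a concrete illustration of the sub-block bookkeeping mentioned above, the sketch below assumes 64B L1 lines nested inside 256B L3 lines (a power-of-two ratio) and shows how a lower-level directory entry could track which L1-sized chunks of its line are cached above, so that coherence can act at the smallest line granularity. The sizes and structure names are assumptions for illustration, not a proposed hardware design.

L1_LINE = 64    # bytes (assumed upper-level line size)
L3_LINE = 256   # bytes (assumed lower-level line size); a power-of-two multiple of L1_LINE
CHUNKS = L3_LINE // L1_LINE   # L1-sized sub-blocks per L3 line -> one presence bit each

def l3_line_and_chunk(addr):
    # which L3 line the address belongs to, and which L1-sized chunk inside that line
    return addr // L3_LINE, (addr % L3_LINE) // L1_LINE

presence = {}   # L3 line number -> bit-vector of chunks currently cached above

def record_fill(addr):           # an upper-level fill of the L1 line containing addr
    line, chunk = l3_line_and_chunk(addr)
    presence[line] = presence.get(line, 0) | (1 << chunk)

def chunks_to_invalidate(line):  # coherence actions target only the chunks cached above
    bits = presence.get(line, 0)
    return [i for i in range(CHUNKS) if bits & (1 << i)]

record_fill(0x12345)             # example address
print(chunks_to_invalidate(0x12345 // L3_LINE))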
6.5 Summary
In this chapter, we characterized the effects of a heterogeneous cacheline size cache hierarchy and of L1 cache set-associativity on the performance of the cache hierarchy for a wide range of HPC applications. We evaluated various fixed and heterogeneous cacheline size cache hierarchy configurations. We observed that as the cacheline size increased, the cache hit rate also increased. While the L1 cache hit rate increased marginally, the L2/L3 cache hit rates show wide variations. The cache hit rate increased by around 10% for GraphBIG applications. For many NPB applications, the L2 cache hit rate increased by over 30%, with IS showing nearly 80% improvement. We then analyzed the data movement costs in the cache hierarchy by first comparing the number of main memory requests. As expected, the total number of main memory requests decreased with an increase in the cacheline size. Next, we compared the total amount of data transferred between the caches and memory and observed that even though the number of main memory requests decreased, the overall data transferred increased by over 2x for GraphBIG kernels. For NPB kernels, using heterogeneous cacheline sizes for the different levels of the cache hierarchy decreased the total amount of data movement to main memory while resulting in comparable cache data movement. It also improved cache utilization, as the cache hit rate improved for the L2 and L3 caches.
With a 2-way set-associative cache, we observed that the L1 cache hit ratio decreases slightly as compared to an 8-way cache. While the cache hit ratio decreased slightly, it resulted in up to 5% additional L1 cache misses. This did not affect overall L2 misses or system data movement. The advantage of using a 2-way cache as compared to an 8-way cache is in the potential energy savings. Even though L1 cache misses increased when using a 2-way cache, it was not enough to increase the overall cache energy budget. With a lower energy cost per access, and L1 accesses dominating overall cache energy, even a 20% savings in per-access energy cost with a 2-way set-associative cache leads to around 20% total energy savings without significantly affecting system performance.

These experiments show significant variations in application behavior with a change in cacheline size. Using a smaller cacheline reduces the cache traffic (data movement) and increases the main memory requests, while using a larger cacheline size reduces the main memory traffic and causes significantly more data transfers across the cache hierarchy. Heterogeneous cacheline cache configurations with a smaller L1 cacheline size and a wider (larger) L3 cacheline size would provide the benefits of both, reducing cache hierarchy data movement while minimizing off-chip main memory traffic. Similarly, for the many applications and application phases in which a 2-way and an 8-way cache achieve similar performance, one should default to using a 2-way cache configuration, as it requires lower energy; with adaptive L1 caches, this would result in significant energy savings without any performance overheads. The insight gained from this study is that the next generation of processors and memory systems should decouple the data-transfer unit used between the various levels of the cache hierarchy from the one used for cache-memory transfers, so that smaller cachelines reduce on-chip cache traffic for efficient inter-cache communication while larger cacheline sizes are used to transfer data between the last-level cache and main memory. Such a scheme will also help minimize coherence-related overheads without affecting virtual memory management in the processor, as caches are transparent to the programming model.
Chapter 7
Cache Guidance System
In the previous chapter, we studied the implications of using heterogeneous cacheline sizes and cache set-associativity on cache performance and data movement to motivate the need for reconfigurable cache hierarchies in future systems. In this chapter, we discuss a technique that adds application-specific adaptability and reconfigurability to a fixed cache configuration without the need to implement complex dynamically adaptive caches.
7.1 Introduction
Much of the inefficiency in data movement in current HPC systems is due to poor matching between applications and cache characteristics [P. 08, S+09, R. 14]. Every application has a unique cache profile. Caches in general-purpose processors have a fixed configuration, i.e., the cache organization and operation cannot be modified by users. If an application's caching profile cannot be mapped to the available cache architecture, cachelines are moved between the various levels of the cache hierarchy inefficiently, often causing unnecessary data movement and thereby consuming more energy.
With a rigid cache hierarchy, there exist no cache hardware tuning opportunities for hero programmers in the HPC community to better adapt applications to the caches. Application-algorithm and compiler tuning improves the data layout for better caching but provides no optimization knobs to improve performance in the caches. Compiler-based and hardware-based approaches to cache tuning each have their benefits and disadvantages. Through program analysis, system developers can better characterize their code and identify unique application phases. However, there currently exists no mechanism to pass program analysis insights along to individual caches. Similarly, for hardware-based techniques like reconfigurable cache architectures, the overheads in terms of energy and performance make their use complicated. An ideal scenario would be a co-design approach wherein one can leverage the advantages of both software (program analysis insights) and hardware driven approaches (high-speed execution).
With a fixed cache architecture unable to support a wide range of HPC workloads efficiently, and with no mechanism to use software-based program analysis insights, caches continue to remain bottlenecks when running large-scale HPC workloads. In this work, we propose a Cache Guidance System (CGS). The Cache Guidance System utilizes a small user-controlled register that allows users to specify application-specific guidance to an individual level of cache. The cache then uses the passed guidance to improve its caching decisions. This mechanism adds application-specific adaptability and reconfigurability to the cache architecture to better support HPC applications and provides another optimization knob for system designers to improve cache performance.
In section 7.2, work related to cache reconfiguration and compiler-based approaches for improving cache performance is discussed. In section 7.3, we detail the hardware modifications to the cache architecture required to implement the cache guidance mechanism. In section 7.4, we describe the experiment setup and the guidance generation framework for minimizing data movement, and present results from the study. In section 7.5, we summarize the work.
7.2 Related Work
Reconfigurable cache architectures provide a way to tune or adapt cache characteristics dynamically based on application demands. A cache's performance can be defined as a function of its architecture parameters, namely capacity, block size, associativity, and replacement policy, and of the incoming application memory reference stream. A reconfigurable cache allows modification to one or more of these architecture parameters. The effectiveness of reconfigurable caches depends upon (a) identifying when the application phase changes during application execution, and (b) identifying the optimal cache configuration for that phase.

Identifying phases in the application can be done either in software or in hardware. Typically, two different approaches exist for identifying phases in the application. Interval-based approaches divide application execution into fixed-sized windows of either instructions or execution time and then use a function based on the current and previous state to predict the next cache configuration [DS02]. Another approach is to use application code-based intervals, which can have a wide range of instruction and memory reference counts. For example, a loop in the application code can be considered an interval, and a function based on code-based metrics then identifies and predicts cache configurations. Interval-based phase predictions work well on programs with steady phases; predicting phases becomes more difficult for applications with changing demands.
Cache tuning or reconfiguration has been studied both to improve performance and to minimize energy in the caches [GRLC08, RAJ00, MCZ14, BABD00]. These hardware-based techniques rely on monitoring applications to predict the best cache configurations. The speculative nature of processors and the performance and energy overheads associated with the monitoring mechanism also affect their performance [ZGR13]. Dynamic cache tuning or reconfiguration has also been proposed to minimize cache energy. For example, in [ZVN03] the authors present a technique called way-concatenation to reconfigure the cache architecture in embedded systems. Approaches like V-way [QTP05] vary cache associativity in response to application demands. Set-balancing caches [RFD09] shift cachelines from high-activity sets to lower-activity ones to improve cache utilization without reconfiguration. Selective cache-ways [Alb99] disable a subset of ways in set-associative caches during periods of low cache activity to reduce energy. In adaptive caches [SSL06], the authors combine two different replacement policies to provide an aggregate policy that performs within a constant factor of the better of the two policies to improve cache utilization. In [TTK14] the authors propose an adaptive runtime system scheme to shut down parts of the cache based on application characteristics. While these techniques have been shown to either improve performance or minimize energy in embedded-system caches or with general-purpose applications, each of them carries additional hardware overhead. None of the hardware-based techniques allow users to explicitly convey insight or information to caches to bias their decisions.
Software or compiler based approaches have also been studied to improve caching decisions. In [WMRW02], the authors propose adding an evict-me bit to cachelines so that they are evicted instead of the LRU line. The authors develop a compiler algorithm that predicts when to insert these evict-me hints for the architecture. In [SVMW05] the authors propose a compiler algorithm that predicts which cachelines to keep in a set-associative cache; the compiler then inserts hints to the architecture to keep lines in the caches. They discuss the hardware changes required to the cache architecture to allow cache lines to be kept in the cache and present a decay mechanism that invalidates the keep hint for an address. In [JDER01] the authors describe a compiler-based algorithm that determines when to perform software-based replacements. They propose to use the keep and kill instructions available in some architectures to improve performance and worst-case performance predictability. They develop a theoretical framework for automatically adding keep and kill instructions to the program. In [BD05, BD02] the authors discuss how cache hints can be attached to the EPIC architecture. They discuss two types of hints: source hints, which indicate the true latency of an operation and are used by the compiler in instruction scheduling, and target hints, which determine the cache level for data placement to improve cache replacement decisions. In [VLX03, AP], the authors discuss using data cache locking mechanisms to improve predictability for real-time systems. These require explicit management of individual cachelines, can potentially have significant overheads, and are mostly used in specialized processing-element blocks. Many of these compiler based approaches rely on reuse distance, the distance between two consecutive accesses to the same memory element. Reuse distance has been used in program analysis to measure locality in the program. In [SZD04, DZ03] the authors present a method that predicts the locality phases of a program by using a combination of locality profiling and run-time predictions. Reuse distance has also been used to predict cache performance across different cache configurations [ZDD03]. In [GD12, BGBD13] the authors discuss collaborative caching, the use of compiler hints to optimize cache management. Their focus is on improving the cache insertion and replacement policy. Compiler-based approaches like [LMC13] focus on computing reuse distances and quantifying the costs of data locality bottlenecks to improve cache performance. Estimated or predicted reuse distance has also been used in cache replacement decisions in LLCs [KPK07, DZK+12].
Compiler-based approaches typically rely on static analysis of the code, which can vary at runtime, leading to differences between static and runtime decisions. With the reordering that occurs in the processor load-store queues, statically computed reuse distances can differ at runtime and can affect cache performance. Compared to compiler based approaches, our approach differs mainly in four ways. First, our guidance generation is based on dynamic execution. Second, our guidance does not annotate any memory instructions and is for an individual cache; it does not propagate across the cache hierarchy. Third, generated guidance is based on the sensitivity of an address and is not arbitrary in terms of keep or kill. Fourth, our guidance can be generated externally using different program analysis tools and does not require any changes to the compiler. Compilers can leverage our cache guidance system to improve its effectiveness. Also, because guidance is passed only a few times, it minimizes runtime management system overheads.
7.3 Design
A cache needs additional information about address locations to affect caching or replacement decisions. We call this information guidance. There are two different ways in which a cache can receive guidance from the user. First, the generated guidance information can be passed along with every memory instruction from the processor to the memory system using a runtime-system modification. This information, GuidanceInfo (a few bits), is then used by the cache replacement logic in selecting a replacement candidate. A typical memory reference sent from processor to memory includes an address, data, an operation type field (read/write), and other additional control fields. To pass guidance information, the GuidanceInfo field would also need to be sent along with every memory reference. This requires modification to the runtime system and leads to increased runtime activity, as the runtime needs to generate GuidanceInfo for every memory reference. Another problem with this approach is that guidance can only be passed from the processor as an instruction, which means there is no direct access to the different levels of caches in the hierarchy.
Figure 7.1: Cache Design with guidance data-structures

Instead of passing guidance with every individual memory reference, we propose that the runtime system load the guidance information at the beginning of the application phase execution. This mechanism of loading guidance at the beginning has the additional benefit of being able to load guidance into any level of the cache hierarchy. To allow the runtime system to load guidance for an individual cache, every cache implements a small register to store guidance. Using the existing framework of system software-hardware registers, a user or runtime system can effectively load the guidance into a cache at any point in the execution. The set of address guidance that gets loaded into the register is treated as a hash or signature of all the addresses that have guidance. A Bloom filter [Blo70] can use this register as a data-store (DS) or hash and check whether an incoming address is part of the signature in the data-store. Accessing this DS to check for guidance is faster and requires less energy than probing the regular cache structure or tag arrays. When a new address is allocated an entry in a cache (on account of a cache miss), the replacement data structures can access this DS to update their guidance information about the cacheline address. The circuit required to generate the address signature and the comparators required to test for the presence of guidance are purely combinational and do not require any storage. Every tag entry is augmented with a guidance field, a 1-2 bit entry indicating the guidance level. The entire process of testing for the presence of guidance can be performed offline, i.e., it is not on the cache's critical path and does not affect cache access time. Using the dark silicon principle [HFFA11], the whole guidance hardware can be disabled when not in use and consumes energy only when activated. This approach of using a register acting as a data-store to keep track of address guidance is preferred, as it reduces the number of system calls needed. Figure 7.1 shows a simple representation of the cache design with the guidance register.
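The following is a minimal software sketch of this guidance data-store idea: a fixed-width register treated as a Bloom filter over cacheline-aligned addresses, built once when guidance is loaded and probed on cache fills. The register width, the number of hash functions, and the hash choice are illustrative assumptions rather than the hardware implementation.

REG_BITS = 256          # assumed width of the guidance register
LINE = 64               # cacheline size in bytes
NUM_HASHES = 3          # number of hash functions (assumption)

def _hashes(line_addr):
    # simple multiplicative hashes; real hardware would use cheap XOR-folding logic
    for seed in range(NUM_HASHES):
        yield (line_addr * (2654435761 + seed * 40503)) % REG_BITS

def load_guidance(addresses):
    """Build the register contents (the 'signature') from the guided addresses."""
    reg = 0
    for a in addresses:
        for h in _hashes(a // LINE):
            reg |= 1 << h
    return reg

def has_guidance(reg, addr):
    """Test on a cache fill: may report false positives, but never false negatives."""
    return all(reg & (1 << h) for h in _hashes(addr // LINE))

reg = load_guidance([0x7f0012340, 0x7f0012380, 0x7f00ab000])   # hypothetical addresses
print(has_guidance(reg, 0x7f0012344))   # True: same cacheline as a guided address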
Figure 7.2: Guidance Aware LRU Replacement Policy

In addition to incorporating the register to hold user guidance and the Bloom filter logic to test for the presence of an address in the register, a few additional modifications to the cache need to be architected. First, the default cache replacement policy must be augmented to utilize guidance. The guidance is added to each tag-array entry: each cache tag-array entry has a 1-2 bit guidance field indicating the guidance level. The guidance-aware cache replacement policy tries to select a candidate with the lowest guidance; if multiple candidates have the same level of guidance, it reverts to the original replacement policy. In our experiments, we modified the LRU replacement policy to be guidance-aware. The pseudo-code for the guidance-aware LRU replacement policy is presented in Figure 7.2. As shown in Figure 7.2, the replacement candidate search proceeds by first trying to find a cacheline in an invalid state. The search then checks cachelines that have the lowest negative guidance values, followed by replacement candidates from the set of cachelines that have no guidance, and finally from the set of cachelines with positive guidance, looking at increasing levels of positive guidance. During each step, if multiple potential replacement lines are found, LRU weights are used to decide the replacement candidate. Second, a mechanism is added that downgrades the guidance weight of a cacheline that remains in the cache without being accessed for some threshold period. Typically, the downgrade threshold can be kept at some multiple of the cache associativity for a given cache set. The downgrade approach is similar to the decay mechanism in [SVMW05]. Third, a limit is set on the number of cachelines in a given cache set that can have positive guidance. If many cachelines with positive guidance are present in a cache set, then the remaining cachelines with lower guidance weights will get evicted from the cache earlier than expected, thereby affecting overall performance. In a sense, for the cachelines with no guidance, the cache would act as a direct-mapped cache and evict lines before a reuse opportunity. Typically, the limit for lines with positive guidance in a cache set is set to half of the cache set-associativity value. This limit is also a function of the total number of addresses that have positive guidance weights in a given application phase. While the downgrade mechanism and the limit on the number of cachelines with guidance are not strictly required for the cache guidance system, they improve fairness for the other addresses in the cache.
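A software rendering of the guidance-aware LRU search order described above (and shown as pseudo-code in Figure 7.2) is sketched below. The per-set data layout and field names are assumptions; the selection order follows the description: invalid lines first, then negative guidance, then lines with no guidance, then increasing levels of positive guidance, with LRU age breaking ties within each step.

# Each cacheline record in a set: {"valid": bool, "guidance": int, "lru_age": int}
# guidance < 0: evict early, 0: no guidance, > 0: keep longer (field names assumed).

def pick_victim(cache_set):
    # 1) any invalid line is used first
    invalid = [l for l in cache_set if not l["valid"]]
    if invalid:
        return invalid[0]
    # 2) lines with negative guidance, most negative first
    # 3) lines with no guidance
    # 4) lines with positive guidance, lowest positive level first
    for group in (sorted({l["guidance"] for l in cache_set if l["guidance"] < 0}),
                  [0],
                  sorted({l["guidance"] for l in cache_set if l["guidance"] > 0})):
        for level in group:
            candidates = [l for l in cache_set if l["valid"] and l["guidance"] == level]
            if candidates:
                # ties within a guidance level fall back to plain LRU
                return max(candidates, key=lambda l: l["lru_age"])
    return None

ways = [{"valid": True, "guidance": g, "lru_age": a}
        for g, a in [(0, 3), (2, 7), (-1, 1), (0, 5)]]
print(pick_victim(ways))   # evicts the negatively guided line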
7.4 Modeling and Results
7.4.1 Experiment Setup
To evaluate the efficacy of the cache guidance system, we consider a wide range of HPC workloads with diverse cache and memory profiles: ten benchmarks from GraphBIG, a graph application suite [NXT+15]; six benchmarks from the NAS Parallel Benchmarks (NPB) [B+91]; four miniapps from the Sandia Mantevo suite (CoMD, HPCCG, miniFE, miniGhost) [man]; the RandomAccess benchmark from the HPC Challenge Benchmark Suite (HPCC) [LBD+06]; and the STREAM benchmark [McC95].
For the experiments, we used the SST simulator [RHB+11, Lab]. We used Ariel, a PIN-based processor model from SST, running single-threaded versions of the above applications. The simulation assumes the processor is running at a 2GHz clock frequency, and Ariel was configured to issue 2 memory operations per cycle with a 256-entry issue queue. The memory hierarchy consists of L1, L2 and L3 caches followed by main memory. The L1/L2/L3 cache sizes were set to 64KB, 256KB, and 512KB respectively. All caches were modeled as 8-way set-associative caches and used a 64B cacheline size. We implemented the cache guidance system, and guidance was attached to the L1 cache. For NPB we used the Class A problem size. For the GraphBIG kernels, the input graph was a 1000-node LDBC synthetic graph [PBE13]. For the miniapps, we used the default command-line arguments. For all benchmarks, the simulation was limited to 1.5 billion memory references, but in many cases the experiment completed before reaching the limit. During the baseline run, we generated an application trace, and the same trace was then used in guidance generation and in the guidance-enabled simulations to avoid mismatches in the address mapping caused by virtual memory translation.
7.4.2 Guidance Generation
The guidance can be generated via offline application profiling or application monitoring tools. The goal is to identify, as guidance, the addresses that are crucial to the optimization strategy. In this work, with the focus on minimizing data movement across the memory hierarchy, guidance is generated for the L1 cache. Application execution is divided into a set of fixed-size memory reference phases. Excess and unnecessary data movement is caused when a cacheline is referenced, evicted, and then re-referenced, requiring it to be fetched into the L1 cache from deeper in the cache hierarchy, and this loop repeats itself many times over during an application phase. To minimize data movement, we identify addresses that exhibit this trait and keep them in the cache longer. The idea is that by keeping the cachelines of the most active addresses in the cache longer, we minimize data movement for those addresses. By keeping certain cachelines in the cache longer, we negatively impact other cachelines, as they will be evicted early. The key criterion in determining the appropriate number of cachelines that should be part of the guidance set is that they not significantly affect the other remaining cachelines, as that would lead to situations where the data movement is merely transferred from one active cacheline to another. To identify which set of addresses should be selected as guidance, we begin by tracking individual cacheline activity during an application phase. For each application phase, we normalize the activity and select the most active set of addresses as guidance.
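The activity tracking step can be expressed as a small trace-processing pass that splits the reference stream into fixed-size phases and counts references per cacheline-aligned address. The sketch below assumes 64B cachelines and 500K-reference phases (the values used in the experiments described in this section); the trace format is hypothetical.

from collections import Counter

LINE = 64               # cacheline size in bytes (assumed, matching the setup above)
PHASE = 500_000         # memory references per phase (assumed)

def phase_activity(trace):
    """trace: iterable of byte addresses. Yields one Counter of per-cacheline
    reference counts for every phase of PHASE references."""
    counts, n = Counter(), 0
    for addr in trace:
        counts[addr // LINE] += 1
        n += 1
        if n == PHASE:
            yield counts
            counts, n = Counter(), 0
    if n:                # trailing partial phase
        yield counts

# Tiny synthetic example: addresses 0-63 map to line 0, 64-127 to line 1, and so on.
demo = [0, 8, 64, 64, 128, 4, 64]
print(next(phase_activity(demo)))   # lines 0 and 1 referenced 3 times each, line 2 once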
In Figure 7.3 we plot the percentage of total memory references accounted for by the top 5% most-active addresses in a given application phase for the miniapps, NPB benchmarks, and GraphBIG kernels. From the figure, we observe that across all the applications, in many application phases the top 5% of addresses account for most of the memory references, and by keeping them in the guidance set, we can minimize evictions of these cachelines from the L1 cache and reduce total data movement. In some application phases, the top 5% of addresses do not account for a majority of memory references, and keeping them in the guidance set would not improve performance. Similarly, we also modeled the addresses that are least referenced in a given phase to generate negative guidance that evicts them from the cache early. In a majority of the application phases across different applications, we observed that over 5% of the addresses account for only around 1% of the total memory references in a phase.

Figure 7.3: Percentage of total references for the top 5% most active addresses in a phase for (a) miniapps (b) NPB benchmarks (c) GraphBIG kernels
Based on the observations from the activity information of the various applications, we develop two algorithms for generating positive and negative guidance for a given application phase. The pseudo-code for the algorithms is shown in Figure 7.4. The algorithms can be described as follows. For each application phase, first capture the activity information. Activity information is typically recorded per memory location referenced in the phase. As the cacheline is the unit of transfer between caches, all individual addresses must be coalesced into cacheline-aligned addresses. At the end of this step, we have cacheline-aligned address activity information. Next, the weight for each address is computed. The weight can be a function of references, evictions, or other metrics based on the guidance generation model (scheme). For example, if the guidance generation heuristic is to assign a higher weight to a certain address range, then the weight function assigns a higher weight to addresses in that range and a lower weight to other addresses. Once weights are computed and the addresses are sorted according to their weights, the algorithm checks whether guidance should be generated for the given phase. Guidance is only generated if the MAX_ADDR_GSET addresses account for at least THRESHOLD of the total weight; otherwise, guidance is not generated for the phase, and the next phase is evaluated. In the case of positive guidance, we want the selected addresses to account for more than the THRESHOLD of the total weight, whereas for negative guidance generation, we want the selected addresses to account for less than the THRESHOLD of the total weight.

Figure 7.4: Guidance generation algorithms

Table 7.1: Weights for guidance generation

Model    THRESHOLD    MAX_ADDR_GSET
A        70           5
B        75           15
C        75           15
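A compact version of the positive-guidance selection step is sketched below, using per-phase reference counts as weights and the THRESHOLD and MAX_ADDR_GSET percentages of Table 7.1 as parameters. It is an illustrative reconstruction of the flow described above, not the exact pseudo-code of Figure 7.4.

def positive_guidance(phase_counts, threshold_pct, max_addr_pct):
    """phase_counts: {cacheline_address: reference_count} for one phase.
    Returns the guidance set, or an empty set if the phase does not qualify."""
    total_refs = sum(phase_counts.values())
    by_weight = sorted(phase_counts, key=phase_counts.get, reverse=True)
    # at most MAX_ADDR_GSET percent of the distinct cachelines may receive guidance
    limit = max(1, len(by_weight) * max_addr_pct // 100)
    candidates = by_weight[:limit]
    covered = sum(phase_counts[a] for a in candidates)
    # guidance is emitted only when the candidates cover at least THRESHOLD percent
    if covered * 100 < threshold_pct * total_refs:
        return set()
    return set(candidates)

# Model A from Table 7.1: THRESHOLD = 70, MAX_ADDR_GSET = 5 (synthetic counts)
counts = {line: 1000 for line in range(10)}            # 10 hot cachelines...
counts.update({line: 5 for line in range(10, 200)})    # ...and 190 cold ones
print(len(positive_guidance(counts, 70, 5)))           # the 10 hot lines qualify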
7.4.3 Results
In our experiments, to minimize data movement, we select memory references as the activity information. Each application phase is 500K memory references. The addresses are sorted according to their reference counts. For generating positive guidance, we consider the most active addresses, whereas for negative guidance generation, we consider the least-active addresses. Using the algorithms described above, we generate three different models, A, B, and C, with varying THRESHOLD and MAX_ADDR_GSET values. The THRESHOLD and MAX_ADDR_GSET values for the different models are shown in Table 7.1. The THRESHOLD value indicates that the addresses selected for guidance should account for at least the THRESHOLD percentage of the total memory references in the phase, and MAX_ADDR_GSET indicates that the maximum number of addresses that can be added to the guidance set should not exceed the MAX_ADDR_GSET percentage of the total addresses referenced in the phase. In models A and B, we only consider positive guidance, whereas in model C we consider both positive and negative guidance. For generating negative guidance in model C, we select all the address locations that are referenced up to 4 times in the given phase.
Figure 7.5: Total data movement change
Figure 7.5 plots the change in data movement across the cache hierarchy for the different guidance generation models. The Y-axis represents the percentage change over the baseline design without guidance. As seen from Figure 7.5, with model A we achieve up to a 0.4% reduction in the total cache data movement, while many benchmarks show up to a 0.3% increase in data movement. With model B, using more addresses with guidance results in increased total data movement, with only the DC kernel showing a 0.4% improvement. With model C, where both positive and negative guidance are used, we observe up to a 2.5% decrease in the overall data movement for the BT benchmark. In many applications, using guidance results in an increase in the overall data movement. We observe that this is an artifact of using only one guidance generation algorithm across all the phases of application execution; with the use of either multiple different algorithms or the same algorithm with different weights across different application phases, we could further improve overall cache hierarchy data movement.
Figure 7.6: Cache miss improvements in an application phase
In Figure 7.6 we plot, for all the benchmarks considered, the highest percentage reduction in L1 cache misses in an application phase over the baseline design. We observe that while the overall data movement does not show significant change under one guidance generation scheme, individual phases of the applications show significant improvements in reducing cache misses. In miniFE and miniAMR we observe over a 60% reduction in the number of cache misses in a given application phase because of guidance. For the majority of GraphBIG kernels, we observe over a 30% reduction in the number of cache misses in an application phase. Similarly, in many application phases, we observe that using only one guidance generation scheme resulted in an increase in the number of cache misses, leading to excess data movement. This suggests that (a) we need more parameters in the activity set instead of relying on just reference information, and (b) we need multiple guidance generation mechanisms and should select the one that achieves the best performance for a given application phase.
Here, we demonstrated one guidance generation scheme. Generating guidance for different targets allows designers to optimize code for various aspects of application execution. One problem that affects all guidance generation schemes is address translation. The majority of program analysis tools generate insights using only virtual addresses. While many L1 caches support virtual addressing, L2 and L3 caches use only physical addressing. Therefore, the user has to generate guidance in the correct address space. Using the runtime system, one can obtain virtual-to-physical address mapping information, and user guidance address signatures can be translated before loading them into individual caches.
7.5 Summary
To summarize, in this chapter we presented the cache guidance system. We showed that by adding a small user-controlled register, one can effectively improve the performance of a given cache. The cache guidance system allows for better hardware-software co-design. It provides another set of performance tuning knobs for system designers to optimize code effectively. With one guidance generation scheme, we achieved up to a 2.5% reduction in the total cache hierarchy data movement and over a 30% reduction in L1 cache misses in an application phase for a wide range of HPC workloads. Our results show that with effective guidance generation, unnecessary data movement can be reduced. With user-generated guidance targeted at their optimizations, designers can effectively optimize applications without the need to implement complex dynamically adaptive caches.
Chapter 8
Conclusion
As processor architecture evolves to incorporate an increasing number of cores and a wide range of heterogeneous processing elements (domain-specific accelerators), the complexity of memory subsystems will increase because of the inherent data orchestration needs. Managing on-chip data movement is expected to become even more challenging in System-on-Chip (SoC) setups wherein a variety of processing elements share memory resources. Understanding the sources of unnecessary data movement in such systems is of paramount importance. With a better understanding and quantifiable models, it will become easier for system designers to design systems that minimize unnecessary data movement. Compilers and runtime system adaptations can help optimize data movement from the application algorithm standpoint. Similarly, with either static (configured at runtime) or dynamically adaptive reconfigurable caches, computer architects can assist by providing the most efficient cache architecture support to applications to improve performance and energy efficiency. Optimizing data movement in HPC systems requires a hardware-software co-design effort.
8.1 Summary
The focus of this dissertation is to understand the causes of excess data movement and develop techniques that improve data movement. This work focused on a trifecta of approaches to optimize data movement across the cache hierarchy. First, as application behavior plays an important part in dictating the flow of data through the cache hierarchy, we characterize applications. We characterized PathFinder, a proxy application for a class of graph-analytics applications crucial to the DOE HPC community, to understand its execution, scalability challenges, and cache characteristics. We then extended an architecture-independent methodology for quantifying the memory access properties of applications. Using various metrics, we quantified differences in the memory requirements between various graph kernels and DOE HPC benchmarks and application proxies. We then analyzed how the application address space is utilized during execution to identify address-space hotspots and address ranges that require different priorities for insertion and replacement. Second, cache architecture configuration impacts performance, energy and data movement costs. To better understand and quantify energy utilization in the cache hierarchy, we developed a cache energy model, enabling estimation of individual cache energy when executing applications on real-world HPC systems. We then analyzed the implications of using heterogeneous cacheline size cache hierarchies and cache set-associativity on data movement in the cache hierarchy to motivate and quantify the need for better support for reconfigurable cache hierarchies in future systems. Third, we developed a cache guidance system framework that adds application-specific adaptability and reconfigurability to a fixed cache configuration without the need to implement complex dynamically adaptive caches. Using the framework, system designers can pass program analysis insights as hints to an individual cache to bias caching decisions, and we evaluated this framework for minimizing data movement.
8.2 Looking Forward
Exciting times lie ahead for the computing industry with the widespread adoption of a wide range of heterogeneous processing elements (domain-specific architectures). Every processing element accesses memory differently, and memory demands often vary significantly. The data orchestration for mapping data to specific processing elements is becoming a critical issue in these systems, providing ample research opportunities for designers to optimize data movement. In this section, I try to identify some of the challenges in this area based on my research experience.
The first issue is the programmers' view of memory. Programmers view memory as monolithic, i.e., they consider caches and main memory as the same. Caches and main memory have significantly different latencies, access energy costs and capacities. While the monolithic view simplifies the programming model and provides an abstraction to users, many of the software optimizations that improve main memory utilization do not translate to efficient cache hierarchy utilization, leading to excess and unnecessary data movement. A more nuanced view of memory will help optimize data layout transformations for the various levels of the cache and memory hierarchy.
Second, excess and unnecessary data movement is inherently caused by the mismatch between applications and architectures, which requires a hardware-software co-design effort. The hardware needs to provide tuning knobs for software to optimize its codes. The hardware tuning knobs can include different cacheline sizes, cache set-associativity, replacement policies to select from, and the ability to turn hardware coherence schemes on or off. The software can select a set of these cache parameter options at runtime to improve cache behavior for its applications. The implementation of such cache hierarchies will require solving complex design issues and designing for worst-case resource requirements.
Third, more tools are needed that can help track and report the data movement that occurs in the cache hierarchy back to the application code and provide a detailed report or visual feedback. Without tracking cache utilization back to the application code, correlating cache hierarchy utilization to a specific section of application code becomes an impossible task for application developers. Additionally, many abstractions that exist in the programming model obfuscate optimization opportunities.
While these challenges might seem complex, they provide a clear direction toward research opportunities to improve the current state of the art in optimizing data movement from both the application and hardware architecture perspectives.
Bibliography
[Alb99] D. H. Albonesi. Selective cache ways: on-demand cache resource allocation. In MICRO-32: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pages 248-259, 1999.

[AMD] AMD. AMD Family 15h Processor BIOS and Kernel Developer Guide, 2011.

[AP] Alexis Arnaud and Isabelle Puaut. Dynamic instruction cache locking in hard real-time systems. In RTNS.

[B+91] D. H. Bailey et al. The NAS Parallel Benchmarks. Technical report, The International Journal of Supercomputer Applications, 1991.

[B+96] L. S. Blackford et al. ScaLAPACK: A Portable Linear Algebra Library for Distributed Memory Computers - Design Issues and Performance. In Supercomputing, 1996. Proceedings of the 1996 ACM/IEEE Conference on, pages 5-5, 1996.

[BABD00] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-33 2000, pages 245-257, 2000.

[BBC+13] P. Balaprakash, D. Buntinas, A. Chan, A. Guha, R. Gupta, S. H. K. Narayanan, A. A. Chien, P. Hovland, and B. Norris. Exascale workload characterization and architecture implications. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 120-121, April 2013.

[BC11] Shekhar Borkar and Andrew A. Chien. The future of microprocessors. Commun. ACM, 54(5):67-77, May 2011.
[BCD+12] R.F. Barrett, P.S. Crozier, D.W. Doerfler, S.D. Hammond, M.A. Heroux, P.T. Lin, H.K. Thornquist, T.G. Trucano, and C.T. Vaughan. Summary of Work for ASC L2 Milestone 4465: Characterize the Role of the Mini-Application in Predicting Key Performance Characteristics of Real Applications. Technical Report SAND2012-4667, Sandia National Laboratories, 2012.

[BCGM07] M.T. Bohr, R.S. Chau, T. Ghani, and K. Mistry. The high-k solution. Spectrum, IEEE, 44(10):29-35, Oct 2007.

[BCH+15] R.F. Barrett, P.S. Crozier, M.A. Heroux, P.T. Lin, H.K. Thornquist, T.G. Trucano, and C.T. Vaughan. Assessing the Validity of the Role of Mini-Applications in Predicting Key Performance Characteristics of Scientific and Engineering Applications. Journal of Parallel and Distributed Computing, 75:107-122, 2015.

[BD+00] S. Browne, J. Dongarra, et al. A Portable Programming Interface for Performance Evaluation on Modern Processors. The Int. Journal of High Performance Computing Applications, 14:189-204, 2000.

[BD02] Kristof Beyls and Erik H. D'Hollander. Reuse distance-based cache hint selection. In Proceedings of the 8th International Euro-Par Conference on Parallel Processing, Euro-Par '02, pages 265-274, London, UK, 2002. Springer-Verlag.

[BD05] Kristof Beyls and Erik H. D'Hollander. Generating cache hints for improved program efficiency. J. Syst. Archit., 51(4):223-250, April 2005.

[Bel00] Frank Bellosa. The benefits of event-driven energy accounting in power-sensitive systems. In Proceedings of the 9th Workshop on ACM SIGOPS European Workshop: Beyond the PC: New Challenges for the Operating System, EW 9, pages 37-42, New York, NY, USA, 2000. ACM.

[BGBD13] Jacob Brock, Xiaoming Gu, Bin Bao, and Chen Ding. Pacman: Program-assisted cache management. In Proceedings of the 2013 International Symposium on Memory Management, ISMM '13, pages 39-50, New York, NY, USA, 2013. ACM.

[BJ12] W. Lloyd Bircher and Lizy K. John. Complete System Power Estimation Using Processor Performance Events. IEEE Transactions on Computers, 61(4):563-577, 2012.
[Blo70] Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422-426, July 1970.

[BM84] Richard B. Bunt and Jennifer M. Murphy. The measurement of locality and the behaviour of programs. Comput. J., 27(3):238-253, August 1984.

[BM05] D.A. Bader and K. Madduri. Design and Implementation of the HPCS Graph Analysis Benchmark on Symmetric Multiprocessors. In Proc. 12th International Conference on High Performance Computing (HiPC 2005), volume 3769, pages 465-476. Springer-Verlag Berlin Heidelberg, December 2005.

[BW03] R. Bunt and C. Williamson. Temporal and spatial locality: A time and a place for everything. In Proceedings of the International Symposium in Honour of Professor Guenter Haring's 60th Birthday, 2003.

[CB95] Tien-Fu Chen and Jean-Loup Baer. Effective hardware-based data prefetching for high-performance processors. Computers, IEEE Transactions on, 44(5):609-623, May 1995.

[Cra] Cray Performance Measurement and Analysis Tools. http://docs.cray.com/.

[DAS12] Michel Dubois, Murali Annavaram, and Per Stenstrom. Parallel Computer Organization and Design. Cambridge University Press, 2012.

[Dav13] David H. Albonesi, Avinash Kodi, and Vladimir Stojanovic. Workshop on emerging technology for interconnects, 2013.

[DBH+06] K.D. Devine, E.G. Boman, R.T. Heaphy, R.H. Bisseling, and U.V. Catalyurek. Parallel hypergraph partitioning for scientific computing. In Proc. of 20th International Parallel and Distributed Processing Symposium (IPDPS'06). IEEE, 2006.

[DD15] Aditya M. Deshpande and Jeffrey T. Draper. Modeling data movement in the memory hierarchy in HPC systems. In Proceedings of the 2015 International Symposium on Memory Systems, MEMSYS '15, pages 158-161, New York, NY, USA, 2015. ACM.
[DDRB15] Aditya M. Deshpande, Jeffrey T. Draper, J. Brian Rigdon, and Richard F. Barrett. PathFinder: A signature-search miniapp and its runtime characteristics. In Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, IA3 '15, pages 9:1-9:4, New York, NY, USA, 2015. ACM.

[DKM+12] Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, and Mark Horowitz. CPU DB: Recording microprocessor history. Commun. ACM, 55(4):55-63, April 2012.

[DRR06] Jörg Dümmler, Thomas Rauber, and Gudula Rünger. Combining measures for temporal and spatial locality. In Proceedings of the 2006 International Conference on Frontiers of High Performance Computing and Networking, ISPA'06, pages 697-706, Berlin, Heidelberg, 2006. Springer-Verlag.

[DS02] A. S. Dhodapkar and J. E. Smith. Managing multi-configuration hardware via dynamic working set analysis. In Proceedings 29th Annual International Symposium on Computer Architecture, pages 233-244, 2002.

[DSVPDB07] Bjorn De Sutter, Ludo Van Put, and Koen De Bosschere. A practical interprocedural dominance algorithm. ACM Trans. Program. Lang. Syst., 29(4), August 2007.

[DZ03] Chen Ding and Yutao Zhong. Predicting whole-program locality through reuse distance analysis. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, PLDI '03, pages 245-257, New York, NY, USA, 2003. ACM.

[DZK+12] N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum. Improving cache management policies using dynamic reuse distances. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 389-400, Dec 2012.

[FFD+14] E.J. Fluhr, J. Friedrich, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, A. Hall, D. Hogenmiller, F. Malgioglio, R. Nett, J. Paredes, J. Pille, D. Plass, R. Puri, P. Restle, D. Shan, K. Stawiasz, Z.T. Deniz, D. Wendel, and M. Ziegler. 5.1 POWER8: A 12-core server-class processor in 22nm SOI with 7.6Tb/s off-chip bandwidth. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, pages 96-97, Feb 2014.
[FJ94] K.I. Farkas and N.P. Jouppi. Complexity/performance tradeoffs with non-blocking loads. In Computer Architecture, 1994, Proceedings of the 21st Annual International Symposium on, pages 211-222, Apr 1994.

[FKM+02] Krisztián Flautner, Nam Sung Kim, Steve Martin, David Blaauw, and Trevor Mudge. Drowsy Caches: Simple Techniques for Reducing Leakage Power. SIGARCH Comput. Archit. News, 30(2):148-157, May 2002.

[Fre77] L.C. Freeman. A set of measures of centrality based on betweenness. Sociometry, 40(1):35-41, 1977.

[GCB+09] Xiaoming Gu, Ian Christopher, Tongxin Bai, Chengliang Zhang, and Chen Ding. A component model of spatial locality. In Proceedings of the 2009 International Symposium on Memory Management, ISMM '09, pages 99-108, New York, NY, USA, 2009. ACM.

[GD12] Xiaoming Gu and Chen Ding. A generalized theory of collaborative caching. In Proceedings of the 2012 International Symposium on Memory Management, ISMM '12, pages 109-120, New York, NY, USA, 2012. ACM.

[GFS+10] Rong Ge, Xizhou Feng, Shuaiwen Song, Hung-Ching Chang, Dong Li, and K.W. Cameron. PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications. Parallel and Distributed Systems, IEEE Transactions on, 21(5):658-671, 2010.

[GMG+10] B. Goel, S.A. McKee, R. Gioiosa, K. Singh, M. Bhadauria, and M. Cesati. Portable, Scalable, per-core Power Estimation for Intelligent Resource Management. In Green Computing Conference, 2010 International, pages 135-146, 2010.

[Gra] Graph500 Benchmark. www.graph500.org.

[GRLC08] Ann Gordon-Ross, Jeremy Lau, and Brad Calder. Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI, GLSVLSI '08, pages 379-382, New York, NY, USA, 2008. ACM.

[Her] Heroux, Doerfler, Crozier, Willenbring, Edwards, Williams, Rajan, Keiter, Thornquist, and Numrich. Improving Performance via Mini-applications. http://www.sandia.gov/~maherou/docs/MantevoOverview.pdf.
[HFFA11] N. Hardavellas, M. Ferdman, B. Falsa, and A. Ailamaki. Toward Dark
Silicon in Servers. Micro, IEEE, 31(4):6{15, 2011.
[HHA
+
03] H. Hanson, M.S. Hrishikesh, V. Agarwal, S.W. Keckler, and D. Burger.
Static energy reduction techniques for microprocessor caches. Very
Large Scale Integration (VLSI) Systems, IEEE Transactions on,
11(3):303{313, June 2003.
[Hil87] Mark Donald Hill. Aspects of Cache Memory and Instruction Buer
Performance. PhD thesis, EECS Department, University of California,
Berkeley, Nov 1987.
[Hil88] Mark D. Hill. A case for direct-mapped caches. Computer, 21(12):25{
40, December 1988.
[HKC10] C. Hughes, C. Kim, and Y. Chen. Performance and energy implications
of many-core caches for throughput computing. IEEE Micro, 30(6):25{
35, Nov 2010.
[HKSW03] J. Haid, G. Kaefer, Ch. Steger, and R. Weiss. Run-Time Energy Esti-
mation in System-on-a-chip Designs. In Proceedings of the 2003 Asia
and South Pacic Design Automation Conference, ASP-DAC '03, pages
595{599, New York, NY, USA, 2003. ACM.
[HL99] T. Horel and G. Lauterbach. Ultrasparc-iii: designing third-generation
64-bit performance. Micro, IEEE, 19(3):73{85, May 1999.
[HMB
+
14] P. Hammarlund, A.J. Martinez, A.A. Bajwa, D.L. Hill, E. Hallnor,
Hong Jiang, M. Dixon, M. Derr, M. Hunsaker, R. Kumar, R.B. Os-
borne, R. Rajwar, R. Singhal, R. D'Sa, R. Chappell, S. Kaushik,
S. Chennupaty, S. Jourdan, S. Gunther, T. Piazza, and T. Burton.
Haswell: The fourth-generation intel core processor. Micro, IEEE,
34(2):6{20, Mar 2014.
[HP03] John L. Hennessy and David A. Patterson. Computer Architecture: A
Quantitative Approach. Morgan Kaufmann Publishers Inc., San Fran-
cisco, CA, USA, 3 edition, 2003.
[HS89] M.D. Hill and A.J. Smith. Evaluating associativity in cpu caches. Com-
puters, IEEE Transactions on, 38(12):1612{1630, Dec 1989.
[IM03] C. Isci and M. Martonosi. Runtime Power Monitoring in high-end
Processors: Methodology and Empirical Data. In Microarchitecture,
164
2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International
Symposium on, pages 93{104, 2003.
[Int] Intel, Intel Architecture Software Developer's Manual, Volume 3: Sys-
tem Programming Guide, 2009.
[Int10] Intel. Intel energy checker: Software developer kit user guide, 2010.
[ITR] ITRS Roadmap. http://www.itrs.net/.
[JDER01] P. Jain, S. Devadas, D. Engels, and L. Rudolph. Software-assisted
cache replacement mechanisms for embedded systems. In IEEE/ACM
International Conference on Computer Aided Design. ICCAD 2001.
IEEE/ACM Digest of Technical Papers (Cat. No.01CH37281), pages
119{126, Nov 2001.
[JM01] Russ Joseph and Margaret Martonosi. Run-time Power Estimation in
High Performance Microprocessors. In Proceedings of the 2001 inter-
national symposium on Low power electronics and design, ISLPED '01,
pages 135{140, New York, NY, USA, 2001. ACM.
[Jou90] N.P. Jouppi. Improving direct-mapped cache performance by the ad-
dition of a small fully-associative cache and prefetch buers. In Com-
puter Architecture, 1990. Proceedings., 17th Annual International Sym-
posium on, pages 364{373, May 1990.
[KBK03] Changkyu Kim, D. Burger, and S.W. Keckler. Nonuniform cache ar-
chitectures for wire-delay dominated on-chip caches. Micro, IEEE,
23(6):99{107, Nov 2003.
[KCK
+
01] I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykirsnan, M. J. Irwin,
and A. Sivasubramaniam. vEC: virtual Energy Counters. In Pro-
ceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program
analysis for software tools and engineering, PASTE '01, pages 28{31,
New York, NY, USA, 2001. ACM.
[KFBM02] Nam Sung Kim, Kriszti an Flautner, David Blaauw, and Trevor Mudge.
Drowsy Instruction Caches: Leakage Power Reduction using Dynamic
Voltage Scaling and Cache Sub-bank Prediction. In Proceedings of the
35th annual ACM/IEEE international symposium on Microarchitec-
ture, MICRO 35, pages 219{230, Los Alamitos, CA, USA, 2002. IEEE
Computer Society Press.
165
[KGKH13] G. Kestor, R. Gioiosa, D.J. Kerbyson, and A. Hoisie. Quantifying the
energy cost of data movement in scientic applications. In Workload
Characterization (IISWC), 2013 IEEE International Symposium on,
pages 56{65, Sept 2013.
[KHM01] S. Kaxiras, Zhigang Hu, and M. Martonosi. Cache decay: exploiting
generational behavior to reduce cache leakage power. In Computer
Architecture, 2001. Proceedings. 28th Annual International Symposium
on, pages 240{251, 2001.
[KK98] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for
partitioning irregular graphs. SIAM J. Sci. Comput., 20(1), December
1998.
[KKMR05] C.H. Kim, Jae-Joon Kim, S. Mukhopadhyay, and K. Roy. A forward
body-biased low-leakage sram cache: device, circuit and architecture
considerations. Very Large Scale Integration (VLSI) Systems, IEEE
Transactions on, 13(3):349{357, March 2005.
[Kog14] Peter. M. Kogge. Reading the Entrails: How Architecture Has Evolved
at the High End - IPDPS 2014 Keynote lecture. http://www.ipdps.
org/ipdps2014/IPDPS2014keynote-Kogge.pdf, 2014.
[KPK07] G. Keramidas, P. Petoumenos, and S. Kaxiras. Cache replacement
based on reuse-distance prediction. In 2007 25th International Confer-
ence on Computer Design, pages 245{250, Oct 2007.
[Kro81] David Kroft. Lockup-free instruction fetch/prefetch cache organization.
In Proceedings of the 8th Annual Symposium on Computer Architecture,
ISCA '81, pages 81{87, Los Alamitos, CA, USA, 1981. IEEE Computer
Society Press.
[KZS
+
12] S. Kumar, H. Zhao, A. Shriraman, E. Matthews, S. Dwarkadas, and
L. Shannon. Amoeba-cache: Adaptive blocks for eliminating waste in
the memory hierarchy. In 2012 45th Annual IEEE/ACM International
Symposium on Microarchitecture, pages 376{388, Dec 2012.
[Lab] Sandia National Laboratories. SST: The Structural Simulation Toolkit.
http://sst.sandia.gov.
[LAS
+
09] Sheng Li, Jung Ho Ahn, Richard D. Strong, Jay B. Brockman, Dean M.
Tullsen, and Norman P. Jouppi. McPAT: An Integrated Power, Area,
166
and Timing Modeling Framework for Multicore and Manycore Archi-
tectures. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM
International Symposium on Microarchitecture, pages 469{480, 2009.
[LBD
+
06] Piotr R Luszczek, David H Bailey, Jack J Dongarra, Jeremy Kep-
ner, Robert F Lucas, Rolf Rabenseifner, and Daisuke Takahashi. The
hpc challenge (hpcc) benchmark suite. In Proceedings of the 2006
ACM/IEEE Conference on Supercomputing, SC '06, New York, NY,
USA, 2006. ACM.
[LKfT
+
03] Lin Li, Ismail Kadayif, Yuh fang Tsai, N. Vijaykrishnan, Mahmut
Kandemir, Mary Jane Irwin, and Anand Sivasubramaniam. Manag-
ing leakage energy in cache hierarchies. Journal of Instruction-level
Parallelism, 5:2003, 2003.
[LMC13] X. Liu and J. Mellor-Crummey. Pinpointing data locality bottlenecks
with low overhead. In 2013 IEEE International Symposium on Per-
formance Analysis of Systems and Software (ISPASS), pages 183{193,
April 2013.
[L.S97] L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I.
Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stan-
ley, D. Walker, R.C. Whaley. ScaLAPACK: A Linear Algebra Library
for Message-Passing Computers. In In SIAM Conference on Parallel
Processing, 1997.
[M
+
] Naveen Muralimanohar et al. CACTI 6.0: A Tool to Model Large
Caches, Technical Report, HP Laboratories, April 2009. http://www.
hpl.hp.com/techreports/2009/HPL-2009-85.html.
[man] Mantevo Project. http://www.mantevo.org/.
[MBD05] Nasir Mohyuddin, Rashed Bhatti, and Michel Dubois. Controlling
Leakage Power with the Replacement Policy in Slumberous Caches.
In Proceedings of the 2nd conference on Computing frontiers, CF '05,
pages 161{170, New York, NY, USA, 2005. ACM.
[McC95] John D. McCalpin. Memory bandwidth and machine balance in cur-
rent high performance computers. IEEE Computer Society Technical
Committee on Computer Architecture (TCCA) Newsletter, pages 19{
25, December 1995.
167
[MCZ14] S. Mittal, Y. Cao, and Z. Zhang. Master: A multicore cache energy-
saving technique using dynamic cache reconguration. IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, 22(8):1653{
1665, Aug 2014.
[MK07] Richard C. Murphy and Peter M. Kogge. On the memory access pat-
terns of supercomputer applications: Benchmark selection and its im-
plications. IEEE Trans. Comput., 56(7):937{945, July 2007.
[MMC00] A. Malik, B. Moyer, and D. Cermak. A Low Power Unied Cache
Architecture Providing Power and Performance Flexibility. In Low
Power Electronics and Design, 2000. ISLPED '00. Proceedings of the
2000 International Symposium on, pages 241{243, 2000.
[NH06] Weste Neil and David Harris. CMOS VLSI Design: A Circuits and
Systems Perspective. Pearson Education, 2006.
[NLJ96] J.J. Navarro, T. Lang, and T. Juan. The dierence-bit cache. In
Computer Architecture, 1996 23rd Annual International Symposium
on, pages 114{114, May 1996.
[Nvi] NVIDIA, NVML Reference Manual, NVIDIA, 2012.
[NXT
+
15] Lifeng Nai, Yinglong Xia, Ilie G. Tanase, Hyesoon Kim, and Ching-
Yung Lin. Graphbig: Understanding graph computing in the context
of industrial solutions. In Proceedings of the International Conference
for High Performance Computing, Networking, Storage and Analysis,
SC '15, pages 69:1{69:12, New York, NY, USA, 2015. ACM.
[P. 08] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson,
W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, J.
Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards,
A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams,
and K. Yelick. Exascale computing study: Technology challenges
in achieving exascale systems. http://users.ece.gatech.edu/
mrichard/ExascaleComputingStudyReports/exascale_final_
report_100208.pdf, 2008.
[PBE13] Minh-Duc Pham, Peter Boncz, and Orri Erling. S3g2: A scalable
structure-correlated social graph generator. In Raghunath Nambiar
and Meikel Poess, editors, Selected Topics in Performance Evaluation
and Benchmarking: 4th TPC Technology Conference, TPCTC 2012,
168
Istanbul, Turkey, August 27, 2012, Revised Selected Papers, pages 156{
172, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
[Pop] P. Popa. Managing Server Energy Consumption using IBM PowerEx-
ecutive.
[Pro59] Reese T. Prosser. Applications of boolean matrices to the analysis of
ow diagrams. In Papers Presented at the December 1-3, 1959, East-
ern Joint IRE-AIEE-ACM Computer Conference, IRE-AIEE-ACM '59
(Eastern), New York, NY, USA, 1959. ACM.
[PYF
+
00] M. Powell, S.-H. Yang, B. Falsa, K. Roy, and T. N. Vijayku-
mar. Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-
Submicron Cache Memories. In Low Power Electronics and Design,
2000. ISLPED '00. Proceedings of the 2000 International Symposium
on, pages 90{95, 2000.
[QSP07] M. K. Qureshi, M. A. Suleman, and Y. N. Patt. Line distillation: In-
creasing cache capacity by ltering unused words in cache lines. In 2007
IEEE 13th International Symposium on High Performance Computer
Architecture, pages 250{259, Feb 2007.
[QTP05] Moinuddin K. Qureshi, David Thompson, and Yale N. Patt. The v-
way cache: Demand based associativity via global replacement. In
Proceedings of the 32Nd Annual International Symposium on Computer
Architecture, ISCA '05, pages 544{555, Washington, DC, USA, 2005.
IEEE Computer Society.
[R. 14] R. Lucas, J. Ang, K. Bergman, S. Borkar, W. Carlson, L. Car-
rington, G. Chiu, R. Colwell, W. Dally, J. Dongarra, A. Geist,
G. Grider, R. Haring, J. Hittinger, A. Hoisie, D. Klein, P. Kogge,
R. Lethin, V. Sarkar, R.Schreiber, J. Shalf, T. Sterling, and R.
Stevens. Top ten exascale research challenges, doe ascac subcommit-
tee report. http://science.energy.gov/
~
/media/ascr/ascac/pdf/
meetings/20140210/Top10reportFEB14.pdf, 2014.
[RAJ00] P. Ranganathan, S. Adve, and N. P. Jouppi. Recongurable caches and
their application to media processing. In Proceedings of 27th Interna-
tional Symposium on Computer Architecture (IEEE Cat. No.RS00201),
pages 214{224, June 2000.
[RBS96] Eric Rotenberg, Steve Bennett, and James E. Smith. Trace cache:
A low latency approach to high bandwidth instruction fetching. In
169
Proceedings of the 29th Annual ACM/IEEE International Symposium
on Microarchitecture, MICRO 29, pages 24{35, Washington, DC, USA,
1996. IEEE Computer Society.
[RFD09] D. Rol an, B. B. Fraguela, and R. Doallo. Adaptive line placement
with the set balancing cache. In 2009 42nd Annual IEEE/ACM Inter-
national Symposium on Microarchitecture (MICRO), pages 529{540,
Dec 2009.
[RHB
+
11] A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Old-
eld, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. CooperBalls, and
B. Jacob. The structural simulation toolkit. SIGMETRICS Perform.
Eval. Rev., 38(4):37{42, March 2011.
[RJ98] J.T. Russell and M.F. Jacome. Software power estimation and opti-
mization for high performance, 32-bit embedded processors. In Com-
puter Design: VLSI in Computers and Processors, 1998. ICCD '98.
Proceedings. International Conference on, pages 328{333, Oct 1998.
[Ryf] S. Ryel. LEA2P: The Linux Energy Attribution and Accounting Plat-
form.
[S
+
09] Vivek Sarkar et al. ExaScale Software Study: Software Challenges in
Extreme Scale Systems, 2009.
[SBM09] Karan Singh, Major Bhadauria, and Sally A. McKee. Real Time
Power Estimation and Thread Scheduling via Performance Counters.
SIGARCH Comput. Archit. News, 37(2):46{55, July 2009.
[SC99] Wen-Tsong Shiue and C. Chakrabarti. Memory design and exploration
for low power, embedded systems. In Signal Processing Systems, 1999.
SiPS 99. 1999 IEEE Workshop on, pages 281{290, 1999.
[Seg01] S. Segars. Low power design techniques for microprocessors, 2001.
[SG85] J.E. Smith and J.R. Goodman. Instruction cache replacement policies
and organizations. Computers, IEEE Transactions on, C-34(3):234{
241, March 1985.
[Smi82] Alan Jay Smith. Cache memories. ACM Computing Surveys, 14:473{
530, 1982.
[Smi87] A.J. Smith. Line (block) size choice for cpu cache memories. Comput-
ers, IEEE Transactions on, C-36(9):1063{1075, Sept 1987.
170
[SSL06] R. Subramanian, Y. Smaragdakis, and G. H. Loh. Adaptive caches:
Eective shaping of cache behavior to workloads. In 2006 39th An-
nual IEEE/ACM International Symposium on Microarchitecture (MI-
CRO'06), pages 385{396, Dec 2006.
[SVMW05] Jennifer B. Sartor, Subramaniam Venkiteswaran, Kathryn S. McKin-
ley, and Zhenlin Wang. Cooperative caching with keep-me and evict-
me. In Proceedings of the 9th Annual Workshop on Interaction Between
Compilers and Computer Architectures, INTERACT '05, pages 46{57,
Washington, DC, USA, 2005. IEEE Computer Society.
[SZD04] Xipeng Shen, Yutao Zhong, and Chen Ding. Locality phase prediction.
In Proceedings of the 11th International Conference on Architectural
Support for Programming Languages and Operating Systems, ASPLOS
XI, pages 165{176, New York, NY, USA, 2004. ACM.
[TJYD] Dan Terpstra, Heike Jagode, Haihang You, and Jack Dongarra. Col-
lecting Performance Data with PAPI-C.
[TMW94] V. Tiwari, S. Malik, and A. Wolfe. Power Analysis of Embedded Soft-
ware: A First Step Towards Software Power Minimization. Very Large
Scale Integration (VLSI) Systems, IEEE Transactions on, 2(4):437{
445, 1994.
[Top] Top500 Supercomputer Sites. http://www.top500.org/.
[TTK14] E. Totoni, J. Torrellas, and L. V. Kale. Using an adaptive hpc run-
time system to recongure the cache hierarchy. In SC14: International
Conference for High Performance Computing, Networking, Storage and
Analysis, pages 1047{1058, Nov 2014.
[VLX03] Xavier Vera, Bj orn Lisper, and Jingling Xue. Data cache locking for
higher program predictability. In Proceedings of the 2003 ACM SIG-
METRICS International Conference on Measurement and Modeling of
Computer Systems, SIGMETRICS '03, pages 272{282, New York, NY,
USA, 2003. ACM.
[VTG
+
99] Alexander V. Veidenbaum, Weiyu Tang, Rajesh Gupta, Alexandru
Nicolau, and Xiaomei Ji. Adapting cache line size to application be-
havior. In Proceedings of the 13th International Conference on Super-
computing, ICS '99, pages 145{154, New York, NY, USA, 1999. ACM.
171
[WJK
+
12] V.M. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek,
D. Terpstra, and S. Moore. Measuring Energy and Power with PAPI.
In Parallel Processing Workshops (ICPPW), 2012 41st International
Conference on, pages 262{268, 2012.
[WMRW02] Zhenlin Wang, K. S. McKinley, A. L. Rosenberg, and C. C. Weems.
Using the compiler to improve cache replacement decisions. In Pro-
ceedings.International Conference on Parallel Architectures and Com-
pilation Techniques, pages 199{208, 2002.
[WMS09] Matthew A. Watkins, Sally A. Mckee, and Lambert Schaelicke. Revis-
iting cache block superloading. In Proceedings of the 4th International
Conference on High Performance Embedded Architectures and Compil-
ers, HiPEAC '09, pages 339{354, Berlin, Heidelberg, 2009. Springer-
Verlag.
[WMSS05] Jonathan Weinberg, Michael O. McCracken, Erich Strohmaier, and
Allan Snavely. Quantifying locality in the memory access patterns of
hpc applications. In Proceedings of the 2005 ACM/IEEE Conference
on Supercomputing, SC '05, pages 50{, Washington, DC, USA, 2005.
IEEE Computer Society.
[YBS05] Jing Yu, Sara Baghsorkhi, and Marc Snir. A New Locality Metric and
Case Studies for HPCS Benchmark. Technical report, University of
Illinois Urbana Champaign, 2005.
[YPF
+
01] S.-H. Yang, M.D. Powell, B. Falsa, K. Roy, and T. N. Vijaykumar. An
integrated circuit/architecture approach to reducing leakage in deep-
submicron high-performance i-caches. In High-Performance Computer
Architecture, 2001. HPCA. The Seventh International Symposium on,
pages 147{157, 2001.
[ZDD03] Yutao Zhong, Steven G. Dropsho, and Chen Ding. Miss rate prediction
across all program inputs. In Proceedings of the 12th International Con-
ference on Parallel Architectures and Compilation Techniques, PACT
'03, pages 79{, Washington, DC, USA, 2003. IEEE Computer Society.
[ZGR13] Wei Zang and Ann Gordon-Ross. A survey on cache tuning from a
power/energy perspective. ACM Comput. Surv., 45(3):32:1{32:49, July
2013.
172
[ZVN03] C. Zhang, F. Vahid, and W. Najjar. A highly congurable cache ar-
chitecture for embedded systems. In 30th Annual International Sym-
posium on Computer Architecture, 2003. Proceedings., pages 136{146,
June 2003.
173
Abstract
High-performance computing is entrenched in our lives. It is used to model complex physical processes in science, engineering, and medicine, and it requires extremely large computing platforms. To sustain new research over the next decade, the U.S. Department of Energy plans to build exascale systems, which require at least a 10x performance improvement while maintaining a flat energy profile. Building exascale systems necessitates designing and developing high-performance, energy-efficient hardware blocks. Traditional computing models rely on a bring-data-to-core paradigm: for any operation, data must be fetched from memory through the cache hierarchy into the processor's register file, operated upon, and then written back to memory through the cache hierarchy. Increasing cache sizes, deeper cache hierarchies, and growing application problem sizes lead to significant increases in the movement of data through the cache hierarchy. This increased data movement often creates bottlenecks and results in poor performance and utilization of computing resources. Given the criticality of data movement across the memory hierarchy, my research focuses on optimizing the cache hierarchy from the perspective of minimizing excess and unnecessary data movement.

In this work, I focused on a trifecta of approaches to optimize data movement across the cache hierarchy. First, because application behavior plays an important part in dictating the flow of data through the cache hierarchy, we characterize applications. Specifically, we characterized PathFinder, a proxy application for the graph-analytics signature-search algorithms critical to the DOE HPC community. Our results show that inefficient utilization of the L2 cache (≈50% cache hit rate) results in increased data movement. We then characterized applications on various locality metrics to quantify differences between several classes of applications: graph kernels show 20% less spatial and temporal locality than other HPC applications, 50% lower data intensiveness, and a 90% data turnover rate. Second, we developed a cache energy model to measure cache energy when running applications on real-world systems. Using our model, we observed that leakage is the primary energy dissipation mechanism in L1 caches and accounts for up to 80% of cache energy. We also quantified the implications of heterogeneous cacheline sizes across the levels of the cache hierarchy and of cache set-associativity on data movement: using different cacheline sizes across the cache levels yields over 13% data movement savings and over 30% improvement in L2/L3 cache hit rates, and a 2-way L1 cache yields over 18% savings relative to an 8-way L1 cache. Third, we present a cache guidance system framework that exposes cache architecture artifacts to users so they can explicitly dictate the behavior of an individual cache. Using one guidance generation scheme, we demonstrated over a 30% reduction in cache misses in application phases.
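The abstract's second thrust refers to a cache energy model in which leakage dominates L1 energy. The C++ sketch below is not the dissertation's model; it is a minimal, hypothetical illustration of the kind of first-order accounting such a model implies, splitting cache energy into a dynamic term proportional to access count and a leakage term proportional to runtime. All parameter names and values are placeholders chosen for illustration only.

#include <cstdint>
#include <cstdio>

struct CacheEnergyParams {
    double dynamic_energy_per_access_nj;  // energy per cache access in nJ (hypothetical value)
    double leakage_power_mw;              // static leakage power of the cache array in mW (hypothetical value)
};

struct CacheActivity {
    uint64_t accesses;   // loads + stores observed at this cache level
    double runtime_s;    // application runtime in seconds
};

// First-order accounting: dynamic energy scales with access count,
// leakage energy scales with elapsed time.
static double cache_energy_joules(const CacheEnergyParams& p, const CacheActivity& a) {
    const double dynamic_j = static_cast<double>(a.accesses) * p.dynamic_energy_per_access_nj * 1e-9;
    const double leakage_j = p.leakage_power_mw * 1e-3 * a.runtime_s;
    return dynamic_j + leakage_j;
}

int main() {
    // Hypothetical 32 KB L1 data cache parameters and a 10-second run.
    const CacheEnergyParams l1 = {0.05, 20.0};
    const CacheActivity run = {2000000000ULL, 10.0};

    const double total_j = cache_energy_joules(l1, run);
    const double leakage_j = l1.leakage_power_mw * 1e-3 * run.runtime_s;
    std::printf("L1 energy: %.3f J, leakage share: %.1f%%\n",
                total_j, 100.0 * leakage_j / total_j);
    return 0;
}

With these placeholder numbers the leakage term already accounts for about two-thirds of the total, which illustrates how, for small low-latency caches, static leakage can dominate dynamic switching energy as the abstract reports; the dissertation's actual model and measured parameters are described in the body of the thesis.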
Asset Metadata
Creator: Deshpande, Aditya Madhusudan (author)
Core Title: Cache analysis and techniques for optimizing data movement across the cache hierarchy
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 05/03/2019
Defense Date: 05/02/2019
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: cache data movement, computer architecture, high-performance computing applications, OAI-PMH Harvest
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Draper, Jeffrey (committee chair), Gupta, Sandeep (committee member), Lucas, Robert (committee member)
Creator Email: adityamdeshpande@gmail.com, amdeshpa@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-165870
Unique identifier: UC11659919
Identifier: etd-DeshpandeA-7400.pdf (filename), usctheses-c89-165870 (legacy record id)
Legacy Identifier: etd-DeshpandeA-7400.pdf
Dmrecord: 165870
Document Type: Dissertation
Rights: Deshpande, Aditya Madhusudan
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA