Architectural Innovations for Mitigating Data Movement Cost on Graphics Processing Units and Storage Systems

by Gunjae Koo

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2018

Copyright 2018 Gunjae Koo

Dedication

To Jeongeun, for being a sincere companion in my life journey.
To Ayin, who teaches me the love and joy of being a daddy.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my advisor, Professor Murali Annavaram. He was always open to helping anyone who wanted to discuss ideas with him. I was impressed that in discussions he always tried to give valuable suggestions, even when he was busy handling countless tasks of his own. He always expressed his trust in me, even when I was struggling to make notable progress. He encouraged me to aim high and think positively even in frustrating situations. He has been a great role model for every aspect of life. He was a great teacher who demonstrated how to understand research ideas and explain them effectively to other people. He was a brilliant and diligent researcher who always had the passion to explore cutting-edge and undiscovered research areas. To his sons, he was also an awesome daddy who enjoyed every moment of play with them. Throughout my Ph.D. I never stopped learning from him.

I would also like to thank my qualifying and defense committee members, Professor Timothy Pinkston, Professor Leana Golubchik, Professor Michel Dubois, and Professor Jeffrey Draper, for always being supportive and providing insightful feedback. Their suggestions have been invaluable in refining the content of my dissertation. I would like to express my deep appreciation to Professor Pinkston for his encouragement and his reference for my job search.

I would like to acknowledge Professor Hung-Wei Tseng. He gave me tremendous help and invaluable advice when I started to explore the new world of storage systems. He always kindly responded to my countless questions from the East Coast. My Summarizer paper could not have been born without his support. I also thank Professor Tseng for his advice, encouragement, and reference for an academic career. I am also grateful to Professor Nam Sung Kim. He was always open to giving advice on my GPU research work. I would also like to thank Professor Kim for his reference.

I had an opportunity to work with the researchers at Intel as a research intern. I would like to thank my internship mentors, Vivek Kozhikkottu and Chris Wilkerson. I was so lucky to have an opportunity to investigate memory systems with Intel's in-house computer architecture simulator. During the internship, I was able to gain momentum for research on memory systems thanks to their support. I also thank the other co-workers who shared that time during my internship: Yooseong Kim, Wootaek Lim, Kon-Woo Kwon, and Sang-Phill Park.

I should also thank my friends and colleagues at the University of Southern California. I want to express special thanks to Jinho Suh, who helped me settle down in LA and gave me valuable advice when I started my Ph.D. study. I also thank Daniel Wong, who occupied the left side of the office and helped me prepare for the job market. I would like to thank Hyeran Jeon, with whom I developed many GPU research ideas. I am also grateful to Kiran Matam, who spent many long, hard hours setting up the storage board and implementing research ideas with me.
I want to express thanks to my other office mates, Sang Wook Do and Chie-Ting Huang, from whom I received a lot of help when I was working as a TA. I would like to thank the other SCIP (Super Computing In Pocket) lab members: Mohammad Abel-Majeed, Qiumin Xu, Sangwon Lee, Abdulaziz Tabbakh, Krishna Giri Nara, Zhifeng Lin, Waleed Dweik, Melina Demertzi, Lakshmi Kumar Dabbiru, Haipeng Zha, Seyedeh Hanieh Hashemi, and Bardia Zandian. They were all great friends and brilliant collaborators. I would also like to thank the students and the professor who visited USC to collaborate with our research group: Sangpil Lee, Keunsoo Kim, Myung Kuk Yoon, Yunho Oh, and Professor Won Woo Ro. I was so fortunate to work with them. I also thank my Korean friends in EE: Woojoo Lee, Soowang Park, Hyunseok Ko, Joongheon Kim, and Jae-Won Nam. They were always willing to spend time talking and drinking coffee with me.

Last but not least, I express my heartfelt thanks to my family. I deeply appreciate their support, prayers, love, trust, and patience. I could not imagine my first year as a Ph.D. student without my mother-in-law and my mother, who helped care for baby Ayin. My daughter Ayin also sacrificed her playing time with dad when I was rushing to meet paper deadlines. Most of all, I would like to express my sincere gratitude to my wife, Jeoungeun. I could not have completed this long journey without her sacrifice, love, and support.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Challenges in Computer Systems for Big Data
  1.2 GPU Architecture
    1.2.1 GPU Memory Hierarchy Architecture
    1.2.2 GPU Software Execution Model
    1.2.3 Memory Operation: A Primary Performance Bottleneck
  1.3 SSD Architecture
    1.3.1 Modern SSD Platforms
    1.3.2 Flash Translation Layer
    1.3.3 Storage I/O Time – A Critical Performance Hurdle
  1.4 Contributions

Chapter 2: Revealing Critical Loads in GPU Memory Systems
  2.1 Introduction
  2.2 Methodology
    2.2.1 Applications
    2.2.2 Experiment environment
  2.3 Classifying Loads
  2.4 Impact of Non-Deterministic Loads on Memory Traffic
  2.5 Impact of Non-Determinism on Load Instruction Latency
  2.6 Cache Miss Rate
  2.7 Chapter Summary

Chapter 3: CTA-Aware Prefetching and Scheduling
  3.1 Introduction
  3.2 Background
    3.2.1 CTA Distribution
  3.3 Limitations of Prefetches in GPU
    3.3.1 Intra-Warp Stride Prefetching
    3.3.2 Inter-Warp Stride Prefetching
    3.3.3 Next-Line Prefetching
  3.4 Where Did My Strides Go?
  3.5 CTA-Aware Prefetcher and Scheduler
    3.5.1 Prefetch-Aware Scheduler
    3.5.2 CTA-Aware Prefetcher
    3.5.3 A Simple Prefetch Generation Illustration
    3.5.4 Hardware Cost
  3.6 Evaluation
    3.6.1 Settings and Workloads
    3.6.2 Performance Enhancement
    3.6.3 Coverage and Accuracy of Prefetching
    3.6.4 Bandwidth Overhead
    3.6.5 Timeliness of Prefetching
    3.6.6 Energy Consumption
  3.7 Chapter Summary

Chapter 4: Cache Management for Improving Data Utilization in GPU
  4.1 Introduction
  4.2 Cache Access Characteristics
    4.2.1 Data Locality Type
    4.2.2 Loss of Locality in Cache
    4.2.3 Access Pattern Similarity
  4.3 Access Pattern-Aware Cache Management
    4.3.1 Locality-Specific Cache Management Strategies
    4.3.2 Detection of Locality Types
    4.3.3 Protection Algorithm
  4.4 Hardware Architecture
    4.4.1 Tracking Access Patterns
    4.4.2 Cache Management
    4.4.3 An Illustrative Example
    4.4.4 Hardware Cost
  4.5 Evaluation
    4.5.1 Methodology
    4.5.2 Performance
    4.5.3 Cache Efficiency
    4.5.4 Performance with Warp Throttling
    4.5.5 Comparison with Other Schemes
    4.5.6 Sensitivity Studies
    4.5.7 Energy Consumption
  4.6 Chapter Summary

Chapter 5: In-Storage Indexing Mechanism
  5.1 Introduction
  5.2 Motivation: Criticality of Indexing
    5.2.1 Scan and Filtering
    5.2.2 Join Processing
  5.3 Flash Indexer (FLIXR)
    5.3.1 Overview of FLIXR Model
    5.3.2 Index Creation and Maintenance
    5.3.3 Exploiting Indexes for FLIXR Computations
    5.3.4 Supports for Join Processing
    5.3.5 Extended NVMe Commands for FLIXR
    5.3.6 Cost Overhead
  5.4 Evaluation
  5.5 Experimental Results
    5.5.1 Index Maintenance Performance
    5.5.2 Query Processing Performance
    5.5.3 Storage I/O
    5.5.4 Performance Effect of Secondary Key Indexing
    5.5.5 Performance by Computation Power of the Host CPU
    5.5.6 State Space Exploration: Internal Bandwidth
    5.5.7 Energy Consumption
  5.6 Related work
  5.7 Chapter Summary

Chapter 6: Conclusion

Reference List

List of Tables

2.1 Application characteristics in three categories
2.2 The profiler counters used in this study [65]
2.3 Experiment environments
3.1 Database entry size of the prefetcher
3.2 Required hardware for tables
3.3 GPU configuration
3.4 Workloads
4.1 Benchmarks
4.2 Criteria of locality type decision
4.3 Hardware overhead by APCM
4.4 GPGPU-Sim baseline configurations
4.5 Basic APCM configuration
5.1 New NVMe commands for FLIXR
5.2 FLIXR API
5.3 Evaluation platform configuration

List of Figures

1.1 GPU architecture
1.2 LDST unit and memory hierarchy of GPU
1.3 GPU software execution model
1.4 Breakdown of warp issue stall cycles
1.5 Fraction of idle cycles of execution units
1.6 Breakdown of L1 data cache cycle
1.7 The architecture of an SSD platform
1.8 An example of FTL table
1.9 Fraction of I/O and processing time
2.1 Deterministic and non-deterministic load distribution
2.2 Average number of memory requests per active thread and warp for deterministic and non-deterministic loads
2.3 Average turnaround time of non-deterministic and deterministic loads
2.4 Load instruction turnaround time w.r.t. number of generated requests
2.5 Example of turnaround time breakdown for the non-deterministic load instruction (PC: 0x110 in bfs)
2.6 L1 and L2 cache miss ratio for the non-deterministic and deterministic loads
2.7 Ratio of shared memory load and global memory load
3.1 Accuracy with stride-based inter-warp prefetch and cycle gaps by distances of warps
3.2 Example CTA distribution across SMs
3.3 The average number of iterations for load instructions in a kernel. Repeated load instructions / total load instructions (by PC) under names of benchmarks
3.4 Irregular stride among CTAs assigned to the same SM
3.5 Load address calculation examples
3.6 Hardware structure of CAPS
3.7 Prefetch-aware warp scheduler
3.8 Cases for prefetch request generation
3.9 Normalized IPC over two-level scheduler without prefetch
3.10 Performance by number of concurrent CTAs
3.11 Prefetch coverage and accuracy
3.12 Bandwidth overhead by prefetching
3.13 Timeliness of prefetching
3.14 Energy consumption by CAPS
4.1 Miss rate change per load by warp throttling
4.2 Ratio of data regions by data locality types
4.3 Ratio of number of cache lines by access count
4.4 APS for infinite sized cache
4.5 Reuse distance for different locality type data
4.6 Load dependency and consumer load ID
4.7 Hardware architecture of APCM
4.8 Additional fields in tags
4.9 Fields in an entry of CAIT
4.10 Cache data protection control
4.11 Performance of APCM normalized to the baseline Configuration 1
4.12 Performance on different GPU configurations
4.13 L1 cache miss rate
4.14 Performance with warp throttling methods
4.15 L1 cache miss rate with warp throttling methods
4.16 Performance by bypassing methods
4.17 Performance by protection methods
4.18 Performance by warp schedulers
4.19 Performance by L1 cache size
4.20 Performance by L1 cache size (normalized to 16KB cache)
4.21 Performance by MTA configurations
4.22 Performance by LAT and CAIT depth
4.23 Normalized energy consumption
5.1 Hierarchical indexing operation
5.2 Scan and filtering in TPC-H Query 6
5.3 Join processing in TPC-H Query 14
5.4 Overview of FLIXR operation model
5.5 An example of per-table indexing by FLIXR
5.6 Evaluation platform
5.7 Index maintenance performance
5.8 Query processing performance
5.9 Data traffic from storage
5.10 Performance and data traffic with the secondary key
5.11 Performance by throttling host processor
5.12 Performance by SSD's internal bandwidth
5.13 Energy consumption (Normalized to Baseline)

Abstract

The recent exponential growth in the size of the data sets demanded by modern big data applications requires innovative computer systems architecture design. In order to accelerate the processing of huge volumes of data, modern datacenter server systems harness graphics processing units (GPUs) as high-performance compute nodes that support massively parallel processing of data-intensive applications. These server systems also deploy non-volatile memory (NVM)-based storage devices to provide lower data transfer latency from storage to compute nodes. Despite employing such advanced parallel processors and storage systems, server systems suffer from the performance overhead of transferring massive amounts of data. For instance, long data fetch latency is one of the critical performance bottlenecks in GPUs, and big data applications that rely on storage-resident data spend a significant fraction of their execution time on data input/output (I/O).

In this dissertation we present architectural approaches for improving the performance of GPUs and storage systems, which are critical hardware components in modern datacenter server systems. First we analyze the characteristics and performance impact of global load instructions on the GPU memory hierarchy. Based on the index generation rules of global load instructions we categorize load instructions into two different types, deterministic and non-deterministic, and we reveal that non-deterministic loads create bursts of irregular memory requests that drain the resources of the GPU memory hierarchy. The long latency of global memory instructions is one of the critical performance bottlenecks in GPUs, since the long data fetch latency cannot be hidden by the GPU's quick context switching among tens of concurrent warps. In order to mitigate the performance overhead of load instructions, we propose an efficient prefetch mechanism combined with a prefetch-aware warp scheduler. Our proposed prefetcher estimates accurate prefetch addresses by detecting the base address of each thread block and the common stride observed between neighboring warps within a single thread block.
In order to provide better timeliness for prefetch requests, the prefetch-aware warp scheduler reorganizes the warp execution priority so that the information required for prefetch address estimation is detected as quickly as possible. Hence the proposed prefetcher is able to issue accurate prefetch requests sufficiently ahead of the target demand fetch requests.

GPUs employ a local data cache to mitigate the performance overhead of long-latency load instructions. However, the GPU data cache is not utilized efficiently, since the small data cache shared by dozens of warps suffers from significant cache contention and premature data eviction. In order to improve data cache utilization we propose Access Pattern-Aware Cache Management (APCM), which provides a fine-grained per-load cache management scheme. We discover that individual global load instructions exhibit different types of warp-based locality behavior and that data allocated by loads of different locality types has divergent reuse intervals and lifetimes in the data cache. Furthermore, load instructions that share the same program counter exhibit consistent locality properties across all warps originating from the same kernel. Based on this discovery, APCM applies a per-load cache management scheme to all warps once the optimal cache management scheme is determined for each global load in a single warp. In order to detect the locality type of each load, APCM first tracks the cache access history at runtime using a dedicated tag array structure for a single monitored warp. APCM then selectively applies cache bypassing or cache line protection for all warps based on the detected locality types of individual loads. Our evaluation shows that APCM improves performance by 34% for data-intensive GPU applications.

Storage I/O time is becoming more critical in datacenter server systems since modern big data applications demand huge volumes of storage-resident data sets that cannot fit in the external memory (such as DRAM) of compute nodes. Modern database systems therefore exploit index structures to reduce data accesses to storage. However, as database size grows exponentially, the storage space overhead of the index structure also increases significantly, and scanning indexes creates frequent accesses to storage. Moreover, updating large indexes is a heavy computational burden for host systems. In this dissertation we propose an in-storage indexing mechanism, called FLIXR, that uses the embedded processor and memory systems enclosed in modern storage devices. FLIXR builds and maintains page-level indexes while database table data is written or updated on SSDs. Exploiting the native address translation and page I/O process in SSDs, FLIXR efficiently performs index comparison and join processing functions to filter out unneeded page data fetches from the SSD. FLIXR shows a 21.7% performance improvement for index maintenance workloads and a 1.81x performance uplift for a wide range of query processing benchmarks.

To summarize, in this dissertation we present several architectural approaches that mitigate critical data movement costs in GPUs and storage systems, which are widely deployed in datacenter server systems and cloud infrastructure. Hence this dissertation contributes to improving the performance of high-performance computer systems for modern big data applications.
Chapter 1: Introduction

1.1 Challenges in Computer Systems for Big Data

Big data applications are the driving force behind the current revolution in computer architectures and systems design. To compute the huge volumes of data demanded by big data analytics applications, large data sets in storage devices (the lowest layer of the memory hierarchy) need to be delivered to the local caches (the highest layer of the memory hierarchy) of the compute cores. As the amount of data required by modern big data applications increases, the data movement cost (latency and energy) in the memory hierarchy becomes more critical in data processing systems. Hence, such rapid growth in data set size demands architectural innovations in storage systems as well as in compute nodes.

Modern datacenter servers and cloud infrastructure deploy graphics processing units (GPUs) to support the massive data computation demanded by a wide range of big data applications. GPUs exploit a large number of concurrent thread contexts to support massively parallel computation with the single-instruction, multiple-thread (SIMT) architecture. Provisioning sufficient compute cores and large unified register files, GPUs support quick context switching among available threads to mitigate pipeline stalls caused by long-latency data fetch operations. However, the irregular memory accesses frequently observed in data analytics applications are a critical performance bottleneck of GPUs. Such memory requests cause significant congestion in the memory hierarchy, so the context switching mechanism of the SIMT architecture cannot cover the longer data fetch delays. In addition, the burst data requests from data-intensive applications lead to extremely low resource utilization in the GPU memory system.

As the data sets used by modern big data applications do not fit in the external memory (DRAM) of compute nodes, the data movement cost from storage devices to compute nodes becomes extremely high. Thus big data applications that rely on storage-resident data spend a significant fraction of their execution time on data input/output (I/O). Recently, high-performance datacenter servers have deployed solid-state drives (SSDs) that enclose NAND flash memory as the primary storage medium. SSDs provide lower access latency compared to conventional hard disk drives (HDDs), which employ magnetic platters as the storage medium. However, the access latency of the NAND flash memory enclosed in SSDs is still significantly higher than that of DRAM. In addition, the data traffic between compute nodes and storage devices is limited by the external interconnection network, which provides low data bandwidth. Hence, data movement from storage is still a significant performance bottleneck for big data applications even with enhanced non-volatile memory (NVM)-based storage devices.

In this dissertation, we propose architectural approaches to tackle these critical performance challenges in the parallel processors and storage systems widely deployed in high-performance datacenter server systems. Before presenting our contributions we briefly introduce the hardware/software architecture of GPUs and SSDs.

1.2 GPU Architecture

GPUs exploit massive thread-level parallelism (TLP) to maximize throughput with thousands of compute cores. In order to minimize power consumption overhead, the compute cores of a GPU have a simple in-order pipeline structure without the data forwarding, branch prediction, and register renaming found in modern out-of-order processors.
Instead, in order to hide pipeline stalls caused by execution latency and data dependencies, GPUs rely on tens of concurrent warps or wavefronts (groups of threads executed together) and quick context switching among all available warps. For instance, if one warp issues a data fetch instruction, which will take hundreds of cycles if a cache miss occurs, and the following instruction in the warp has a data dependency on the issued load, one of the other available warps is scheduled to issue instructions.

Figure 1.1: GPU architecture

Figure 1.1 shows the GPU hardware architecture, which is based on the NVIDIA GPU design disclosed to the public [69, 68, 2]. A single GPU is composed of multiple streaming multiprocessors (SMs) and shared L2 cache partitions connected via an interconnection network. Each SM is equipped with streaming processors (SPs, also known as CUDA compute cores) that execute integer and floating-point instructions, special function units (SFUs) for complex arithmetic instructions, and load/store (LDST) units. An SM has 32 SIMT lanes, where each lane is linked to the corresponding execution unit. Therefore, a group of 32 threads, called a warp (or a wavefront in AMD terminology), shares the same program counter (PC). A warp is the basic execution unit issued to the compute cores. Consequently, the number of compute cores determines the maximum number of warps that can be issued concurrently. For instance, Kepler has 192 compute cores per SM and thus can issue 6 warps simultaneously [68]. Issuance of warps is managed by warp schedulers, which select a warp to be issued from a pool of ready warps whose readiness is monitored by a scoreboard. As long as ready warps exist and execution units are available, an SM is able to execute instructions without stalls.

1.2.1 GPU Memory Hierarchy Architecture

In this dissertation we focus on the performance challenges of memory operations in GPUs. Hence, we describe the hardware architecture of the GPU memory system, including the LDST units in an SM, the cache hierarchy, and the interconnection network.

Figure 1.2: LDST unit and memory hierarchy of GPU

Figure 1.2 depicts the architecture of the LDST unit and the memory hierarchy for global load instructions. The GPU programming model supports different types of data spaces (local, global, shared, constant, and texture), which can be specified according to the programmer's purpose [64]. An SM also embeds different types of L1 caches (data, constant, and texture) to deal with these specific data spaces. In this dissertation we focus on the memory operation and hardware architecture for global space data (specified by .global or .local), since the memory instructions of general-purpose applications usually rely on the global memory space.
Moreover, the other data spaces are designed to support special operations such as texture mapping in graphics programs.

In an SM, memory-related instructions are managed by the LDST unit, which has several pipeline stages including operand collection, address generation, coalescing, and cache access. As a single warp is composed of dozens of threads, each memory access address is computed per thread. In the coalescing stage, multiple memory requests that access data within the same cache line region (128 bytes for NVIDIA GPUs) are merged into one memory request. Contrary to arithmetic instructions, which may exploit massive parallelism with more compute cores, only a limited number of memory instructions may access the cache at the same time. This limitation is due to the complexity of supporting multiple read/write ports on the memory structures. Thus the memory requests generated by the 32 threads in a warp are coalesced into one or two requests if these requests all access a contiguous data space of 64 or 128 bytes. Such coalescing hardware is common in GPUs. For instance, AMD GPUs have a coalescing unit for vector load operations [5].

GPU caches are designed with wide cache lines to maximize the throughput of memory operations, particularly in the presence of coalesced memory operations [27, 81]. The L1 data cache in Figure 1.2 handles requests bound to the global memory space. Memory requests from the coalescer access the L1 cache and occupy cache resources such as cache tags and MSHR entries if the requests encounter a cache miss. If no cache resource is available, trailing requests are stalled in the LDST unit pipeline until occupied resources are released, after one of the reserved requests receives data from the interconnection network. Missed requests are injected into the interconnection network queue and transferred to a destination L2 partition, which is composed of an L2 cache and a memory controller directly associated with external DRAM. Data fetches from DRAM take hundreds or even thousands of cycles, depending on data traffic and queuing delays in the interconnection network and memory channels [103, 47].

1.2.2 GPU Software Execution Model

Figure 1.3: GPU software execution model

Figure 1.3 shows the GPU software execution model. GPU applications consist of multiple kernels, which are massively parallelized tasks executable on GPU hardware. Each kernel is composed of multiple groups of threads called cooperative thread arrays (CTAs) or thread blocks. The dimensions of a CTA can be configured by programmers and are also limited by the GPU hardware. CTAs are allocated to SMs in a round-robin fashion until hardware resources such as the register file (RF) and shared memory in an SM are exhausted, whichever limit is reached first. The number of concurrent CTAs is also constrained by a GPU hardware limit (Fermi: 8 CTAs, Kepler: 16 CTAs) [69, 68]. Threads in a CTA are split into warps, whose width is identical to the SIMT lane width. Therefore, warps originating from the same kernel share the same kernel code and have similar characteristics [18].
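To make this execution model concrete, the short CUDA sketch below launches a hypothetical kernel over a one-dimensional grid; the kernel, array names, and launch configuration are illustrative assumptions rather than code taken from this dissertation. Each thread block (CTA) of 256 threads is split by the hardware into 8 warps of 32 threads, and each thread computes its global index as a linear function of its CTA id and thread id.

    // Minimal CUDA sketch of the kernel/CTA/warp/thread hierarchy (hypothetical example).
    __global__ void scale(float *out, const float *in, float a, int n) {
        // Global index: a linear function of the CTA id and the thread id.
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid < n) {
            // The 32 threads of a warp read 32 consecutive 4-byte words here,
            // so the coalescer merges them into a single 128-byte request.
            out[tid] = a * in[tid];
        }
    }

    // Host side: the grid of n threads is split into CTAs of 256 threads
    // (8 warps each); the hardware distributes the CTAs across SMs round-robin.
    // scale<<<(n + 255) / 256, 256>>>(d_out, d_in, 2.0f, n);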
1.2.3 Memory Operation: A Primary Performance Bottleneck

Figure 1.4: Breakdown of warp issue stall cycles

Figure 1.4 shows the ratio of warp issue stall cycles to total execution cycles for GPU benchmarks. The causes of warp issue stalls are classified as memory instructions (long-latency RAW stalls and load/store unit full stalls) and others (other stalls). When no warp can be scheduled due to a data dependency on an earlier long-latency load instruction, the cycle is categorized as a long-latency RAW stall. Memory instructions cannot be executed if resources such as the tags, queues, and MSHR entries of the load/store unit are all occupied by in-flight memory requests. Such a case is categorized as a load/store unit full stall, and it is resolved when the requests reserving the load/store unit resources are serviced by the L2 cache or external memory. The results shown in Figure 1.4 reveal that on average 38% (23% + 15%) of execution cycles are wasted on memory-related stalls across all tested benchmarks. For the first 14 benchmarks in particular, long-latency RAW stalls and load/store unit full stalls take up 40% and 20% of total execution cycles, respectively. Consequently, this data shows that memory operation latency that cannot be hidden is a prominent performance bottleneck of GPUs.

Figure 1.5: Fraction of idle cycles of execution units

Lack of available resources in the memory system is another source of performance bottlenecks in GPUs. Figure 1.5 shows the fraction of cycles in which each of the three main execution units (SP: compute core, SFU: special function unit, LD/ST: load/store unit) is idle. Each execution unit has a simple pipeline structure, so a warp can issue an instruction if the first pipeline stage of the corresponding execution unit is idle. On the other hand, an execution unit is unavailable to the currently ready warps if its first pipeline stage is still occupied by previous instructions. Therefore, occupation of the first pipeline stage signifies that the execution unit is busy. As shown in Figure 1.5, the LDST unit is occupied more frequently than the other execution units for most GPU benchmarks. This result implies that the resources of the LDST unit and the associated memory system can be frequently saturated, since the available resources fall short of the significant demand from many concurrent threads.

Figure 1.6: Breakdown of L1 data cache cycle

Figure 1.6, which shows the breakdown of L1 data cache cycles, also reveals wasted cycles in the memory system. When a memory request touches a cache line, it encounters one of three possible outcomes (hit, miss, or hit-reserved) if the request can be serviced from the cache. A hit-reserved is the case in which the current request address matches a reserved cache tag but the data for the cache line is still in flight from a previous cache miss. Clearly, a hit or hit-reserved outcome is preferable.
On a cache miss, resources such as tags and MSHRs are reserved for the outstanding request. If all cache resources are already reserved by previous outstanding requests, trailing memory requests that miss cannot be serviced from the cache. This is a reservation failure, and in Figure 1.6 we categorize the reasons for reservation failures as lack of available tags, MSHRs, and interconnection queue entries. In the reservation failure case, memory requests cannot be serviced from the cache until the occupied resources are released, so cycles are clearly wasted. We observe that on average over 70% of cache cycles are wasted due to reservation failures. This implies that memory requests frequently waste cycles due to the lack of available resources in the memory system.

1.3 SSD Architecture

1.3.1 Modern SSD Platforms

Figure 1.7: The architecture of an SSD platform

Figure 1.7 illustrates the hardware architecture of a modern SSD. Most modern SSDs contain several packages of NAND flash memory as the non-volatile storage medium. NAND flash manufacturers may use multiple-chip packages to further increase storage capacity within the limited space of circuit boards. This die-stacking technology also facilitates a multi-channel topology between the NAND flash dies and the SSD controller. Using the multi-way and multi-plane structure of a NAND flash die, an SSD can achieve even higher bandwidth on each channel. The smallest granularity for accessing flash memory is one page, which is around 4–16KB. Multiple pages (64–256 pages) are grouped into a block. Each NAND flash chip can read data in as little as 20µs, but no more than 200µs, depending on the technology node and the physical data location, while the write latency is about 20x longer [33]. To communicate with the host system, modern SSDs can use the NVM Express (NVMe) protocol over the underlying system interconnect, such as PCI Express (PCIe) [70]. In the PCIe Gen3 standard, one PCIe lane can provide up to 1 GB/s of bandwidth. A datacenter-class SSD uses 4 or 8 PCIe Gen3 lanes [78], allowing the device to obtain about 4 to 8 GB/s of external bandwidth.

In order to process host I/O requests and manage the flash memory, modern SSDs are equipped with a general-purpose multicore embedded processor that executes the SSD controller firmware. The firmware handles NVMe commands and data transfer between the host system and the NAND flash memory, manages the flash translation layer (FTL) table, and performs garbage collection (GC) and wear-leveling (WL). Modern SSDs also provision a large DRAM (e.g., 2 GB) to cache FTL tables and to buffer frequently used data. Due to these advanced features, SSDs can finish NVMe operations within tens of microseconds, thereby making the host-side command processing overhead a non-negligible fraction of the overall I/O access time [12, 48].

1.3.2 Flash Translation Layer

The read/write process of flash memory is significantly asymmetric due to its physical constraints. The data in a flash memory cell is readable at page granularity once the flash page has been written with valid data. On the other hand, writing a flash cell is only possible when the target page is empty.
If not, the target page must be erased before the write. Another critical constraint of flash memory is that erasure is possible only at the granularity of a block, which is a collection of hundreds of flash pages. Thus hundreds of pages in the same block must be erased before any page in the block can be rewritten. This erasure process is extremely slow compared to reads, since it requires a higher voltage and a longer time to reinitialize the bit cells in a block. Hence it is more efficient to write updated page data to an empty page rather than erasing the block and rewriting it. This mechanism is also desirable for increasing the lifetime of an SSD, since each block of flash memory can only be erased a limited number of times. This means that the physical page address (PPA) of flash data changes at runtime as data is updated, and the PPA is not equal to the logical block address (LBA) used by the file system. Consequently, SSDs maintain an LBA-to-PPA mapping table, called the FTL table, for address translation.

Figure 1.8: An example of FTL table

Figure 1.8 shows an example of the FTL table, here with a page-level mapping table structure. Since the read/write process in the SSD is managed at page-level granularity, each entry represents the mapping information of the corresponding page. The target PPA translated from the LBA is stored in the PPA field. The valid flag indicates the validity of the mapping information, and the metadata field contains additional data required for page management. The firmware also manages a block status table to track the page status (empty, valid, or invalid) of all pages in each block, which is then used to perform garbage collection and wear-leveling. The FTL table is accessed on each I/O command (read/write/flush) from the host. Thus every storage access command must check the FTL table first to reach the target data page. The following are the I/O operations issued to SSDs.

Write: When a write command is issued from the host CPUs, the write request is decoded into one or more page-level write commands in the SSD. The firmware searches for empty pages to hold the new data. Once the PPA of an empty page is determined for the target LBA, the PPA field of the FTL table entry is updated and the valid flag is set. The firmware then issues a flash write command for the target PPA and transfers the buffered data to the NAND flash controller. If a page update command is issued again for the same valid LBA, the firmware invalidates the current PPA and then assigns a new empty PPA for the updated data. The FTL is again updated.

Read: For each page-level read, the firmware looks up the FTL table to find the LBA-to-PPA mapping. If valid mapping information exists in the FTL table, a flash read command for the target PPA is issued to the NAND flash controller. The data returned from the flash memory is first buffered in the SSD DRAM and then transferred to the host DRAM via DMA.

Erase: Once a page-erase command is decoded, the firmware invalidates the FTL table entry for the target LBA. The status of the physical page is also set to invalid. This invalidated physical page is finally initialized by a block-erase command at some future time by the garbage collection function, once all the pages in the block have been invalidated. After that, the pages are marked empty and new data can be written.

Given the importance of the FTL in accessing data in SSDs, the SSD controller typically caches much of the FTL in the SSD DRAM for fast access. For a 2TB SSD using a 16KB page size, the total size of the FTL is about 1GB. As modern SSDs provision large DRAMs, even larger FTLs may be cached in memory.
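The C-style sketch below (written as host-side C/C++ so it matches the language of the listings elsewhere in this dissertation) illustrates the page-level read and out-of-place update paths just described. The structure fields and helper names are hypothetical simplifications, not the data structures of any real SSD firmware.

    // Hypothetical, simplified page-level FTL sketch; not actual SSD firmware code.
    #include <stdint.h>

    struct FtlEntry {
        uint32_t ppa;     // physical page address currently mapped to this LBA
        uint8_t  valid;   // 1 if the mapping holds current data
    };

    struct Ftl {
        FtlEntry *map;        // indexed by logical page number (LBA)
        uint32_t  num_pages;  // number of logical pages tracked
    };

    // Read path: translate the LBA to a PPA before issuing the flash read command.
    int ftl_lookup(const Ftl *ftl, uint32_t lba, uint32_t *ppa) {
        if (lba >= ftl->num_pages || !ftl->map[lba].valid)
            return -1;              // no valid mapping for this LBA
        *ppa = ftl->map[lba].ppa;   // the flash read command targets this PPA
        return 0;
    }

    // Write path: out-of-place update. The previous page is only marked invalid in
    // the block status table (not modeled here); garbage collection erases the block
    // later, once every page in it has been invalidated.
    void ftl_update(Ftl *ftl, uint32_t lba, uint32_t new_empty_ppa) {
        ftl->map[lba].ppa   = new_empty_ppa;
        ftl->map[lba].valid = 1;
    }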
1.3.3 Storage I/O Time – A Critical Performance Hurdle

Figure 1.9: Fraction of I/O and processing time

Even though enhanced storage media that benefit from advanced NVM technology are deployed in modern storage systems, the data I/O time from storage is still a critical performance hurdle for big data analytics. We measure the SSD access time (I/O time) and the CPU processing time (P-others and P-join) of the popular online analytical processing (OLAP) benchmark (TPC-H) [94]. Figure 1.9 shows our measurement results. P-join represents the CPU computation time for join processing; we describe an example of join processing in Section 5.2.2 of Chapter 5. Note that queries 2, 12, and 14 include join processing, which requires inter-table correlation among multiple database tables. P-others includes all other CPU computation time for data analytics except join processing. Our analysis reveals that data I/O time occupies a significant fraction (66.7% on average) of the total execution time of these data analytics applications. For join processing, which requires accesses to multiple tables, the database system spends more than 75% of its execution time on storage accesses.

1.4 Contributions

As the data size demanded by big data analytics grows exponentially, efficient data transfer and data utilization in memory systems are critical to the performance of modern data processing systems. As briefly introduced above, long data fetch latency and inefficient data utilization in the GPU memory hierarchy are primary performance bottlenecks of GPUs, which are harnessed as massively parallel compute nodes in datacenter servers. In this dissertation we first analyze the characteristics of data fetch operations in GPU memory systems, and we then present an efficient prefetch scheme and a novel cache management approach for GPUs. As data access to storage is also critical for big data applications, we further present an in-storage indexing mechanism that reduces data traffic from modern storage systems. This dissertation makes the following contributions:

In Chapter 2, we show that there are two distinct classes of load instructions, categorized as deterministic and non-deterministic loads. Using a combination of profiling data from a real GPU card and cycle-accurate simulation data, we reveal that there is a significant disparity in performance impact when executing these two types of loads.

In Chapter 3, we propose the CTA-aware Prefetcher and Scheduler (CAPS), consisting of a thread group-aware prefetcher and a prefetch-aware warp scheduler for GPUs. GPU kernels group threads into CTAs. Even though each thread within a CTA accesses regularly aligned data indexed by equations over thread IDs and CTA IDs, the starting address of each CTA is difficult to predict.
In order to achieve a performance benefit from prefetching, we propose an accurate prefetching scheme that predicts future access addresses using the detected base address of each CTA and the stride observed among threads within a CTA. We also present a prefetch-aware warp scheduler that improves the timeliness of the prefetcher by reorganizing warp priorities so that the required base address of each CTA and the per-load stride are detected as quickly as possible. Hence, CAPS allows prefetch requests to be issued sufficiently ahead of the demand requests. CAPS predicts addresses with over 97% accuracy and improves GPU performance by 8% on average, and by up to 27%, for a wide range of GPU applications.

In GPUs the data cache is not utilized efficiently, since the small data cache that must be shared across dozens of warps suffers from significant cache contention and premature data eviction. We propose Access Pattern-Aware Cache Management (APCM), which improves the utilization of cached data. In this work we discover that individual load instructions in a warp exhibit different types of data locality behavior and that data allocated by different types of loads has divergent re-reference intervals and lifetimes. APCM first dynamically detects the locality type of each load instruction by monitoring the cache accesses of one exemplary warp. APCM then uses the detected locality type to selectively apply cache bypassing and cache pinning of data based on the load locality characterization. Using an extensive set of simulations we show that APCM improves the performance of GPUs by 34% for cache-sensitive applications while saving 27% of energy consumption over the baseline GPU. Details of this study are presented in Chapter 4.

In Chapter 5, we propose an in-storage indexing mechanism that operates inside SSDs. Modern database systems exploit index structures to reduce unnecessary data accesses to storage. However, as the database size grows, the index structure itself occupies a large amount of storage space, so scanning indexes incurs frequent accesses to storage. Furthermore, index maintenance, which is required whenever the database is updated, is a very compute-intensive task. As SSDs enclose general-purpose embedded processors to manage the NAND flash memory, we propose to marshal this computation ability in the SSD to automatically build and manage data indexes. Our approach, called FLIXR, builds and maintains page-level indexes automatically whenever page data is written into the flash memory. FLIXR stores index data within the flash translation layer (FTL) for fast index lookups. The proposed mechanism updates the indexes alongside any data updates, and the cost of updating the index is entirely hidden in the page write delays. FLIXR further takes advantage of the page-level indexes to perform filtering or join processing in the SSD when the page data is demanded by host applications. FLIXR shows a 21.7% performance improvement for TPC-C benchmarks, which require index maintenance, and a 1.81x improvement in query response time for data analytics workloads.

Chapter 2: Revealing Critical Loads in GPU Memory Systems

2.1 Introduction

GPUs employ a large number of thread contexts and provide sufficient hardware resources to quickly switch between them. In spite of the enormous hardware resources expended to hide memory access latency, it remains one of the primary performance bottlenecks of GPUs, as introduced in Section 1.2.3.
Memory accesses to DRAM typically take hundreds of GPU cycles. To mitigate the impact of long memory latency, several warp schedulers and prefetching algorithms have been proposed [39, 51, 83, 30]. Many of the proposed prefetching and scheduling algorithms are application-oblivious mechanisms that do not take application-specific behaviors into account. Recently, as the importance of application-specific optimization has been emphasized, detailed application characterization studies have been conducted by many researchers [36, 17, 15, 11, 101]. The purpose of these studies is to identify specific application characteristics that may not map well to the GPU microarchitecture. For instance, some of these studies analyzed graph analytics and other irregular applications [15, 11, 101, 50] and showed that these applications have many uncoalesced memory accesses, which result in significant memory system bottlenecks. The common conclusions of these studies are that coalescing memory accesses can reduce memory traffic, that DRAM bandwidth is the main performance bottleneck, and that the warp scheduling algorithm plays an important role in preserving data locality in small caches. Most of these studies, however, only present aggregate statistics on how the memory system behaves over the entire application run. In particular, they do not consider how individual load instructions in a program contribute to the observed memory system behavior. However, as we show in this chapter, within a single application run two different types of load instructions are executed, and there is a significant disparity in performance impact when executing these two types of loads. We make the case that the GPU microarchitecture must handle these two types of loads differently in order to reduce memory system bottlenecks.

2.2 Methodology

2.2.1 Applications

As described in Section 2.1, the purpose of this study is to characterize the behavior of individual load instructions within an application and to identify the unique properties of different loads that result in a vast disparity in memory system interactions across loads. To aid in this goal we first select 15 GPU applications from various benchmark suites [16, 32, 11, 90]. The applications used in this study are briefly described in Table 2.1. Since the goal of this study is to characterize memory system behavior, we used large data sets where available. Large data sets stress the memory system, as the working sets do not fit within the level-1 or even the level-2 cache. The selected applications may be broadly grouped into three categories based on their functionality: linear algebra, image processing, and graph applications.
Category   Name        Description                                  Fraction of global loads
Linear     2mm [32]    matrix multiplication                        18.10%
           gaus [16]   Gaussian elimination                         3.04%
           grm [32]    Gram-Schmidt decomposition                   24.75%
           lu [32]     LU decomposition                             6.65%
           spmv [90]   sparse matrix dense vector multiplication    11.73%
Image      htw [16]    Heartwall tracking                           8.56%
           mriq [90]   MRI calibration                              0.03%
           dwt [16]    2D discrete wavelet transform                2.41%
           bpr [16]    back propagation for image recognition       3.71%
           srad [16]   speckle reducing anisotropic diffusion       3.57%
Graph      bfs [16]    breadth first search                         1.17%
           sssp [11]   single source shortest path                  5.66%
           ccl         connected component labeling                 5.78%
           mst [11]    minimum spanning tree                        1.19%
           mis         maximal independent set                      0.19%

Table 2.1: Application characteristics in three categories

Linear algebra: The linear algebra applications implement various matrix arithmetic operations. For effective parallel computation, large matrices are split into many smaller matrices, which are then mapped to parallel threads that run concurrently on a GPU. Splitting matrices and other vector data typically leads to fairly simple indexing mechanisms for accessing the individual elements of a sub-matrix: the sub-matrix data accesses by each thread are indexed using a linear function of thread ids and CTA ids (a minimal example is sketched at the end of this subsection). One exception is spmv, which handles sparse matrix computations. Only the non-zero elements of a large sparse matrix are stored, and hence when the sparse matrices are split, the resulting sub-matrix indices are computed from non-linear equations of thread and CTA ids.

Image processing: Image or video data can easily be represented as 2-D or 3-D arrays. Thus the input data sets for image processing applications can also be indexed with thread ids and block ids, similar to the linear algebra applications. However, many image processing algorithms have multiple sub-tasks which are executed in a pipelined fashion. Thus, once a sub-region of raw data (pixels or transformed data) is fetched into local memory in an SM, each sub-task processes the data and then passes its output on to the next sub-task in the pipeline. Even though image processing algorithms are basically series of linear algebra operations, the control paths of image processing applications may diverge, since specific algorithms are selectively applied based on the properties of the image data. For instance, data sets are replicated or padded with zeros for pixels outside the frame area when the wavelet algorithm used in dwt is applied to regions near frame boundaries. We also include bpr in the category of image processing applications, although bpr applies a machine learning algorithm to neural network layers. The underlying computation layer, however, uses pattern recognition algorithms which have strong similarities with many image processing algorithms.

Graph: The last group of applications implements graph algorithms. The input data sets for the graph applications consist of a large number of vertices connected via weighted edges. Graph applications visit vertices by traversing the edges, and thus the data fetched by a thread depends on the graph connectivity. For example, bfs performs the breadth-first search algorithm to traverse the vertices of a graph, so the index for fetching the data of the next vertex is determined by the index of the current vertex and the edge connectivity between the current and next vertices. Note that edges are randomly distributed between vertices in real graph data. Therefore, the indices used for data fetching are irregular for the graph applications.
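The following CUDA sketch of a dense matrix multiplication (a hypothetical kernel in the style of 2mm, not code taken from the benchmark itself) shows the kind of linear, launch-time-determined indexing described above for the linear algebra category: every load address is a function of blockIdx, blockDim, threadIdx, and the matrix dimension only.

    // Hypothetical dense matrix-multiply kernel illustrating linear indexing.
    // All load addresses below depend only on CTA/thread ids and the kernel
    // parameter n, which are fixed when the kernel is launched.
    __global__ void matmul(float *c, const float *a, const float *b, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // linear in CTA/thread ids
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < n && col < n) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[row * n + k] * b[k * n + col];   // regularly strided loads
            c[row * n + col] = sum;
        }
    }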
Therefore, indices for data fetching are irregular for the graph applications.
2.2.2 Experiment environment
Counter                        Description
gld_request                    Number of executed global load instructions per warp in an SM
shared_load                    Number of executed shared load instructions per warp in an SM
l1_global_load_hit             Number of global load hits in L1 cache
l1_global_load_miss            Number of global load misses in L1 cache
l2_subp0_read_hit_sectors      Number of read requests from L1 that hit in slice 0 of L2 cache
l2_subp1_read_hit_sectors      Number of read requests from L1 that hit in slice 1 of L2 cache
l2_subp0_read_sector_queries   Accumulated read sector queries from L1 to L2 cache for slice 0 of all the L2 cache units
l2_subp1_read_sector_queries   Accumulated read sector queries from L1 to L2 cache for slice 1 of all the L2 cache units
Table 2.2: The profiler counters used in this study [65]
The selected applications were written in CUDA and were compiled with the NVIDIA CUDA Toolkit 4.0 [63]. We collected a wide range of statistics by running these applications on the native GPU hardware as well as on a cycle-accurate software simulator. For measurement on real hardware, we utilize the CUDA Profiler [65] to measure the memory characteristics of each application while running on an NVIDIA Tesla M2050 GPU, which has 14 CUDA SMs operating at 1.15 GHz. The profiler counters used in this study are listed in Table 2.2.
GPU        Model            Tesla M2050 [55]
           Core             14 CUDA SMs@1.15GHz
           Memory           2.6GB, GDDR5@1.5GHz
           Comm.            PCI-E GEN 2.0
Simulator  Version          GPGPU-Sim v3.2.2 [7]
           Configs          Tesla C2050
           Core             14 CUDA SMs@1.15GHz, 32 SIMT width
           Register file    128KB
           Shared memory    48KB
           Const cache      8KB
           Texture cache    12KB
           L1D cache        16KB, 128B line, 4-way, 64 MSHR entries [62]
           L2D cache        Unified, 786KB, 128B line, 8-way, 32 MSHR entries
           Memory           GDDR5@1.5GHz
           ROP latency      120
           DRAM latency     100
Table 2.3: Experiment environments
We also ran all the applications on the GPGPU-Sim simulator [7] with the NVIDIA Tesla C2050 configuration parameters in order to collect other statistics that were not supported by the CUDA Profiler. It is our understanding that the Tesla C2050 and M2050 have identical architectures except that they use different heat sinks. Since simulation is a very slow process compared to running on native hardware, we only simulated applications until they commit the first billion instructions. The detailed specification of the native GPU hardware and the GPGPU-Sim simulator configuration is shown in Table 2.3.
2.3 Classifying Loads
While previous studies presented the aggregate memory system behavior of each application, we study the behavior of individual load instructions within each application. In this section we present one approach we used to classify the load instructions into two categories. This approach uses how load instructions compute their effective memory address to distinguish the two load categories.
int tid = blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x;
if (tid < no_of_nodes && g_graph_mask[tid]) {
    g_graph_mask[tid] = false;
    for (int i = g_graph_nodes[tid].starting; ...) {
        int id = g_graph_edges[i];
        if (!g_graph_visited[id])
            ...
    }
}
Code 2.1: bfs code example
As briefly described in Section 2.2, the threads in each of the three application categories use different types of indexing operations to compute the effective address of a data item they need to access.
Code 2.1 shows an example of how a thread in the bfs benchmark computes the effective address of a graph node that it accesses. In this code, the array g_graph_mask is indexed by tid, which is calculated as a linear equation of blockIdx.x and threadIdx.x, which are thread-specific parameters. Thus g_graph_mask is indexed entirely using thread ids and block ids. Since the thread and block ids are constant values that do not change from one run to another, the load instruction that accesses this array is termed a deterministic load. In other words, deterministic loads compute their effective address based on parameters that are known at the time a kernel is launched from the host to the device, and these parameter values do not change during the kernel execution. Furthermore, as the threadIdx.x value increases by one between threads in a warp, it is expected that consecutive threads within a warp access consecutive elements in the array. Hence, deterministic load instructions tend to generate coalesced memory accesses. On the other hand, g_graph_visited is indexed by id, which is itself loaded from another array, g_graph_edges. Since the index is computed from non-parameterized data such as user inputs, or is indirectly accessed using indices stored in another vector, we categorize such a load as a non-deterministic load.
In order to accurately classify a load instruction into the two categories, we rely on backward data flow analysis, which is routinely used in compiler analysis [4]. We trace the dependency graph backward for a source register that is used in the address computation of a load. We identify the parent instructions that define the source register. We then recursively trace the source registers of the parent instructions to identify the grand-parents of the load. Tracing back of source registers continues until it becomes clear how the load's source register is being defined. If the source register of a load is defined from parameterized data such as block ids, thread ids, and constant parameters then that load is classified as deterministic. In the CUDA environment it is easy to identify when a parameterized data value is used to define a load's source register: all the parameterized data values are loaded using a special ld.param instruction in CUDA. On the other hand, if a load's source register is defined from prior load instructions such as ld.global, ld.local, ld.shared or ld.tex, we classify that load as non-deterministic. While any load in the program can be classified using this approach, we only classify global load instructions into these two categories. The reason for this selection is that global load instructions cause the most significant performance hurdles in the memory system.
Figure 2.1: Deterministic and non-deterministic load distribution
Figure 2.1 shows the distribution of global loads into deterministic and non-deterministic load instructions. Non-deterministic loads are frequently found in graph applications, which traverse vertices indexed by user data. Most linear algebra and image processing applications require deterministic loads since those applications fetch data arrays in a regular fashion. However, spmv uses some non-deterministic loads as it fetches irregularly indexed arrays from a sparse matrix representation.
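The backward trace used for this classification can be summarized with a short sketch. This is a simplified model with assumed data structures (a map from each register to its defining instruction); the actual analysis operates on the compiled PTX code of each kernel.

#include <string>
#include <unordered_map>
#include <vector>

// One PTX-like instruction: its opcode and the source registers it reads.
struct Instr {
    std::string opcode;              // e.g. "ld.param.u64", "ld.global.u32", "add.s32"
    std::vector<std::string> srcs;   // source register names
};

// defs maps a register name to the instruction that defines it.
bool isDeterministic(const std::string &reg,
                     const std::unordered_map<std::string, Instr> &defs) {
    auto it = defs.find(reg);
    if (it == defs.end())
        return true;                              // %tid, %ctaid or an immediate: no producer
    const Instr &def = it->second;
    if (def.opcode.rfind("ld.param", 0) == 0)
        return true;                              // kernel parameter: known at launch time
    if (def.opcode.rfind("ld.", 0) == 0)
        return false;                             // value loaded from memory (global/local/shared/tex)
    for (const std::string &s : def.srcs)         // arithmetic op: recurse into its sources
        if (!isDeterministic(s, defs))
            return false;
    return true;
}

A register with no recorded definition is treated here as a thread id, CTA id, or immediate operand; the real analysis distinguishes these cases explicitly.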
Even in graph applications more than 50% of the global load instructions on average are deterministic loads, which typically produce coalesced memory accesses. Hence, the large number of uncoalesced memory accesses which was observed to cause significant performance hurdles in prior studies [101] actually originates from a smaller fraction of non-deterministic loads.
2.4 Impact of Non-Deterministic Loads on Memory Traffic
We instrumented GPGPU-Sim to track non-deterministic and deterministic loads separately. We counted the number of memory requests generated by each warp, as well as by each active thread within a warp, for the two load categories. As discussed earlier, non-deterministic loads rely on non-parameterized data to compute their effective addresses and hence it is highly probable that the addresses of such loads are not coalesced. Thus non-deterministic loads tend to generate multiple memory requests. GPUs coalesce data accesses from multiple threads in a warp if they all access consecutive memory locations. The coalescer sits before the L1 cache and hence each coalesced request generates one memory access request to the L1 cache. The average number of memory requests generated to L1 per warp and per active thread, for both the non-deterministic and deterministic loads, is shown in Figure 2.2. An active thread denotes a thread whose active mask is set as valid in a warp. The non-deterministic load data is shown under the label N and the deterministic load data is shown under the label D in the figure. This notation is used from now on whenever we present data for the two load categories.
Figure 2.2: Average number of memory requests per active thread and warp for deterministic and non-deterministic loads
It is clear from this data that, for benchmarks that have non-deterministic loads, the number of memory requests generated per load instruction is significantly higher for non-deterministic loads than for deterministic loads. The figure also plots the number of memory requests generated per load instruction per active thread. Ideally, when all 32 threads within a warp are active (when there is no branch divergence) and when a load request can be coalesced perfectly, each active thread generates only 1/32 of a memory request per load instruction. However, non-deterministic loads generate a vastly higher number of memory requests per thread. For instance, bfs generates on average 0.8 memory requests per active thread for each non-deterministic load. Thus about 26 memory access requests can be generated from a single non-deterministic warp load instruction if all threads of a warp are active. A large number of memory requests per non-deterministic load is agnostic to the benchmark category: spmv, even though it is a linear algebra program, generates six memory requests per warp for each non-deterministic load instruction.
The number of memory requests generated per load is a critical parameter that determines overall memory system behavior. If a large number of warps issue multiple non-deterministic loads within a short time window they cause significant resource contention in the memory hierarchy.
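As a first-order model of how this per-warp request count arises, the number of L1 accesses generated by one warp load is simply the number of distinct 128-byte lines touched by its active threads. The sketch below is an assumed simplification that ignores sector-level effects.

#include <cstdint>
#include <set>
#include <vector>

int requestsPerWarp(const std::vector<uint64_t> &addrs,   // one address per active thread
                    unsigned lineBytes = 128) {
    std::set<uint64_t> lines;
    for (uint64_t a : addrs)
        lines.insert(a / lineBytes);                      // threads sharing a line coalesce
    return static_cast<int>(lines.size());                // 1 if fully coalesced, up to 32
}

With 32 consecutive 4-byte accesses the warp touches a single 128-byte line and issues one request; with 32 scattered addresses, as in the bfs behavior described above, it can issue up to 32.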
GPU memory system is well designed to handle coa- lesced memory accesses where a single memory request reads a wide cache line worth of data and feeds all the threads within a warp. However, when non-deterministic loads are encountered multiple narrow-width read requests stress cache access resources. When a memory request accesses the cache line it encounters one of three possible outcomes: hit, miss and hit reserved. A hit reserved outcome implies the case when the current request address is found in cache tags but the data for that cache line is still in-flight from a previous cache miss request. Clearly hit or at least a hit-reserved outcome is preferable for each memory request. On a cache miss outcome the memory request tries to evict the cache line and fetch its own data. But such an eviction process may suffer an extremely long delay in GPUs. There are three reasons for why a cache line cannot be evicted. First is the case of reservation fail by tags which means that the current memory request cannot evict a cache line since the eviction candidate cache line itself is currently fetching in-flight data for a previous memory request. Until the in-flight data comes into the cache and is provided to the prior memory request the new memory request cannot be issued due to the lack of available cache tags. Second, even if there are available cache tags, the current memory request cannot be handled if the miss request has no available miss status handling registers (MSHR) to track the request. We call this event as reservation fail by MSHRs. Finally, even if the current miss has a tag and MSHR available there is 31 a limited interconnection bandwidth between L1 and L2. If the new request cannot be injected into the input buffers of the interconnection network then the miss cannot be serviced either. We call this event reservation fail by interconnection. If the memory request cannot be handled due to any of the above three cases of reservation failures, cache access is retried at a future time. Hence, cycles are wasted until the corresponding hardware resources become available in the cache. 2.5 Impact of Non-Determinism on Load Instruction Latency As memory traffic increases, the latency of memory operation is elongated due to re- source contention, such as queueing delays, congestion in memory partitions and in- terconnection network [31]. The bursty issuance of multiple memory requests from a single non-deterministic load can significantly degrade the service time of the memory sub-system. Furthermore, the longer latency of a non-deterministic load itself degrades overall performance since data dependency stalls warps until the non-deterministic load completes its execution. Figure 2.3 shows the turnaround time of a global load warp on average, which is the time from when the warp is issued to LD/ST units to the time when the load data is written back to the destination register. The turnaround time is the minimum delay in executing an instruction that is dependent on a global load. We present the data for deterministic and non-deterministic loads separately. 
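The access outcomes and reservation checks described above can be captured in a compact model. The sketch below uses illustrative structure and field names (they are not GPGPU-Sim's internal types) and is useful for interpreting the stall components in the turnaround-time breakdown that follows.

// Simplified model of the L1 access outcomes discussed above; the three
// resource checks mirror the three reservation-failure cases.
enum class L1Outcome {
    HIT, HIT_RESERVED, MISS_ISSUED,
    RSRV_FAIL_TAG, RSRV_FAIL_MSHR, RSRV_FAIL_ICNT
};

struct L1State {
    bool tagPresent, lineInFlight;    // state of the addressed line
    bool evictCandidateInFlight;      // victim line still waiting for its data
    int  freeMshrs, freeIcntSlots;    // available miss-tracking resources
};

L1Outcome accessL1(const L1State &s) {
    if (s.tagPresent)  return s.lineInFlight ? L1Outcome::HIT_RESERVED
                                             : L1Outcome::HIT;
    if (s.evictCandidateInFlight) return L1Outcome::RSRV_FAIL_TAG;    // retried later
    if (s.freeMshrs <= 0)         return L1Outcome::RSRV_FAIL_MSHR;   // retried later
    if (s.freeIcntSlots <= 0)     return L1Outcome::RSRV_FAIL_ICNT;   // retried later
    return L1Outcome::MISS_ISSUED;    // miss request sent toward L2
}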
The bottommost component is the memory system latency when the memory system is not loaded and a global load does 32 0 100 200 300 400 500 600 700 800 900 N D N D N D N D N D N D N D N D N D N D N D N D N D N D N D 2mm gaus grm lu spmv htw mriq dwt bpr srad bfs sssp ccl mst mis Linear Image Graph Cycles Un-loaded memory system latency Rsrv_fails by previous warps Rsrv_fails by a current warp Wasted cycles in L2 and DRAMs Figure 2.3: Average turnaround time of non-deterministic and deterministic loads not encounter any reservation failures or interconnections network queueing delays. The second bar (Rsrv fails by previous warps) represents the waste cycles while a current warp is waiting until resources of L1 data cache is available. Recall that a miss request needs an available cache tag, MSHR, and interconnection input buffers before it can be issued. The third component from the bottom shows the required cycles until the last request from the current warp is reserved in the L1 cache. All requests cannot be reserved instantaneously in the L1 cache if multiple requests are generated from a warp while most of cache resources are already occupied. Then trailing requests must wait even longer until cache resources are available even if former requests could access the cache. Since non-deterministic loads generate multiple memory requests, overheads due to these two types of reservation fails are higher. The topmost component represents wasted cycles in memory partitions including interconnection network, L2 caches, and DRAMs. Due to imbalanced traffic in memory channels and the difference in data paths (L2 caches and DRAMs), the flight time of memory requests diverge, however, the overall time is 33 determined by the lastly arrived data packets since completion of thread execution is synchronized within a warp. Not surprisingly the biggest difference between deterministic and non-deterministic loads is that non-deterministic loads are delayed for longer periods of time waiting for resource reservations as well as wasted cycles by imbalanced service time in memory partitions. As discussed earlier, as non-deterministic loads generate multiple memory requests they are more likely to encounter resource reservation stalls. Also, as the over- all turnaround time is determined by the lastly serviced requests, non-deterministic loads have a higher probability of getting delayed when they touch critical paths in the memory partition suffering from heavy data traffic. Due to these reasons, non-deterministic loads have longer turnaround time than deterministic loads and result in a significant perfor- mance bottleneck. Furthermore, bursty memory requests from a non-deterministic load can also adversely impact deterministic loads by increasing memory congestion. 0 500 1000 1500 2000 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 Turnaround cycles (issue to writeback) Number of memory requests generated bfs (0x0F0, N) bfs (0x110, N) bfs (0x148, D) sssp (0x9F0, N) sssp (0xA08, N) sssp (0x9C8, D) spmv (0x2A8, N) spmv (0x280, D) Figure 2.4: Load instruction turnaround time w.r.t number of generated requests 34 To take a closer look at how non-determinism impacts total turnaround time we iden- tify a few deterministic and non-deterministic loads from the bfs, sssp and spmv bench- marks. Figure 2.4 plots the turnaround time as a function of the number memory requests generated by each of the identified load in the three benchmarks. 
Each line graph is la- beled by the benchmark name followed by (in parenthesis) the PC of the specific load instruction and whether that load is categorized as non-deterministic (N) or deterministic (D) load. Each deterministic load creates one or two memory requests, irrespective of which benchmark that load is part of. Hence, even in graph applications which have been demonstrated to exhibit poor memory system behavior deterministic loads do not generate large memory traffic. On the other hand, the number of memory requests gen- erated by the non-deterministic loads varies from one instance of that load instruction execution to another instance. The same non-deterministic load instruction generates one to 32 memory requests per each warp during different instances of its execution. The randomness of the number of memory requests per each load is both a function of how many active threads are in the warp as well as how much coalescing is possible at some specific instances. Figure 2.4 shows that the average turnaround time of a warp increases with the num- ber of generated memory requests. An interesting point is that the average turnaround time of a warp of the deterministic load is similar to the average turnaround time of a non-deterministic load that creates one memory request. Hence, when non-deterministic 35 loads are able to coalesce there is no significant difference between the two load cate- gories. Consequently, longer turnaround time of the non-deterministic loads is primarily a function of the larger number of uncoalesced memory requests. Difference of arrival time between the first and the last data causes the increased turnaround time for a warp even if the average memory latency per each request is similar. 0 200 400 600 800 1000 1200 1400 1600 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 Cycles Number of memory requests generated Common latency Gap at L1D Gap at icnt-L2 Gap at L2-icnt BFS (0x110, N) Figure 2.5: Example of turnaround time breakdown for the non-deterministic load in- struction (PC: 0x110 in bfs) To provide further insights into how the number of memory requests generated im- pacts the total turnaround time we breakdown the turnaround time of one specific non- deterministic load that missed in L1, (PC: 0x110) from bfs as shown in Figure 2.5. The bottommost component is labeled as common latency, which represents the latency that each individual request pays if it encounters miss in L1 cache. The common latency is equivalent to a warp’s turnaround time subtracted by all wasted cycles from reservation fails and unbalanced paths in memory partitions. The additional latency encountered when the number of requests increase is broken down into three categories. The sec- ond component from the bottom is labeled as Gap at L1D, which is the added latency 36 paid to wait for all the resource reservations to complete before initiating the cache miss request. This component of the latency increases as the number of requests increases. This data confirms our prior observations that as non-deterministic loads generate more memory requests per each load instruction the reservation stall time increases. The next component is labeled as Gap at icnt-L2, which is the wasted cycles in accessing the in- terconnection network between L1 and L2 cache. This component does not change very much as the memory request count increases. 
The top most component is labeled as Gap at L2-icnt, which is the cycle gap between the firstly and the lastly serviced data in accessing the network between L2 and interconnection network. The large average cycle gap at L2-icnt increases with the number of memory requests as increasing traffic leads to congestion and imbalanced used of memory partitions and usage differences across different data paths between L2 cache partitions and their associated DRAMs. 2.6 Cache Miss Rate 0 0.2 0.4 0.6 0.8 1 N D N D N D N D N D N D N D N D N D N D N D N D N D N D N D 2mm gaus grm lu spmv htw mriq dwt bpr srad bfs sssp ccl mst mis Linear Image Graph Miss ratio L1 miss ratio L2 miss ratio Figure 2.6: L1 and L2 cache miss ratio for the non-deterministic and deterministic loads 37 Recall that all the data presented in the previous sections focused on global loads that missed in the L1 cache. In this section we show how often a global load hits in the L1 cache, in which case it would not have to suffer the long latencies. Even though lots of memory requests are generated per warp for a non-deterministic load, the burden on the memory sub-system is mitigated if requested data hits in the cache. Figure 2.6 shows L1 and L2 cache miss ratios of all the applications for both categories of load instruc- tions. The miss rates of both deterministic and non-deterministic loads exceed 50% in most cases, and in particular there is no significant improvement in the cache hit rate for deterministic loads. Furthermore, both non-graph and graph applications suffer from high cache miss rates. We surmise that prior studies that showed poor memory system behavior of graph applications [15, 11, 101, 71, 50] are essentially observing the poor behavior of non-deterministic loads that tend to occur more often in graph applications. Nearly in all cases it appears that L1 cache is highly ineffective in filtering accesses to the L2 cache. 0 1 2 3 4 2mm gaus grm lu spmv htw mriq dwt bpr srad bfs sssp ccl mst mis Linear Image Graph No. of shared memory load per global memory load Figure 2.7: Ratio of shared memory load and global memory load 38 One way to reduce the need for global memory is to use shared memory. GPUs provide programmers the ability to manage shared memory by making explicit memory allocation requests to the shared memory. With such explicit control data can be reused efficiently in shared memory with programmers’ effort. Figure 2.7 shows the number of shared memory accesses per each global memory access. This data is collected by the CUDA Profiler running on the GPU hardware. Image applications use share memory 2.5 times frequently than global memory. However, other application categories do not use shared memory effectively, and in most applications shared memory is entirely unused. Shared memory can be efficiently exploited in image processing because the same set of image data is processed multiple times through several steps of processing. This data also explains the lower overheads in LDST unit for the image processing application in spite of very high miss ratio in the L1 data cache. 2.7 Chapter Summary This study provides a detailed characterization of the memory system behavior of GPU applications. Rather than looking at the memory system behavior aggregated over all loads, this study classifies loads into deterministic and non-deterministic categories and presents how these two load categories exhibit vastly different memory system behavior. 
Deterministic loads whose address is calculated by the parameterized values are likely to have coalescing memory access patterns. On the other hand, non-deterministic loads whose address is determined by user input data and other non-parameterized values tend 39 to have uncoalesced memory access patterns. By measuring memory behaviors of these two types of load instructions separately, we found that non-deterministic loads are the primary performance bottleneck due to a large number of memory requests generated by their uncoalesced accesses. 40 Chapter 3 CTA-Aware Prefetching and Scheduling 3.1 Introduction Long latency memory operation is one of the most critical performance hurdles in any computation. GPUs rely on dozens of concurrent warps to hide the performance penalty of long latency operation by quickly context switching among available warps. When warps issue a long latency load instruction and following instructions are dependent on the load, they are descheduled to allow other ready warps to issue. Several warp schedul- ing methods have been proposed to efficiently select ready warps and to deschedule long latency warps to minimize wasted cycles by long latency instructions. For instance two- level schedulers [60, 30] employ two warp queues: pending queue and ready queue. 41 Only warps in the ready queue are considered for scheduling and when a warp in the ready queue encounters a long latency load instruction, it is pushed out into the pending queue. Any ready warp waiting in the pending queue is then moved to the ready queue. In spite of these advancements, memory access latency is still a critical bottleneck in GPUs, as has been identified in prior works [14, 45]. Especially, we observe GPUs can- not effectively hide long latency of load instructions as L1 cache misses occur in a burst manner for many memory-intensive applications. This leads to severe congestion in GPU memory subsystem to increase queuing delays superlinearly. Hence, many warps stall for hundreds of cycles as they end up waiting for its memory requests sent to L1 cache to be serviced. For instance our analysis for nearest neighborhood, a benchmark from CUDA SDK [66], reveals GPU pipelines are stalled for 62% of total execution cycles since all the warps end up waiting for the memory requests to be serviced from L1 cache. Such performance hurdle by memory operations will worsen as GPUs have supported more concurrent warps over generations while L1 cache size has not proportionally in- creased [68, 67]. Namely the number of L1 cache lines per warp has decreased, which leads to more bursty L1 cache misses due to severe interferences among warps [49]. To tackle this challenge researchers began to adopt memory prefetching techniques, which have been well-explored in the CPU domain, to the GPU domain. In CPU ap- plications strided accesses are common across multiple iterations of a load in a loop. Accesses to data arrays found in iterative loops seen in CPU applications get spread over a large number of concurrent threads in GPUs. GPUs use the notion of thread blocks 42 or cooperative thread arrays (CTAs) to manage these threads. Hence, inter-thread stride prefetching is a good way to capture the stride behavior in GPUs since regular strides are detected among different threads which access these data arrays indexed as a function of thread id. Prefetches can be issued for trailing threads using the base addresses and strides computed from the currently running threads [51]. 
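As a sketch of this idea (simplified per-PC bookkeeping with assumed names, not the hardware mechanism of [51]), a stride observed between two instances of the same load can be extrapolated to a trailing thread or warp:

#include <cstdint>
#include <unordered_map>

struct LastSeen { uint64_t addr; int id; bool valid; };
static std::unordered_map<uint64_t, LastSeen> history;   // keyed by load PC

// Called when instance 'id' (a thread or warp index, depending on the tracking
// granularity) executes the load at 'pc' with address 'addr'. Returns a
// predicted address for the instance 'distance' ids behind, or 0 if no stride
// can be computed yet.
uint64_t predictForTrailing(uint64_t pc, int id, uint64_t addr, int distance) {
    LastSeen &h = history[pc];
    uint64_t pred = 0;
    if (h.valid && id != h.id) {
        int64_t stride = (int64_t)(addr - h.addr) / (id - h.id);   // per-id stride
        pred = addr + (uint64_t)(stride * distance);               // extrapolate forward
    }
    h = {addr, id, true};
    return pred;
}

The prediction implicitly assumes that an id difference always maps to a fixed address offset; the rest of this chapter examines when that assumption breaks.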
Since GPUs execute a warp (or wavefront) as the basic execution unit inter-thread prefetching essentially must be implemented as inter-warp prefetching. 0 50 100 150 200 250 300 350 400 450 0% 20% 40% 60% 80% 1 2 3 4 5 6 7 8 9 10 Gap (cycles) Accuracy (%) Distance between load instructions Accuracy Gap Figure 3.1: Accuracy with stride-based inter-warp prefetch and cycle gaps by distances of warps One of the fundamental challenges with prefetching is the tradeoff between accuracy and prefetch distance. Accuracy is measured as the fraction of demand fetches that were correctly targeted by prefetching. Prefetch distance is defined as the cycle gap from when the prefetch is issued to when the demand fetch requests the data. Figure 3.1 shows the accuracy and prefetch distance of the simple inter-warp stride prefetching when warps execute loads of the same PC for the stride-friendly benchmark matrixMul. The x-axis in the figure indicates differences of warp ids from the baseline warp to the target warp. 43 The primary y-axis shows the accuracy. When the distances between warps are short then accuracy is high. The line plot shows the prefetch distance, which measures the difference of clock cycles between load execution by the baseline warp and the target warp. If the distance is just one, prefetch can be performed targeting the very next warp, then the number of cycles between the prefetch and demand fetch is just a few tens of cycles. Since global memory access takes hundreds of cycles a short prefetch distance cannot hide this access latency. In order to increase the distance between prefetch and demand fetch one has to increase the distance between the baseline and the target warps. But as we move along the x-axis prefetch accuracy drops gradually and then suffers a steep drop at a distance of seven. Further analysis shows that the main culprit for this drastic reduction in accuracy is that at the beginning of each CTA boundary the base address of the stride changes. When a prefetch is issued for a target warp in a different CTA, the prefetch address does not match the demand fetch. Since matrixMul has 8 warps per CTA, over the distance of seven every prefetch crosses the CTA boundary and the accuracy drops dramatically. That means the inter-warp prefetching is accurate for just a few neighboring warps within each CTA, which have low potential to hide memory latency, or the technique may send inaccurate prefetch requests, which will negatively impact performance. In order to overcome the limitations in accuracy and timeliness of prefetching in GPU, we present CTA-Aware Prefetcher and Scheduler (CAPS). The prefetch engine computes accurate prefetch addresses by detecting a baseline address per CTA and a 44 common stride in data arrays. The prefetch-aware warp scheduler works in conjunction with the prefetcher to improve the timeliness of prefetching. 3.2 Background 3.2.1 CTA Distribution GPU compilers estimate the maximum number of concurrent CTAs that can be assigned to an SM by determining the resource usage information of each CTA, such as the register file size and shared memory usage – the available resources within an SM must meet or exceed the cumulative resource demands of all the CTAs assigned to that SM. The GPU hardware places a limitation on the number of warps that can be assigned to each SM. For example, NVIDIA Fermi can run up to 48 warps in an SM. Thus if a kernel assigns 24 warps per CTA, each SM can accommodate up to two concurrent CTAs. 
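This per-SM CTA limit can be expressed as a simple minimum over the resource constraints. The helper below is a hedged sketch (the limits shown are illustrative Fermi-class numbers; real drivers also apply allocation-granularity rules).

#include <algorithm>

struct KernelUsage { int warpsPerCta; int regsPerCta; int smemPerCta; };
struct SmLimits    { int maxWarps; int maxCtas; int regs; int smemBytes; };

int maxConcurrentCtas(const KernelUsage &k, const SmLimits &sm) {
    int byWarps = sm.maxWarps / k.warpsPerCta;
    int byRegs  = k.regsPerCta ? sm.regs      / k.regsPerCta : sm.maxCtas;
    int bySmem  = k.smemPerCta ? sm.smemBytes / k.smemPerCta : sm.maxCtas;
    return std::min({byWarps, byRegs, bySmem, sm.maxCtas});
}
// Example: with Fermi-like limits {48 warps, 8 CTAs, 32768 registers, 48*1024 B
// shared memory} and a kernel that needs 24 warps per CTA, the warp limit alone
// caps an SM at 48 / 24 = 2 concurrent CTAs, matching the example above.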
For load balancing, current GPUs assign a CTA to each SM in a round-robin fashion until all SMs are assigned up to the maximum concurrent CTAs that can be accommodated in an SM. Once each SM is assigned the maximum allowed CTAs a new CTA is assigned to an SM only when an existing CTA on that SM finishes execution. As a result, irrespective of how the initial CTA assignment process starts eventually CTA assignments to SMs are purely demand-driven. 45 time CTA 0 CTA 7 CTA 1 CTA 9 CTA 2 CTA 6 SM 0 SM 1 SM 2 CTA 3 CTA 4 CTA 5 CTA 8 CTA 10 CTA 11 Figure 3.2: Example CTA distribution across SMs Figure 3.2 shows an example CTA distribution across three SMs. Assume that a kernel consists of 12 CTAs and each SM can run two concurrent CTAs. At the begin- ning of the kernel execution, SM 0, 1, and 2 are assigned two CTAs, one at a time in a round-robin fashion. Once the six CTAs are first allocated to all the SMs, the remaining six CTAs are assigned whenever any of the assigned CTAs terminates. When CTA 5 execution is finished first, CTA 6 is assigned to SM 2, and then CTA 7 is assigned to SM 0 after termination of CTA 3. Therefore, CTA assignments to an SM are determined dynamically based on CTA termination order. 3.3 Limitations of Prefetches in GPU 3.3.1 Intra-Warp Stride Prefetching Prefetching of strided data requests is a basic prefetching method that was explored for CPUs and has been shown to be effective when array data is accessed with regular indices in a loop [6]. In the context of a GPU application if each thread within a warp loads array data from memory repeatedly in a loop then stride prefetching is initiated to prefetch data for future loop iterations of each thread. Since each prefetch targets the load instruction 46 of a future loop iteration of the same thread within the warp, this approach is called intra- warp stride prefetching. Intra-warp stride prefetching was recently proposed for graph applications which have iterative irregular and diverged memory accesses [50]. The effectiveness of the intra-warp prefetching depends on the presence of load in- structions that are repeatedly executed in loops. But there is a growing trend towards replacing deep loop operations in GPU applications with parallel thread operations with just a few loop iterations in each thread. Thus deep loop operations are being replaced with thread-level parallelism with reduced emphasis on loops. Figure 3.3 shows the av- erage iteration number of the four common loads in the selected benchmarks, which are described in Section 3.6. We measured the execution number of each load, distinguished by the PC value, in a warp and picked the four most frequently executed loads. If a load instruction is part of a loop body then that PC would have repeatedly appeared in the execution window. Also the number of loads within loops over the total loads found in a kernel code is also shown under each benchmark name on the x-axis. These re- sults show that when a loop intensive CPU program is ported to CUDA (or OpenCL), loops are reduced to leverage massive thread level parallelism. This observation has also been made in a prior study that showed deep loop operations are seldom found in GPU applications [61, 64]. CUDA and OpenCL favor vector implementation over loops because of its scalability. By enabling thread level parallelism the software becomes more scalable as the hardware thread count increases. 
For instance, if the number of hardware threads double then 47 0 2 4 6 8 10 CP LPS BPR HSP MRQ STE CNV HST JC1 FFT SCN MM PVR CCL BFS KMN Numer of iterations 0/2 2/4 0/14 0/2 0/7 8/12 0/10 1/1 0/4 0/16 0/1 2/2 4/32 1/22 5/9 10/144 99 99 62 62 62 15 33 33 24 72 Figure 3.3: The average number of iterations for load instructions in a kernel. Repeated load instructions / total load instructions (by PC) under names of benchmarks each hardware thread is assigned half the number of vector operations without re-writing the code. Favoring thread parallelism, over loops, results in loss of opportunities for intra-warp prefetching. Thus a prefetch scheme should not only capture iterative loads appearing in a load, but it should also target loads that are not part of any loop body. 3.3.2 Inter-Warp Stride Prefetching Strides existing among warps can be extended to inter-warp stride prefetcher approaches [51, 87]. If regular offsets of memory addresses are detected between warps, then inter-warp prefetching detects a base address and a stride value across different warps based on warp id. Thus inter-warp stride prefetcher issues prefetches for future warps from a cur- rent warp using the base address and warp-id differences. The inter-warp prefetcher has potential to cover thousands of threads if prefetch requests are predicted correctly as a number of concurrent warps supported by GPU hardware increases. (Fermi: 48, Kepler: 64) [69, 68]. The CTA distribution algorithms employed in current GPUs limits applicability of inter-warp stride prefetching to warps within a CTA. As shown in Figure 3.2, SMs are not assigned consecutive CTAs. Thus within a CTA all the warps are able to see stride 48 CTA3 SM0 CTA2 SM2 CTA1 SM1 CTA0 SM0 W0 W0 W1 W1 W2 W2 W0 W1 W2 W0 W1 W2 W3 W3 W4 W4 W5 W5 Δ Δ Δ Δ Δ Δ Δ Δ D Figure 3.4: Irregular stride among CTAs assigned to the same SM accesses but the prefetcher is unable to prefetch across CTAs assigned to the same SM. This is because inter-warp prefetchers simply expect continuous warps access the next stride, regardless what CTA each warp belongs to. Even if we assume that there is an inter-warp stride prefetcher that also considers CTA id, as each SM does not run consecutive CTAs, it is hard to predict the accurate address without knowing the inter- CTA stride. The inter-CTA stride may differ from inter-warp stride, which means that the base address of a CTA is not always distant by inter-warp stride from the address of the last warp of the previous CTA, because load addresses are typically calculated by using CTA id aside from thread id and warp id. Thus accurate prefetch is limited to warps within a CTA. As quantified earlier in Figure 3.1 inter-warp prefetching suffers from loss of accuracy as we go across CTA boundaries. 3.3.3 Next-Line Prefetching The last category of GPU prefetching is next line prefetching, which fetches the next one or two consecutive cache lines alongside the demand line on a cache miss. The basic next line prefetch is agnostic to application access patterns and hence it leads to a significant increase in wasted bandwidth. Next line prefetching in conjunction with warp 49 scheduling policies for GPUs was proposed in [39, 38]. The proposed warp scheduler assigns consecutive warps to different scheduling groups. The warp in one scheduling group can prefetch data for the logically consecutive warp which will be scheduled later in different scheduling group. 
While the cache miss rate is in fact reduced with this next-line prefetching scheme, prefetch requests are issued too close to the demand fetch, resulting in small performance improvements.
3.4 Where Did My Strides Go?
In this section we provide some insights into how the GPU execution model perturbs the stride access patterns seen at the application level. Figure 3.5 shows two example codes, from the LPS and BFS benchmarks [7, 16]. The bold code lines (also shown in red color) on the left-hand side of each code box are the CUDA statements that calculate the indices used in accessing the array data (array d_u1 in LPS, and arrays g_graph_mask, g_graph_nodes, g_cost in BFS). The right-hand side of each box shows the corresponding equations for how the array indices are computed. Many GPU kernels use thread id and block id (also called CTA id) to compute the index values for accessing the data that will be manipulated by each thread [96]. Parameters such as BLOCK_X and BLOCK_Y are compile-time known values that can be treated as fixed values across all the CTAs in each kernel. Parameters such as blockIdx.x and blockIdx.y are CTA-specific values that are constant only across all the threads in a CTA. Thus load address computations rely on a mix of constant parameters, CTA-specific parameters and thread-specific parameters within each CTA.
(a) LPS [7]:
#define INDEX(i,j,j_off) (i + __mul24(j,j_off))
__shared__ float u1[3*KOFF];
i = threadIdx.x;
j = threadIdx.y;
i = INDEX(i, blockIdx.x, BLOCK_X);
j = INDEX(j, blockIdx.y, BLOCK_Y);
indg = INDEX(i, j, pitch);
active = (i<NX) && (j<NY);
if (active) u1[ind+KOFF] = d_u1[indg];
The index of d_u1 is computed as
indg = threadIdx.x + blockIdx.x * BLOCK_X + (threadIdx.y + blockIdx.y * BLOCK_Y) * pitch.
With C1 = blockIdx.x * BLOCK_X, C2 = blockIdx.y * BLOCK_Y, and C3 = pitch, this becomes
indg = threadIdx.x + C1 + (threadIdx.y + C2) * C3 = threadIdx.x + threadIdx.y * C3 + (C1 + C2 * C3) = threadIdx.x + threadIdx.y * C3 + Θ,
so warps within a CTA have fixed distances.
(b) BFS [16]:
int tid = blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x;
if (tid < no_of_nodes && g_graph_mask[tid]) {
    g_graph_mask[tid] = false;
    for (int i = g_graph_nodes[tid].starting;
         i < (g_graph_nodes[tid].no_of_edges + g_graph_nodes[tid].starting); i++) {
        int id = g_graph_edges[i];
        if (!g_graph_visited[id]) {
            g_cost[id] = g_cost[tid] + 1;
            g_updating_graph_mask[id] = true;
        }
    }
}
The address of g_graph_mask[tid] is
&g_graph_mask[tid] = &g_graph_mask[0] + (blockIdx.x * MAX_THREADS_PER_BLOCK + threadIdx.x) * 4.
With C1 = &g_graph_mask[0], C2 = blockIdx.x * MAX_THREADS_PER_BLOCK, and C3 = 4, this becomes
&g_graph_mask[tid] = (C1 + C2 * C3) + threadIdx.x * C3 = Θ + threadIdx.x * C3.
Likewise, the addresses of g_graph_nodes[tid] and g_cost[tid] are predictable.
Figure 3.5: Load address calculation examples
In the examples, values computed from CTA-specific parameters are represented as C1 and C2. The pitch value, C3, is the constant parameter used across all threads in the kernel. Thus each CTA needs to compute its own C1 and C2 values first to compute its base address, represented as Θ = C1 + C2 * C3. Once a CTA's base address is computed, each thread can then use its thread id (represented by threadIdx.x and threadIdx.y) and the stride value represented by C3 to compute the effective array index.
Therefore, the load address difference between two consecutive warps within each CTA is a fixed value, represented by the C 3 in the equation. This distance can be easily calculated at runtime by subtracting the load addresses of any two consecutive warps in the same CTA. This distance then can be used across all the CTAs. However, the CTA-specific constant values C 1 and C 2 must be computed for each CTA separately. Note that the base address of a CTA is cannot be predicted easily even if it appears to be a function of CTA id. Because this function varies from one load to another load instruction in the same kernel, and differs across kernels. Also inter-CTA distances (dif- ference of base addresses between two CTAs) in an SM is irregular. For example, CTAs (0,0), (3,3) and (7,2) are all initiated in the same SM for LPS in our simulation run. The example load shown in the LPS figure when executed in the same warp ids across dif- ferent CTAs do not exhibit any stride behavior across CTAs. For instance, the distance between the load address executed in the first warp of CTA(0,0) and CTA(3,3) is 5184, while the distance between the same load in CTA(3,3) and CTA(7,2) is 6272. Based on these observations, the prefetch address of all the warps within each CTA can be calcu- lated only once the base address and stride values are computed. The stride value can be computed by subtracting the load addresses of two consecutive warps within the CTA 52 for the same load. But the base address must be computed first by at least one warp associated with each CTA. Across a range of GPU applications we evaluated, the stride value in fact can be computed from two consecutive warps within a CTA. One exception is the indirect refer- ences that graph analytics applications normally use to find neighboring node and edges as shown in Figure 3.5b. In the BFS code g graph visited is indexed by variable id which is a value loaded from g graph edges[i]. Therefore, the address of these indirectly ref- erenced variables cannot be predicted using stride prefetcher. However, the metadata addresses (g graph mask, g graph nodes and g cost) are all thread-specific references and these addresses can be calculated using thread id and CTA id. 3.5 CTA-Aware Prefetcher and Scheduler LD/ST Unit Instruction Queue PC, Address, CTA id, warp id Base address, Stride Two-level Scheduler for CTAA Leading CTA/warp selector DIST Table PerCTA Table Prefetch Request Generator L1 Cache Figure 3.6: Hardware structure of CAPS In order to increase the accuracy of prefetching, CTA-Aware Prefetcher (CAP) com- putes prefetch addresses across all concurrent warps in an SM by detecting address 53 changes across CTA boundaries, and regular strides between warps within a CTA. In addition, Prefetch-Aware Scheduler (PAS), which can be implemented as a simple en- hancement to the two-level scheduler, cooperates with CAP to improve the timeliness of prefetching. Figure 3.6 shows overall hardware structure of CAPS. The prefetcher relies on PerCTA table to store the base address of each CTA which is computed as early as possible using a leading warp from each CTA. The DIST table is a single global struc- ture across all CTAs and it tracks the stride distance for a few loads that are selected for prefetching. The prefetch request generator consists of simple adder logic blocks to compute prefetch addresses using the base addresses and the strides from PerCTA and DIST tables. Prefetch requests access L1 data cache with lower priority than demand fetches. 
We will now describe the operation of the two components in details. 3.5.1 Prefetch-Aware Scheduler CAP requires the base address of each CTA and the stride between two warps within a CTA to generate prefetch requests for all warps across other CTAs. CAP tracks the base address for a given CTA by executing one warp. We call the warp that computes the base address of a given CTA as the leading warp, W lead . When a load of a certain PC is issued for the first time, the CTA containing the warp issuing the load is set as the leading CTA, say CTA lead . The warp issuing the load earliest becomes the leading warp of CTA lead naturally. Once we execute one more warp from CTA lead then it is possible to compute the stride value. Once the stride value is computed then we need to compute 54 the base address of each trailing CTA as early as possible. All other CTAs except the CTA lead will be referred to as the trailing CTAs, CTA trail . We pick one warp from each of the trailing CTAs to compute the base addresses. Each warp that is picked to compute the base address of the corresponding trailing CTA will become the leading warp of that trailing CTA. Using this terminology we describe the operation of PAS. A 0 A 1 A 2 B 0 B 1 B 2 C 0 C 1 C 2 Ready queue Pending queue A 0 A 1 A 2 B 0 compute phase B 1 B 2 C 0 C 1 BaseA Δ 2Δ BaseB D e t e c t i on : P r e f e t c h : Pr(B1) Pr(B2) BaseC Pr(C1) Pr(C2) (a) Conventional two-level scheduler A 0 B 0 C 0 A 1 A 2 B 1 B 2 C 1 C 2 Ready queue Pending queue A 0 B 0 C 0 A 1 compute phase A 2 B 1 B 2 C 1 BaseA BaseB BaseC Δ D e t e c t i o n : P r e f e t c h : Pr(B1) Pr(B2) Pr(C1) Pr(C2) (b) Prefetch-aware two-level scheduler Figure 3.7: Prefetch-aware warp scheduler Let’s assume that 3 CTAs (CTA A, CTA B, and CTA C) are running concurrently in an SM and each CTA is composed of 3 warps that exhibit stride behavior. The conven- tional two-level scheduler initially enqueues warps from each CTA to the ready queue in CTA order; warps of the CTA A are first enqueued to the ready queue and then the warps of the following CTAs are enqueued until the ready queue is filled up as shown in Figure 3.7a. In the example scenario warp A0 issues the load for the first time thus 55 the base address of the CTA A is detected. The stride is also detected when warp A1 executes. Even though we have the stride value the base address of CTA A is not useful for computing the prefetch address for CTA B and CTA C. Hence only when warp B0 is executed the base address of CTA B is known. Then it is possible to issue prefetches for the other warps (B1 and B2) in CTA B. Similarly only when C0 executes the base address of CTA C is known and prefetches can be issued for C1 and C2, but the prefetch distance is quite small since these warps are executed back-to-back. If the base addresses of all CTAs are computed eagerly by the two-level scheduler then prefetches could have been issued much earlier. Figure 3.7b shows the modified prefetch-aware scheduler. One leading warp (A0, B0 and C0) is selected from every CTA and these are enqueued in the ready queue first. Then warp A1 fills the remaining slots of the queue. As A0, B0 and C0 issue load instructions, base addresses of CTA A, CTA B and CTA C are computed eagerly. The stride can be computed after the issuance of A1. Since we know the base addresses of all CTAs and the stride value is same across all CTAs, prefetch addresses can be computed for other warps (B1, B2, C1 and C2) of trailing CTAs. 
Scheduler implementation: PAS is a simple enhancement to the conventional two- level scheduler. To implement the desired functionality we divide the ready queue into leading warp queue and trailing warp queue. The pending queue design is unaltered. One warp from every CTA is marked as a leading warp using a one-bit leading warp marker. PAS pushes warps that have this marker set to the front of the ready queue. Unlike the 56 conventional two-level scheduler that chooses the oldest ready warp, PAS chooses the leading warps first. Note that while we discussed PAS implementation on top of a two- level scheduler it is also possible to make simple enhancements to the loose round-robin scheduler to achieve the same effect of prioritizing leading warps. Also, in the GTO, when a warp is greedily scheduled until it encounters a memory operation, our approach can be applied by prioritizing the leading warps so that the leading warps are greedily scheduled until they compute the base address. Then the trailing warps can continue to execute. In our evaluations we implement the algorithm on the top of two-level scheduler to test the impact of PAS for better timeliness because the two-level scheduler already mitigates hurdles of memory operations by dispersing data fetch groups [30, 39]. Warp wake-up: To avoid eviction of prefetched data before demanding, the warps are woken up when the data arrives. If the warp is already in the ready queue, nothing happens. Otherwise, the warp is moved to the ready queue eagerly by pushing one of ready warps forcibly into the pending queue. The similar approach was proposed by OWL [38]. Only minimal change is needed for implementing the eager warp wake-up. When a warp sends a load request to L1 cache, the warp id is bound with the request so that the returned value is sent to the right warp. For the warp wake-up, the id of the warp that will be fed by the prefetched data is bound to the memory request. When the data arrives, warp scheduler is requested to promote the warp that is bound to the prefetch memory request. 57 3.5.2 CTA-Aware Prefetcher PerCTA table: The purpose of the PerCTA table is to store the base address of a targeted load from each CTA using the early base address computation from the leading warp. Since each CTA has its own base address it is necessary to store this information on a per CTA basis. Even though each leading warp in a CTA has 32 threads and hence can potentially compute 32 distinct base addresses (one per each thread) our empirical evaluations showed that prefetching is ineffective when the load instruction generates many uncoalesced memory accesses. Thus we only target those loads that generate no more than four coalesced memory accesses. A single 44 byte base address vector is used to store the base address of a targeted load within each CTA. In our design we only target prefetching at most four distinct loads (identified by their program counters) within each CTA. Hence, the PerCTA table has four entries. Each entry of PerCTA table stores the load PC, leading warp id, and base addresses. When a warp executes a load, the PC is used to search the entries in the corresponding PerCTA table. If the load PC is not found in the table then it indicates that no warp in that CTA has reached that load PC and hence the current warp is considered as the leading warp for that CTA. Then the leading warp id, load PC, and the access addresses from that warp are stored in the PerCTA table. 
Since the PerCTA table has limited number of entries, if there is no available entry in the PerCTA table, the least recently updated entry is evicted and the new information is registered in that entry. But in most of our 58 benchmarks the targeted prefetch loads are two to four load instructions and hence this replacement policy did not significantly alter the performance. DIST table: An entry in the DIST table contains the load PC, stride value, and a mis- prediction counter. Unlike PerCTA table, DIST table is shared by all CTAs since stride value is shared across all warps and across all CTAs. Each entry of DIST table is asso- ciated with a load instruction indexed by the load PC. When a load instruction is issued, the DIST table is accessed alongside the PerCTA table with the load PC. If no matching entry is found in the DIST table while an associative entry is found in the PerCTA table then it indicates that the base address of the CTA is already calculated while the stride value is not. Therefore, the stride needs to be calculated by using the stored base address and the current warp’s load address. Note that the stride computation between two warps can generate potentially up to four different values across the maximum of four distinct memory requests from the same load instruction. If the stride values for all memory requests between the two warps are not identical then we simply assume that the PC is not a striding load and the PerCTA entry for that PC is invalidated.If on the other hand the stride computation returns just one value then that stride value is stored in the DIST table. We also set the misprediction counter to zero at that time. To throttle inaccurate prefetches, the address of each prefetch is verified by compar- ing with the address of the actual demand fetch. Thus every warp instruction that issues a demand fetch also calculates the prefetch address to detect a misprediction. The mis- prediction counter increases by one whenever the calculated prefetch memory address is 59 not equivalent to the demand fetch. If the misprediction counter is larger than a threshold then no prefetch is issued to prevent inaccurate prefetches. Otherwise, prefetches are generated as described in the next section. The misprediction counter is one byte and the threshold is set to 128 by default. The entry structure of PerCTA and DIST tables described in this section is summa- rized in Table 3.1. Table Fields Total PerCTA PC (4B), leading warp id (1B), base address (44B) 21B DIST PC (4B), stride (4B), mispredict counter (1B) 9B Table 3.1: Database entry size of the prefetcher Handling indirect accesses: In indirect accesses regular strides between different warps are rarely observed since addresses of the indirect accesses are computed from random data fetched by other global loads as shown in the example of BFS code (Figure 3.5b). Thus incorrect prefetch requests for the indirect accesses can degrade the performance of memory systems, furthermore prefetcher resource is wasted. In order to avoid such situation indirect accesses are detected and excluded from prefetch process. Indirect accesses can be detected by tracing the source registers of global loads backward [47]. If the source register of a load is originated from thread ids, block ids, and constant parameters, CTA-aware prefetcher can predict prefetch addresses. Otherwise, the load is not considered for prefetch. 
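Putting the pieces of this section together, the two tables and the address generation they enable can be sketched in software as follows. This is a behavioral model with assumed container types and several simplifications (one coalesced request per load, no replacement policy, no misprediction throttle or indirect-access filter, and prefetches generated only for the CTA that issued the load), not an RTL description of CAPS.

#include <array>
#include <cstdint>
#include <vector>

struct PerCtaEntry { uint64_t pc; int leadingWarp; uint64_t baseAddr; bool valid; };
struct DistEntry   { uint64_t pc; int64_t stride; uint8_t mispredicts; bool valid; };

struct CapSketch {
    // One PerCTA table per concurrent CTA (8 on Fermi), one shared DIST table.
    std::array<std::array<PerCtaEntry, 4>, 8> perCta{};
    std::array<DistEntry, 4> dist{};

    template <typename T, size_t N>
    static T *find(std::array<T, N> &tbl, uint64_t pc) {
        for (auto &e : tbl) if (e.valid && e.pc == pc) return &e;
        return nullptr;
    }
    template <typename T, size_t N>
    static void allocate(std::array<T, N> &tbl, const T &entry) {
        for (auto &e : tbl) if (!e.valid) { e = entry; return; }
        tbl[0] = entry;                               // crude replacement of an old entry
    }

    // Called when warp 'warp' (index within CTA 'cta') issues the load at 'pc'
    // with coalesced address 'addr'. Returns prefetch addresses for trailing
    // warps once the CTA's base address and the common stride are both known.
    std::vector<uint64_t> onLoad(int cta, int warp, uint64_t pc, uint64_t addr,
                                 int warpsPerCta) {
        std::vector<uint64_t> prefetches;
        PerCtaEntry *base = find(perCta[cta], pc);
        DistEntry   *d    = find(dist, pc);
        if (!base) {                                  // first warp of this CTA at pc:
            allocate(perCta[cta], {pc, warp, addr, true});    // it becomes the leader
            base = find(perCta[cta], pc);
        } else if (!d && warp != base->leadingWarp) { // trailing warp: derive the stride
            int64_t stride = (int64_t)(addr - base->baseAddr) / (warp - base->leadingWarp);
            allocate(dist, {pc, stride, 0, true});
            d = find(dist, pc);
        }
        if (base && d)                                // generate prefetches for trailing warps
            for (int w = 0; w < warpsPerCta; ++w)
                if (w != base->leadingWarp && w != warp)
                    prefetches.push_back(base->baseAddr
                        + (uint64_t)((int64_t)(w - base->leadingWarp) * d->stride));
        return prefetches;
    }
};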
60 A 0 B 0 C 0 A 1 A 2 1 2 3 4 5 Leading CTA: CTA A CTA A: PC: A0, base addr 1 CTA B: PC: B0, base addr 2 CTA C: PC: C0, base addr 3 P e r C T A t a b l e Δ PC: stride(Δ) 4 D i s t t a b l e Prefetch for B1: addr(B0) + Δ 4 Prefetch for C1: addr(C0) + Δ P r e f e t c h r e q ue s t ge ne r a t or (a) Case 1: base addresses are settled before stride detection A 0 A 1 A 2 B 0 C 0 1 2 3 4 5 Leading CTA: CTA A CTA A: PC: A0, base addr 1 CTA B: PC: B0, base addr 4 CTA C: PC: C0, base addr 5 P e r C T A t a b l e Δ PC: stride(Δ) 2 D i s t t a b l e Prefetch for B1: addr(B0) + Δ 4 Prefetch for B2: addr(B0) + 2Δ P r e f e t c h r e q ue s t ge n e r a t or (b) Case 2: stride is detected before base ad- dresses are settled Figure 3.8: Cases for prefetch request generation 3.5.3 A Simple Prefetch Generation Illustration We illustrate the entire prefetch algorithm with a simple illustration showing how prefetches are issued. Prefetches are triggered under two different scenarios. In the first case, prefetch requests are generated when trailing warps of the leading CTA execute a load in- struction after the base addresses of the CTAs are registered to the PerCTA table by their leading warps. This case is illustrated in Figure 3.8a. The number in the circle above each warp id indicates the order of each warp’s load instruction issuance. We assume warps A0, B0 and C0 have already finished execution and they have updated the PerCTA table. But there is no stride value that is stored in the DIST table as yet since none of the trailing warps has been executed. When A1, which is a trailing warp of CTA A, issues the load instruction the stride (D) value is computed. Then the prefetcher traverses each of the PerCTA tables with the PC value of A1. Whenever the PC is matched in a given PerCTA table, the base addresses are read from the table and then new prefetches are 61 issued for the matched CTA. In this example as base addresses and stride are ready for CTA A and CTA B, prefetch requests for B1 and C1 are generated. The second scenario for prefetching occurs when the stride value is calculated before the base addresses of the trailing CTAs are registered to the PerCTA table. This happens when the leading warps of trailing CTAs are executed behind the trailing warps of the leading CTA. In spite of the best effort by PAS to prioritize all the leading warps to the front of the ready queue, it is possible that some of the trailing warps of leading CTA are executed ahead of the leading warps of trailing CTAs. Figure 3.8b shows an example of this case. In this example A1 executes ahead of B0 and C0. As A1 of CTA A issues a load instruction before B0 and C0 of CTA B and CTA C, thenD is computed and stored in DIST table before the PerCTA table is updated with base addresses for CTA B and CTA C. After B0 is issued with updating PerCTA table with the base address as in the first scenario, prefetch requests for other warps in CTA B are also generated by using the stride value that is already computed in DIST table. Thus in this scenario the leading warp of a trailing CTA enables to issue prefetches for all the trailing warps of its own CTA. 3.5.4 Hardware Cost CAPS uses two tables: DIST and PerCTA. One DIST table per SM, one PerCTA table per CTA. Both tables are accessed by a load instruction. By default, there are four entries per PerCTA and four entries per DIST table. In Fermi architecture, each SM can run at 62 most eight CTAs. 
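Using the entry sizes in Table 3.1, the per-SM storage follows directly: the DIST table requires 4 entries x 9 bytes = 36 bytes, and the PerCTA tables require 8 CTAs x 4 entries x 21 bytes = 672 bytes, for a total of 36 + 672 = 708 bytes.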
3.5.4 Hardware Cost

CAPS uses two tables: DIST and PerCTA. There is one DIST table per SM and one PerCTA table per CTA. Both tables are accessed by a load instruction. By default, there are four entries per PerCTA table and four entries per DIST table. In the Fermi architecture, each SM can run at most eight CTAs. Therefore, the area overhead per SM for these two tables is 708 bytes, as summarized in Table 3.2.

  Table    Configuration                           Total
  DIST     9 bytes per entry, 4 entries            36 bytes
  PerCTA   21 bytes per entry, 4 entries, 8 CTAs   672 bytes
Table 3.2: Required hardware for tables

We modeled CAPS at the RTL level and synthesized it with the 45 nm FreePDK library [1] to estimate its hardware cost. We assume the DIST table consists of arrays of simple flip-flops since its required data space is small. We used CACTI [97] to estimate the area and power dissipation of the PerCTA tables, which need more data entries because an SM runs multiple CTAs concurrently. The estimated area of CAPS is 0.018 mm², which occupies 0.08% of one SM given that the area of one SM measures 22 mm² based on the die photo of GF100. Under the 45 nm FreePDK library and CACTI configuration, it is estimated that CAPS consumes 15.07 pJ per access and 550 mW of static power.

3.6 Evaluation

3.6.1 Settings and Workloads

We implemented CAPS on GPGPU-Sim v3.2.2 [7]. The baseline configuration is similar to Fermi (GTX480) [69]; the detailed configuration parameters are listed in Table 3.3. We chose the Fermi architecture because GPGPU-Sim generates performance statistics that correlate very highly with an actual NVIDIA Fermi GPU [7]. However, our prefetching mechanism does not depend on a specific architecture because CAPS exploits base address computations for concurrent CTAs, which do not change in newer GPU architectures. Likewise, we believe our scheme is applicable to other architectures that run a kernel as multiple groups. For example, AMD GPUs execute OpenCL programs run by multiple workgroups, which are equivalent to CTAs, and a workgroup is split into wavefronts, the basic groups of threads scheduled in compute units [83]. Even though an AMD APU uses a unified system memory that can be accessed by both the CPU and the GPU, the GPU-side memory access patterns are determined by the application characteristics.

  Parameter          Value
  Core               1400 MHz, 32 SIMT width, 15 cores
  Resources / core   48 concurrent warps, 8 concurrent CTAs
  Register file      128 KB
  Shared memory      48 KB
  Scheduler          two-level scheduler (8 ready warps)
  L1I cache          2 KB, 128 B line, 4-way
  L1D cache          16 KB, 128 B line, 4-way, LRU, 32 MSHR entries
  L2 unified cache   64 KB per partition (12 partitions), 128 B line, 8-way, LRU, 32 MSHR entries
  DRAM               924 MHz, x4 interface, 6 channels, FR-FCFS scheduler, 16 scheduler queue entries
  GDDR5 timing       tCL=12, tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=6, tCDLR=5, tWR=12
Table 3.3: GPU configuration

We used 16 benchmarks selected from various GPU benchmark suites, as listed in Table 3.4. Most of the benchmarks are memory-latency-sensitive applications whose performance improves with reduced memory channel delays, so the influence of reduced memory latency can be tested. We also studied irregular applications (PVR, CCL, BFS, and KM) to test the quality control mechanism of CAPS for divergent memory accesses [71]. All applications were simulated until the end of their execution or until the simulated instruction count reached one billion. CAPS performance is compared to the baseline architecture using a two-level warp scheduler with a ready warp queue of 8 entries. Additionally, several previously proposed GPU prefetching methods are implemented to compare the relative performance benefits of CAPS.
  Benchmark                     Abbr.   Benchmark                     Abbr.
  Coulombic Potential [7]       CP      laplace3D [7]                 LPS
  backprop [16]                 BPR     hotspot [16]                  HSP
  mri-q [90]                    MRQ     stencil [90]                  STE
  convolutionSeparable [66]     CNV     histogram [66]                HST
  jacobi1D [32]                 JC1     FFT [21]                      FFT
  scan [66]                     SCN     MatrixMul [66]                MM
  PageViewRank [35]             PVR     Connected Comp. Label [101]   CCL
  Breadth First Search [16]     BFS     Kmeans [35]                   KM
Table 3.4: Workloads

3.6.2 Performance Enhancement

[Figure 3.9: Normalized IPC over two-level scheduler without prefetch]

Figure 3.9 shows the IPC of various prefetching methods normalized to the baseline configuration, which uses a two-level warp scheduler without prefetching. CAPS is our proposed CTA-aware prefetcher and scheduler. INTRA is the simple intra-warp stride prefetching introduced in Section 3.3.1, which produces prefetch requests when iterative loads have regular strides in a loop. INTER is an inter-warp stride prefetching that generates prefetch requests for loads having regular strides across warps, as described in Section 3.3.2. MTA is the many-thread-aware prefetching combining both intra-warp and inter-warp stride prefetching as described in [51]: it applies intra-warp prefetching for iterative loads having regular strides within a warp, and otherwise makes prefetch requests for other warps when regular address distances are detected among warps. Specifically, we implemented the hardware-based prefetcher among the various mechanisms presented in [51]. NLP is the simple next-line prefetcher introduced in Section 3.3.3, which sends a prefetch request for the next cache line when a cache line misses. LAP is the locality-aware prefetching built on top of the two-level scheduler, where a macro block of 4 cache lines is prefetched if two or more cache lines miss within that macro block of the L1 data cache [39]. ORCH is the orchestrated prefetching, where LAP is further enhanced with prefetch-aware warp scheduling as described in [39].

Figure 3.9 shows that CAPS improves overall performance by 8% on average, with up to 27% improvement for CNV. The performance of irregular applications is improved by 6% with CAPS, which shows that irregular applications can benefit from accurate prefetching of strided loads while divergent memory requests from indirect accesses are throttled. INTRA shows performance improvement for several applications since it uses intra-warp prefetching, which tends to benefit only loop-intensive kernels; in the absence of loops, intra-warp prefetching by itself is ineffective. The performance impact of INTER is negative, as it produces incorrect prefetch requests for trailing warps across CTA boundaries as described in Section 3.3.2. The performance gain of MTA is no better than that of INTRA, since MTA falls back to inter-warp prefetching when iterative loops are not detected within a warp, and inter-warp prefetching has a negative effect when multiple concurrent CTAs run in an SM. NLP does not provide better performance either, because prefetching the next line on a cache miss guarantees neither accuracy nor timeliness. LAP and ORCH improve performance by about 1%, which is lower than the improvements reported in [39]. Note that in [39] the larger performance improvements were shown on top of a round-robin scheduler.
The authors also showed that when a 2-level warp scheduler is used in the baseline, the performance improvements are much smaller. Hence, the 1% performance improvement is in line with prior work results.

[Figure 3.10: Performance by number of concurrent CTAs]

Figure 3.10 shows the performance of the prefetchers as a function of the number of CTAs assigned to each SM. By default Fermi allows 8 CTAs, and the newer Kepler architecture runs up to 16 CTAs per SM [69, 68]. The increasing CTA count accommodated per SM only makes CTA-aware prefetching even more critical. The figure shows the average IPCs normalized to the average IPC of the base configuration (maximum concurrent CTAs = 8). As shown in Figure 3.10, the performance of INTRA and MTA is better than that of the other prefetchers when only one CTA is allowed. The reason for this higher performance is that these prefetchers do not have to cross CTA boundaries. CAPS does not provide any benefit in this case since by definition it prefetches across CTAs. But note that curtailing CTAs to remove uncertainty in data accesses at CTA boundaries is not beneficial: the performance of all prefetchers with one CTA is worse than the baseline that uses 8 CTAs without any prefetching. Hence, the performance gained by increasing the CTA count cannot be overlooked. As the number of concurrent CTAs increases, CAPS outperforms the other prefetchers. On the other hand, the performance of MTA degrades because of increasing discontinuities at CTA boundaries. This performance trend shows why CAPS works well for modern GPU architectures, which exploit massive thread parallelism using many concurrent CTAs.

3.6.3 Coverage and Accuracy of Prefetching

[Figure 3.11: Prefetch coverage and accuracy. (a) Coverage. (b) Accuracy.]

Figure 3.11 shows the coverage and accuracy of the prefetchers. Coverage is defined as the ratio of the number of issued prefetch requests to the total number of demand fetch requests. A higher coverage ratio does not always improve performance, since inaccurate prefetches only consume resources such as cache space and memory channel bandwidth. Accuracy is the ratio of correctly predicted prefetch requests, i.e., prefetches whose data is actually consumed by later demand requests. Accuracy is an important performance factor for a prefetcher because unnecessarily prefetched data increases bandwidth and causes cache pollution. The GPU's memory subsystem is remarkably vulnerable to inaccurate prefetches since local cache resources are already under immense pressure from the requests of thousands of threads in an SM.
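For reference, the two metrics used in this section can be written as simple ratios; this formalization assumes that accuracy is measured against all issued prefetch requests, which is how the term is used in the discussion that follows.

    \text{Coverage} = \frac{\text{prefetch requests issued}}{\text{demand fetch requests}},
    \qquad
    \text{Accuracy} = \frac{\text{prefetches whose data is consumed by a demand fetch}}{\text{prefetch requests issued}}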
CAPS on average provides 18% coverage. Coverage is lower for PVR, BFS, and KM, which traverse graphs and have irregular data access patterns caused by indirect memory accesses. These loads are excluded from prefetching, thereby curtailing coverage. Nevertheless, CAPS can still generate prefetch requests with a high accuracy ratio for the strided loads in irregular applications. CAPS also has a low coverage ratio for HSP. These applications have irregular strides between warps, so CAPS recognizes mismatches between prefetch and demand requests and quickly stops issuing wasteful prefetch requests.

The accuracy of CAPS is very high for most benchmarks. Even though divergent memory accesses are frequently observed in irregular applications, CAPS employs the throttling mechanism to avoid indirect accesses that would result in inaccurate prefetch requests. Furthermore, CAPS uses the PC-based detection mechanism for inaccurate prefetch requests, described in Section 3.5.2, to quickly recognize mispredicted prefetch requests and shut down prefetching for the corresponding loads. Thus CAPS achieves 18% coverage with very high accuracy (97%). Low accuracy is the main reason for the low performance of INTER and MTA despite their high coverage ratios. INTRA has decent accuracy and coverage when applications have deep iterative loops and regular access patterns, such as LPS and STE; nevertheless, the performance of INTRA is constrained by a large amount of useless prefetches resulting from prefetches issued too early, as shown in Figure 3.13a.

3.6.4 Bandwidth Overhead

[Figure 3.12: Bandwidth overhead by prefetching. (a) Fetch requests from cores. (b) Data read from memory.]

Increased data bandwidth is a prefetcher's most prominent overhead, since the additional request and data traffic consumes memory channel resources. Figure 3.12 presents the bandwidth overhead of the prefetchers. Figures 3.12a and 3.12b respectively show the increase in data request traffic from SMs to memory partitions and the increase in read data traffic from DRAM, compared to a baseline with no prefetching. If prefetched data is actually consumed by a later demand fetch, the increase in data traffic should be minimal; otherwise, wasted prefetches increase data bandwidth and degrade performance. As shown earlier, CAPS has very high accuracy, hence most of the prefetched data is consumed by a demand fetch later. INTER, on the other hand, increases traffic by over 2x since its coverage is high but its accuracy is low. MTA also increases data bandwidth significantly for the same reason. The bandwidth overhead of CAPS is less than 3%.

3.6.5 Timeliness of Prefetching

[Figure 3.13: Timeliness of prefetching. (a) Early prefetch ratio. (b) Prefetch distance of timely prefetches.]

When a prefetch is issued too early, the prefetched data can be evicted before the target demand load is issued due to the limited size of the L1 data cache. Such early prefetches only increase memory traffic without benefit. As stated in Section 3.5.1, CAPS adjusts warp priority to detect the stride and the base addresses of CTAs as early as possible, increasing the distance between prefetch and demand requests. Additionally, a warp in the pending queue is awakened when the corresponding prefetched data reaches the L1 data cache. Hence, CAPS can adjust the prefetch timing for target load instructions effectively to improve performance. Figure 3.13a shows the percentage of prefetched data that was evicted before use.
On average only 0.91% of the prefetched data was evicted from L1 by an intervening demand fetch before the prefetched data was consumed. The early prefetch ratio increases to 1.16% even without the early wake-up mechanism. Hence, for a small fractional increase of the early prefetch ratio it is possible to completely avoid the early wake-up process, if desired.

If the distance between the prefetch and demand requests is too short, the prefetcher cannot effectively hide the long memory latency. Given that GPU memory operations take hundreds of cycles, prefetch requests should be issued sufficiently far ahead of the demand requests. Figure 3.13b shows the distance between prefetch and demand requests when CAPS is applied with various schedulers. When CAPS is implemented on the unmodified round-robin and two-level schedulers, the average distances between prefetch and demand requests are 64.3 cycles and 145.0 cycles respectively. When CAPS works cooperatively with the prefetch-aware scheduler, CAPS issues a prefetch request on average 172.7 cycles before the target demand request. This result shows that the prefetch-aware scheduler effectively improves the timeliness of CAPS.

3.6.6 Energy Consumption

[Figure 3.14: Energy consumption by CAPS]

Figure 3.14 shows the energy consumption of CAPS normalized to the baseline configuration. We estimate the energy consumption of CAPS and the baseline GPU with GPUWattch [53]. The power metrics of CAPS are estimated with CACTI and the synthesized RTL models based on the 45 nm technology node, as mentioned in Section 3.5.4. The simulation results show that CAPS consumes 2% less energy on average.

3.7 Chapter Summary

Due to the nature of their computations, GPU applications exhibit stride access patterns. But the starting address of a stride access is a complex function of the CTA id, the thread id, and other application-defined parameters. Hence, detecting stride patterns across CTAs is a challenge. To tackle this challenge we propose a CTA-aware prefetcher and scheduler for GPUs. CAPS hoists the computation of the base address of each CTA by scheduling one leading warp from each trailing CTA to execute alongside the warps of the current leading CTA. The leading warps compute the base address for each trailing CTA, while the stride value is detected from the execution of the trailing warps of the leading CTA. Combining the per-CTA base address with the global stride value that is shared across all CTAs, CAPS is able to issue timely and accurate prefetches. The evaluation results show that CAPS predicts prefetch addresses with over 97% accuracy and improves performance by 8% on average and by up to 27%.

Chapter 4

Cache Management for Improving Data Utilization in GPU

4.1 Introduction

GPUs exploit thread level parallelism (TLP) to maximize throughput with thousands of compute units. But the GPU's memory hierarchy is designed for regular address accesses generated from vector-based operations such as those seen in multimedia applications. Hence, even though a warp or a wavefront (a collection of 32 threads) may generate many memory requests, in practice it is expected that most of the individual thread requests can be coalesced into a single wide memory access. As such, GPUs are provisioned with 128-byte wide cache lines so that a warp can fetch its data from a single cache line.
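As a concrete illustration of this design point, the hypothetical CUDA fragment below (not drawn from the evaluated benchmarks) has each of the 32 threads of a warp read one consecutive 4-byte element, so the coalescer merges the warp's requests into a single 128-byte access that maps onto exactly one cache line.

    // Each thread loads one 4-byte float; threads 0..31 of a warp touch
    // 32 consecutive floats = 128 bytes, i.e., exactly one L1 cache line.
    __global__ void scale(float* __restrict__ out, const float* __restrict__ in,
                          float k, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive index per thread
        if (i < n)
            out[i] = k * in[i];   // the warp's 32 loads coalesce into one 128B access
    }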
Traditionally GPUs have been provisioned with a small L1 cache which can accommodate only a few wide cache lines. For instance, NVIDIA's Kepler architecture has 16KB of local data cache (128 cache lines) but allows up to 64 concurrent warps per core [68]. Theoretically, each warp has just two lines' worth of data that can be held in the local cache. To reduce cache contention from multiple warps, warp throttling schemes have been proposed [83, 44, 19, 54]. These schemes essentially reduce the number of active warps contending for the cache. However, in this work we show that each load instruction within a warp exhibits a specific type of data locality and would benefit from cache management that is tuned for that specific load instruction. As such, warp-level cache management schemes, such as bypassing the cache for an entire warp, are generally too coarse since they cannot take advantage of per-load locality information.

[Figure 4.1: Miss rate change per load by warp throttling]

To demonstrate this observation, Figure 4.1 shows the change in L1 data cache miss rate for the top four (based on dynamic access count) global load instructions in cache sensitive applications, configured with the maximum TLP (48 active concurrent warps allowed per streaming multiprocessor) and with the best-case static warp throttling, which limits the number of active warps to the count that maximizes performance (CCWS-SWL-best) [83]. As can be seen in the figure, throttling warps has an uneven impact on cache miss rate reduction across the global loads. For the BFS benchmark, the miss rate of Ld2 is dramatically reduced with warp throttling, but the other three loads (Ld1, Ld3 and Ld4) see only a marginal reduction in miss rate. The reason for the improved miss rate for Ld2 is that the accesses from that load exhibit strong temporal locality. But when dozens of active warps are accessing the cache, this temporal locality cannot be exploited because requests from other warps evict the cache lines before temporal reuse. With warp throttling, the cache lines that were demand fetched by Ld2 stay in the cache longer, thereby improving the possibility of temporal reuse. On the other hand, the three other loads do not exhibit much temporal locality; in fact, Ld4 exhibits streaming behavior. Streaming cannot benefit from warp throttling; rather, what is needed for Ld4 is to bypass the cache entirely, which eliminates unnecessary cache occupancy and thereby allows Ld2 to exploit temporal locality even further.

The above analysis shows that warp-based cache management does not exploit per-load cache preferences. In other words, with per-warp cache management schemes, cache space is still wasted by streaming data in active warps. On the same note, a warp-level cache bypassing scheme is also not appropriate, since any load instruction that exhibits strong temporal locality is also forced to bypass the cache if the loads are issued from bypassing warps [54]. These observations motivate the need for a cache management scheme that uses the per-load locality behavior. Another interesting observation in GPU applications is that each global load instruction has a fairly stable behavior during the entire application execution time.
Namely, whether a load instruction benefits from warp throttling or from cache bypassing is independent of the warp ID or of when that load is executed in the code. This property stems from the GPU's unique software execution model, where all warps originate from the same kernel code. Hence, the cache-preference properties of a load, such as the data locality type or cache sensitivity of a certain load instruction detected in one warp, can be widely applied to the same load's execution in all other warps.

4.2 Cache Access Characteristics

In this section we study the cache access characteristics of several benchmarks selected from various GPU benchmark suites. We classify the selected benchmarks into three types: cache sensitive (CS), cache moderate (CM) and cache insensitive (CI). We use the following criteria for classification: the IPC of CS benchmarks improves by more than 1.5x when a cache four times the baseline size of 32KB is used. On the other hand, the performance change of CI benchmarks is less than 10% when using the 128KB cache compared to the 32KB cache baseline. The performance of CM benchmarks falls in between the two extremes. The benchmarks studied in this work are listed in Table 4.1.

  Abbr.   Description
  Cache Sensitive (CS)
  BFS     Breadth-First Search [16]
  KMN     K-means [16]
  IIX     Inverted Index [35]
  WC      Word Count [35]
  GC      Graph Coloring [101]
  SSP     Single-Source Shortest Path [11]
  Cache Moderate (CM)
  SPMV    Sparse-Matrix Dense-Vector Multiplication [90]
  MM      Matrix Multiplication [35]
  SS      Similarity Score [35]
  CCL     Connected Component Labeling [11]
  Cache Insensitive (CI)
  GE      Gaussian Elimination [16]
  SRD     Speckle Reducing Anisotropic Diffusion [16]
  MRI     Magnetic Resonance Imaging - Gridding [90]
  SGM     Register-Tiled Matrix-Matrix Multiplication [90]
  STN     Stencil 2D [21]
  APS     All Pairs Shortest Path [101]
Table 4.1: Benchmarks

4.2.1 Data Locality Type

The locality exhibited by data fetched by each warp load instruction can be broadly classified into four types: streaming, inter-warp, intra-warp and inter+intra-warp locality. Streaming data is brought into the data cache on a demand fetch but is never reused; hence it has zero temporal locality. If the data fetched by a load instruction from one warp is also accessed by the same load PC in other warps, it exhibits inter-warp locality. If the data fetched by a load instruction from one warp is used exclusively within that same warp, the data is said to exhibit intra-warp locality [83, 84]. The last category is data that is brought into the cache by one warp and then repeatedly re-referenced by other warps as well as the original warp.

[Figure 4.2: Ratio of data regions by data locality types]

Figure 4.2 shows the breakdown of all data accesses into the four categories. To generate the data in Figure 4.2, we simulated an infinite sized L1 data cache so as to get a fundamental understanding of how data is reused within and across warps, without worrying about cache replacement interference. The ratio of each type in the figure is computed as the number of cache lines (of 128-byte size) having a specific locality type divided by all allocated cache lines in the infinite cache. Comparing the data presented in this figure with the benchmark categorization, it is clear that CI applications have a large fraction of streaming and some inter-warp locality type data.
As such, providing a larger cache to these applications will not improve the miss rate. Note that STN is categorized as CI in Figure 4.2, although it has some inter-warp locality. Further analysis revealed that most of this locality exists within a short time interval: once a cache line is brought into the cache it is consumed in quick succession by multiple warps. As such there is no need to preserve cache lines for a long time period and hence STN is categorized as CI. CS applications exhibit strong intra-warp locality, so these applications have the potential to reuse data in the L1 cache if the data is not evicted before re-reference. CM applications have a mix of streaming and intra-warp locality type data, hence it is likely that cache utilization suffers from the interplay of streaming and non-streaming data, where the streaming data may evict non-streaming data that has temporal reuse.

4.2.2 Loss of Locality in Cache

[Figure 4.3: Ratio of number of cache lines by access count]

The locality properties seen clearly with the infinite sized cache are completely obliterated when using a finite sized cache, when tens of concurrent warps share a small L1 cache. Figure 4.3 shows two stacked bars for each benchmark. Each category in the stacked bar shows the access count to a given cache line before it is evicted. The left stacked bar presents the data for a 16KB L1 data cache and the right bar is for an infinite sized cache. Looking at the 16KB data, the results reveal that most of the cache lines are accessed only a few times (in fact just once or twice). This data may give the impression that the data has limited reuse, although our earlier results showed that there is plenty of reuse in the cache sensitive applications. The main reason for the dramatically different view seen with the 16KB cache is that the data locality that exists for a given load in one warp is severely perturbed by the tens of other warps running concurrently. Thus frequent cache line eviction is the biggest culprit that hides the fundamental data sharing behavior that is prevalent in many of the benchmarks. This fact is evident when looking at the right stacked bars, which simulate an infinite data cache. When the cache size is infinite there is no eviction, and hence much of the intra-warp locality that was seen in Figure 4.2 is clearly manifested in the results. This data implies that identifying the locality types correctly is not feasible with the small local cache, since sharing behavior is frequently lost due to severe thrashing.

4.2.3 Access Pattern Similarity

As threads originating from one kernel share the same program code, it is intuitive to expect that the same load instruction executed in different warps in an SM exhibits similar data access behavior. To investigate this intuition we use the notion of access pattern similarity (APS) exhibited by a load instruction across multiple warps in a given kernel. APS is quantified using Equation 4.1:

    APS = \frac{\sum_i \max_t \left( N_i^t \text{ of } Ld_i \right)}{\sum_i \left( N_i \text{ of } Ld_i \right)}    (4.1)

N_i is the number of unique cache lines requested by a load Ld_i. Thus the denominator is simply the sum of the number of unique cache lines accessed by all loads in a given kernel. Then, for each load i that brings a given line into the cache, we categorize that cache line into one of the four warp-based locality types discussed in Section 4.2.1. N_i^t of Ld_i is the number of cache lines of load Ld_i whose access patterns fall into locality type t. We then compute the maximum across all categories t to find max_t(N_i^t) for each Ld_i; thus the numerator shows how dominant a particular access type is for each load. APS values closer to one indicate that each load instruction has a dominant locality type across the various warps executed in the kernel. Consequently, APS quantifies the consistency of the access patterns of a load instruction in a kernel.
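As a sketch of how APS could be computed offline from a cache-line trace, the fragment below tallies, for each load, the number of unique lines per locality type and evaluates Equation 4.1. The names (Locality, PerLoadCounts, compute_aps) are illustrative; this is not the instrumentation used in the simulator.

    #include <algorithm>
    #include <array>
    #include <cstdint>
    #include <map>

    enum Locality { STREAMING = 0, INTER_WARP, INTRA_WARP, INTER_INTRA, NUM_TYPES };

    // counts[t] = number of unique cache lines of locality type t fetched by one load
    using PerLoadCounts = std::array<uint64_t, NUM_TYPES>;

    double compute_aps(const std::map<uint32_t, PerLoadCounts>& loads) {  // keyed by load PC
        uint64_t num = 0, den = 0;
        for (const auto& [pc, counts] : loads) {
            uint64_t n_i = 0;
            for (uint64_t c : counts) n_i += c;                       // N_i: all unique lines of load i
            num += *std::max_element(counts.begin(), counts.end());   // max over t of N_i^t
            den += n_i;
        }
        return den ? (double)num / (double)den : 0.0;
    }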
[Figure 4.4: APS for infinite sized cache]

Figure 4.4 shows the APS of the benchmarks run with an infinite L1 data cache configuration. Overall, most applications, especially the cache sensitive applications, have an APS value close to one. The average APS for all tested applications is 0.90. Thus we conclude that each load in a GPU application has a consistent cache access pattern across all warps.

4.3 Access Pattern-Aware Cache Management

In the previous section we showed that fetched data exhibits one of four dominant locality types. Furthermore, a load instruction that fetches data of a particular locality type tends to fetch data of the same locality type for the entire kernel execution. Based on these observations we propose Access Pattern-aware Cache Management (APCM). APCM exploits the observation that each load exhibits a persistent data locality type to improve the utilization of the GPU L1 data cache. APCM first detects the data locality type of each load instruction by monitoring the cache access patterns of one exemplary warp. Since the data locality type cannot be inferred from a regular cache due to severe interference from multiple warps, APCM uses a dedicated cache tag array to track data sharing behavior from only one warp. The locality type inferred for each load in the monitored warp is then applied to the same load across all warps within a kernel with confidence, since the cache access patterns exhibit strong consistency among warps as shown in Section 4.2.3. APCM then applies a load-specific cache management scheme for each data locality type, as described below.

4.3.1 Locality-Specific Cache Management Strategies

As stated earlier, data requested by global loads exhibits load-specific locality patterns, and these loads have strong access pattern similarity, where a load exhibits a given locality type across all warps and for the entire kernel execution time. Since the effective lifetime of cache lines is characterized by data locality, we argue that locality-specific cache management strategies are desired for each load instruction.

Streaming data: As shown in Figure 4.2, streaming data occupies a significant fraction of the demanded data in CM and CI applications. This means resources like cache lines and MSHR entries are wasted when streaming data is fetched into the cache. To make matters worse, even the few cache lines with strong locality in these applications may be evicted by the streaming data. Thus the best way to treat streaming data is to bypass the L1 cache entirely and provide data to the compute cores directly from the L2 cache and its associated interconnection network.
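To make the distinction concrete, the hypothetical kernel below (not drawn from the benchmark suites used here) issues one load that is naturally streaming and one that is reused: each element of in is touched exactly once and gains nothing from caching, whereas the small table lut is re-read on every loop iteration, by the same warp and by other warps, and benefits from staying resident in L1.

    __global__ void table_lookup(int* __restrict__ out, const int* __restrict__ in,
                                 const int* __restrict__ lut, int n, int iters) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int v = in[i];                  // streaming: each element is read exactly once
        int acc = 0;
        for (int k = 0; k < iters; ++k)
            acc += lut[(v + k) & 255];  // reused: the same 256-entry table is re-read
                                        // every iteration (intra- and inter-warp reuse)
        out[i] = acc;
    }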
Inter-warp locality: Using a combination of address traces and source code analysis, we discern that the primary reason for inter-warp locality is stride accesses to large data arrays across different warps. The index of the data array accessed by each thread is typically computed using a linear function of thread IDs and/or CTA IDs [51, 87]. Threads' data addresses are merged into one cache line space; however, coalesced requests straddle two cache line regions if the addresses are misaligned. In that case a single warp access may bring in two cache lines, but only part of the second cache line is accessed by the current warp. The neighboring warp will access the leftover data in the second cache line and then fetch a new cache line, which is again only partially used. Such misaligned data accesses cause inter-warp sharing. The other reason for inter-warp locality is small data request size. Even if threads' request addresses are merged, a small data size, for example 1 byte per thread, cannot occupy an entire cache line; the neighboring warps then consume the remaining part of the cache line space. Such inter-warp locality among neighboring warps or thread blocks has also been observed before [52, 47]. Figure 4.5 shows distributions of the reuse distance, which is defined as the number of cache accesses until a cache line is reaccessed, for inter-warp and intra-warp locality type data found in SGM and BFS, respectively, with an infinite sized cache. As GPUs interleave warps in quick succession, data that is accessed by neighboring warps is in fact accessed within a short time interval, and hence inter-warp locality loads have a short reuse distance. A simple LRU policy keeps most of the inter-warp locality data in cache, so no specific cache management approach is necessary for the inter-warp locality type, as long as neighboring warps are scheduled in quick succession, which is generally the case with round-robin scheduling policies.

[Figure 4.5: Reuse distance for different locality type data]

Intra-warp locality: Intra-warp locality is the dominant type for CS applications. Cache lines allocated by loads of the intra-warp locality type are not efficiently reused, even though they are referenced multiple times. Figure 4.5 shows that the intra-warp locality type has a long reuse distance, which is a result of GPU warp schedulers interleaving warps; even instructions that are close-by in a warp are effectively separated by a large time interval. As such, intra-warp locality type data suffers frequent interference from the many accesses of other warps, leading to premature eviction of cache lines. We propose to exploit this data locality by protecting the cache lines allocated by loads of intra-warp locality type until they are mostly done with their reuse. Details of the process are explained shortly.

4.3.2 Detection of Locality Types

APCM first detects a locality type per load before applying the specific cache management policy. The locality type of cache line data can be detected based on access counts by the same or different warps. For instance, if a cache line is first allocated by warp A and then requested by warp B, the total access count for the cache line becomes two while the access count from the warp that initially allocated the cache line (warp A) is one. This cache line can then be inferred to exhibit inter-warp locality. The locality detection criteria used by APCM are summarized in Table 4.2.

  Locality type      Total access count    Access count by allocating warp
  Streaming          1                     1
  Inter-warp         N (greater than 1)    1
  Intra-warp         N (greater than 1)    N
  Inter+intra-warp   N (greater than 1)    M (less than N)
Table 4.2: Criteria of locality type decision
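The decision rule in Table 4.2 amounts to a comparison of two counters per cache line; a minimal software model of that classification is sketched below, with invented identifiers.

    #include <cstdint>

    enum class Locality { Streaming, InterWarp, IntraWarp, InterIntraWarp };

    // Classify a cache line from its two counters (Table 4.2): 'total' counts
    // hits from any warp, 'by_allocator' counts hits from the warp that
    // originally allocated the line (the monitored warp in APCM).
    Locality classify(uint32_t total, uint32_t by_allocator) {
        if (total == 1)            return Locality::Streaming;       // never reused
        if (by_allocator == 1)     return Locality::InterWarp;       // reused only by other warps
        if (by_allocator == total) return Locality::IntraWarp;       // reused only by the allocator
        return Locality::InterIntraWarp;                             // reused by both
    }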
However, detecting locality types by keeping track of the access counts to a cache line is difficult in GPUs. Access profiles stored in cache tags are frequently lost as cache lines are evicted. To solve this challenge, APCM tracks the access patterns of one warp, called the monitored warp, with a small tag array structure, called the monitor tag array (MTA). The MTA works like a private cache tag array that tracks the accesses from the monitored warp. Each tag in the MTA is augmented to collect total access counts and access counts by the monitored warp. The locality types detected from the monitored warp can be applied to other warps with high accuracy since locality types are consistent among warps from the same kernel, as discussed in Section 4.2.3. The microarchitectural details of the MTA are described shortly.

4.3.3 Protection Algorithm

As explained in Section 4.3.1, APCM applies cache bypassing and cache line protection for data allocated by load instructions detected as streaming and intra-warp locality types, respectively. Cache bypassing can be implemented simply by allocating demand requests to the injection queue of the interconnection network without perturbing the L1 data cache. Cache line protection, on the other hand, requires pinning a cache line for the lifespan of the data (from the first allocation to the last access) for optimal cache resource utilization. Counter-based or reuse distance-based cache line protection algorithms have been explored in the CPU domain; however, those methods are ineffective for GPUs due to concurrent and interfering accesses from dozens of warps. As shown in Figure 4.5, the reuse distance for intra-warp locality type data is widely distributed, so it is difficult to estimate an effective protection distance.

To estimate an accurate lifespan of cached data, APCM tracks data access dependency between load instructions. If a certain cache line is first allocated by load A and then re-referenced lastly by load B, the lifetime of the corresponding cached data can be estimated as the interval between the execution of loads A and B. Therefore, APCM also tracks the IDs (hashed PCs) of the first allocating load and the last accessing load for the monitored warp in the MTA tags, and then exploits this dependency information to estimate the lifetime of the protected lines for all other warps (more details in the next section).

[Figure 4.6: Load dependency and consumer load ID. (a) Hit by the same load. (b) Hit by a different load.]

Figure 4.6 shows two examples of load access dependency scenarios. The left-hand case represents an allocated cache line that is reaccessed by the same load executed in a loop. For this case the hashed PC of load A is logged in the First load ID field of an MTA tag and then the same hashed PC is stored in the Last load ID field. Consequently, APCM estimates that the valid lifetime of the cache lines allocated by load A ends when the loop is exited. The right-hand case shows a cache line that is re-referenced by a different load instruction. The IDs logged in the first (load A) and last (load B) load ID fields differ in this case, and APCM disables protection of the cache line after load B executes.
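The release condition described above can be stated compactly: protection ends either when the warp leaves the loop containing the allocating load (first and last IDs are equal) or when the warp completes the last consumer load (the IDs differ). The sketch below models that decision as a software approximation of the hardware described in the next section; the helper names are invented.

    #include <cstdint>

    struct Dependency {
        uint8_t first_load_id;   // hashed PC of the load that allocates the line
        uint8_t last_load_id;    // hashed PC of the last load observed to re-reference it
    };

    // Decide whether the protected lines pinned by a warp can be released.
    // 'completed_load_id' is the load the warp just finished, and 'exited_loop'
    // indicates that the SIMT stack shows the warp left the loop containing it.
    bool can_unprotect(const Dependency& dep, uint8_t completed_load_id, bool exited_loop) {
        if (dep.first_load_id == dep.last_load_id)        // reuse comes from the same load in a loop
            return completed_load_id == dep.first_load_id && exited_loop;
        return completed_load_id == dep.last_load_id;     // reuse ends after the last consumer load
    }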
4.4 Hardware Architecture

The hardware architecture of APCM is described in Figure 4.7. The modifications are primarily made to the L1 data cache access pipelines in the LDST units. The additional structures added in the figure are the monitor tag array (MTA), the cache access information table (CAIT), the load alias table (LAT) and the protection status board (PSB).

[Figure 4.7: Hardware architecture of APCM, showing the L1 data cache pipeline augmented with the Monitor Tag Array (MTA), Cache Access Information Table (CAIT), Load Alias Table (LAT), Protection Status Board (PSB) and the path for cache bypassing]

As a brief overview of the proposed hardware, the MTA is a tag array structure that tracks the access count and data access dependency for the monitored warp. The CAIT manages the detected locality types for each load instruction. The LAT converts the 32-bit PC address of load instructions to a shorter hashed ID, primarily to save storage space in the other structures that track per-load information. The PSB maintains information on which warps are currently protecting cache lines.

4.4.1 Tracking Access Patterns

APCM tracks the access history of just one monitored warp in the MTA. Figure 4.8b shows the structure of one MTA entry. There are four additional fields in each tag entry alongside the usual tag information. The First and Last load ID fields store which load instruction first allocated a given cache line and which load instruction last accessed that cache line. The Access count field stores the total number of times a given cache line is accessed by any warp (including the monitored warp), and the Intra-warp access count tracks how many times the monitored warp alone accessed that cache line.

[Figure 4.8: Additional fields in tags. (a) Data cache tag: Protection bit (1), Warp ID (6), Access count (4), tag address. (b) MTA tag: First load ID (4), Last load ID (4), Access count (4), Intra-warp access count (4), tag address.]

The process for accessing the MTA is as follows. When a global load instruction is first executed, the load PC is hashed to create a shorter load ID. The load PC and the load ID are then stored in a 16-entry content-addressable memory (CAM), the load alias table (LAT). If the load has already been executed at least once before, the LAT will have an entry for that load PC and its load ID is retrieved. Normally a GPU kernel contains a small number of global loads, so tracking the first 16 global loads is enough to capture nearly all the global memory accesses in a kernel. Results from the LAT size sensitivity study in the next section verify this claim. For simplicity, if the LAT is full then only the first 16 loads are tracked and the remaining loads are treated as normal loads.

If the load instruction originates from the monitored warp, the load address is used to generate an index into the MTA. The MTA works like the cache tag array. When a memory request from the monitored warp misses in the MTA, a tag for the address of this request is allocated in the MTA. The load ID (hashed PC of the load instruction) is also logged in the First load ID field, and the Access count and Intra-warp access count fields are each set to one. If this tag is hit by other requests after allocation, the load ID of those requests is stored in the Last load ID field.
Future accesses to this cache line from the monitored warp increment both the Access count and Intra-warp access count fields. Only the Access count field is incremented if load instructions from other warps (other than the monitored warp) hit the MTA entry. If a single warp load generates more than two memory addresses (uncoalesced loads), only the first two requests can allocate MTA tags, while the other accesses from the same load are simply dropped.

It is also possible that a request from the monitored warp hits in the L1 cache but misses in the MTA. That situation occurs when the cache line is first allocated by a warp other than the monitored warp, and a request from the monitored warp then hits in the cache. In that case the access count from the L1 cache tag is used to initialize the MTA tag. To support this case, the L1 cache tag is augmented to track an access count (the Access count field), as shown in Figure 4.8a, which simply tracks the number of times that cache line is accessed by any warp. The other two fields (Protection and Warp ID) are used for cache line protection, which will be explained later.

[Figure 4.9: Fields in an entry of CAIT: Management method (2), Access count (4), Last load ID (4), Valid bit (1)]

The CAIT manages the tracked data locality type and the data access dependency information per load. An entry of the CAIT is updated when an MTA tag is evicted due to address conflicts, or when the access count of an MTA entry exceeds a predefined threshold value. Figure 4.9 shows the content of a CAIT entry. Each entry of the CAIT is indexed by the load ID stored in the First load ID field of the MTA entry. The Management method field stores the cache management scheme selected for the corresponding load instruction. This field is determined from the Access count and Intra-warp access count fields of the MTA entry, which identify the load locality type using the criteria defined in Table 4.2. For streaming data the management method is set to bypassing, and if the load exhibits intra-warp locality the management method is set to protection; otherwise the normal cache management scheme is set. The Access count and Last load ID fields are simply copied from the MTA tag.

It is possible that after an MTA tag associated with a load ID is evicted, the same load ID may execute again from the monitored warp and reallocate a new MTA entry. Thus, before the monitored warp completes execution, an MTA entry may potentially be evicted and allocated multiple times by the same load. This is not a common scenario, but it is a corner case that must be handled. Since the CAIT entry is indexed using only the load ID, on each MTA entry eviction the CAIT entry must also be updated. We use the following simple policy: a CAIT entry is overwritten by the new information only if the Access count value in the MTA entry is larger than the Access count currently stored in the CAIT entry. Finally, after the monitored warp finishes execution, all the MTA entries are scanned and each MTA entry updates the CAIT as described above. Then all the MTA entries are invalidated and may be used again for monitoring a different kernel execution later.
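Putting the pieces of this subsection together, the fragment below is a simplified software model of how one MTA entry is updated on a hit and folded into the CAIT when it retires; the layouts follow Figures 4.8b and 4.9, but the identifiers are illustrative rather than actual RTL signal names.

    #include <cstdint>

    enum class Mgmt : uint8_t { Normal, Bypass, Protect };

    struct MtaEntry {                    // per monitored-warp tag (Figure 4.8b)
        uint32_t tag;
        uint8_t  first_load_id, last_load_id;
        uint8_t  access_cnt, intra_cnt;
        bool     valid;
    };

    struct CaitEntry {                   // indexed by first_load_id (Figure 4.9)
        Mgmt    method;
        uint8_t access_cnt, last_load_id;
        bool    valid;
    };

    void mta_hit(MtaEntry& e, uint8_t load_id, bool from_monitored_warp) {
        e.last_load_id = load_id;
        e.access_cnt++;                  // any warp increments the total count
        if (from_monitored_warp) e.intra_cnt++;
    }

    // On MTA eviction, threshold overflow, or monitored-warp completion.
    void retire_to_cait(const MtaEntry& e, CaitEntry cait[16]) {
        CaitEntry& c = cait[e.first_load_id & 0xF];
        if (c.valid && c.access_cnt >= e.access_cnt) return;    // keep the better profile
        Mgmt m = Mgmt::Normal;
        if (e.access_cnt == 1)                m = Mgmt::Bypass;   // streaming (Table 4.2)
        else if (e.intra_cnt == e.access_cnt) m = Mgmt::Protect;  // intra-warp locality
        c = { m, e.access_cnt, e.last_load_id, true };
    }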
4.4.2 Cache Management

The overall cache management works as follows. A load instruction uses its load ID to index into the direct-mapped CAIT. If the CAIT entry indexed by the load ID is valid, the cache management scheme specified in the Management method field is applied for that load. The possible cache management methods are normal, bypassing and protection. If the management method specifies normal, the load goes through the normal GPU cache access process. If a load is categorized as the bypassing type, requests from the load are assigned directly to the interconnection network without accessing the L1 cache. If the load requires protection, the allocated cache line is pinned as a protected line by setting the protection bit of the cache tag (shown in Figure 4.8a) in the L1 cache. In addition, the Warp ID field of the corresponding cache line is set to the warp ID of the load instruction.

Finally, we need a mechanism to determine when to unpin the protected cache lines. The validity of the protected lines is controlled by the protection status board (PSB). The PSB has one entry per warp. When a load instruction in a warp allocates a protected cache line, the PSB bit for that warp is set to one and the Last load ID from the CAIT is copied into the PSB entry for the warp. From then on the PSB tracks whether the load instruction mapped to the last load ID has completed execution, at which time the PSB bit is reset to zero, meaning the cache lines for that warp are no longer protected.

If the last load ID of a CAIT entry is identical to the load ID used to index that CAIT entry, the loop indication bit of the PSB entry is set in order to mark the load instruction as part of an iterative loop. In this case the protected lines are pinned until the warp escapes the loop. We use the SIMT stack [29] to track when a load instruction exits the loop. In our implementation the PSB tracks only a single protected load instruction per warp. Our empirical analysis showed that in any given warp only a single load instruction needs to be protected in practice. Hence, if the PSB entry was already set by a former load instruction in a certain warp, protection control for other load instructions is simply ignored in that warp until the PSB entry is released. In practice, this simple approach works well across all the applications studied.
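A condensed model of this control flow, covering the protect-or-bypass decision on a fill and the lazy clearing of a stale pin, is given below; it folds together the CAIT lookup outcome and the PSB bookkeeping described above, and all identifiers are illustrative.

    #include <cstdint>

    struct PsbEntry { bool protecting; bool loop; uint8_t last_load_id; };
    struct Line     { bool protect; uint8_t warp_id; /* tag, data ... */ };

    // Fill path for a load whose CAIT method is 'protection'. Returns true if
    // the line was pinned for 'warp'; false if the request proceeds without
    // pinning (handled as a normal fill, or bypassing L1 when the line is
    // still pinned by another warp).
    bool fill_protected(Line& line, PsbEntry psb[], uint8_t warp,
                        uint8_t last_load_id, bool loop_load) {
        if (psb[warp].protecting)           // one protected load per warp: later
            return false;                   // protected loads are treated as normal
        if (line.protect) {
            if (psb[line.warp_id].protecting)
                return false;               // pinned by its owner: bypass L1 (Fig. 4.10b)
            line.protect = false;           // owner released: lazily clear the stale pin
        }
        line.protect = true;                // pin the line for this warp (Fig. 4.10a/d)
        line.warp_id = warp;
        psb[warp] = { true, loop_load, last_load_id };
        return true;
    }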
4.4.3 An Illustrative Example

An example of cache line protection is depicted in Figure 4.10. This example assumes that the monitored warp has completed execution and that data fetched by a load instruction (Load A) was observed to be accessed multiple times within a loop during the monitored warp's execution. The MTA entry accessed by the address from Load A would therefore have the same load ID in its First load and Last load fields, and its Access count and Intra-warp access count fields would also be equal at the end of the monitored warp's execution. Based on this MTA entry information, the CAIT entry would be updated to indicate that Load A is characterized as an intra-warp locality load, based on the criteria in Table 4.2, and the management method for the CAIT entry indexed by Load A would have been set to protection, as is done for intra-warp loads. In this example there are two warps.

[Figure 4.10: Cache data protection control. (a) Protection of warp 0. (b) Bypassing of warp 1. (c) Unprotection of warp 0. (d) Protection of warp 1.]

First, warp 0 executes Load A. The CAIT entry indexed by that load ID is valid and its management method is set to protection. At this time the PSB entry of warp 0 is set to 1, indicating that the data fetched for Load A into the L1 cache must be protected. Once the data is fetched into the L1 cache, the protection bit in the cache line is set to one and the warp id field of the cache line is set to warp 0, as shown in Figure 4.10a. From then on, warp 0 will continue to protect the cache lines fetched by Load A until that load exits the loop.

Assume the L1 cache is direct-mapped and warp 1 then executes Load A. Since the CAIT is indexed only by load id, and is warp independent, the CAIT entry for Load A from warp 1 also indicates that the load requires protected cache lines. Load A from warp 1 tries to fetch data into the cache line but notices that the cache line's protection bit is already set to one. The warp id of that protected cache line is set to warp 0, and the PSB entry for warp 0 is still set to one, i.e., warp 0 has not exited the loop. Rather than evict a protected cache line, the data fetched by Load A from warp 1 simply bypasses the cache, as shown in Figure 4.10b. The bit for warp 1 in the PSB is not set since the request from warp 1 was not pinned.

In Figure 4.10c the repeated Load A from warp 0 hits the protected cache line multiple times and eventually the loop terminates. The bit for warp 0 in the PSB is then reset since the lifetime of the protected cache line ends. At this time, if warp 1 is still executing Load A in its loop, it will continue to attempt to allocate a cache line and protect it on each execution of the load. After warp 0 exits the loop, warp 1 may still see the cache line marked as protected. But the warp id (warp 0) stored in the cache line is used to access the PSB, which now indicates that warp 0 is no longer protecting. At this point the cache line protection bit is reset, and warp 1 allocates that cache line, sets its protection bit, and sets the warp id to warp 1. The PSB entry for warp 1 is now set to one, as shown in Figure 4.10d.

4.4.4 Hardware Cost

Table 4.3 summarizes the required hardware resources for the implementation of APCM.

  Extended in L1D   11 bits per tag, 128 lines
  MTA               36 bits per tag, 32 lines
  CAIT              11 bits per entry, 16 entries, SRAM
  LAT               32 bits per entry, 16 entries, CAM
  PSB               6 bits per warp
Table 4.3: Hardware overhead by APCM

We use CACTI 6.5 to estimate the area overhead and power consumption of the memory components based on 45 nm technology node parameters [97, 59]. The hardware cost of the CAM in the LAT is estimated based on the results in published work [104]. The other parts are modeled at the RTL level and synthesized with the 45 nm FreePDK library to estimate area and power [1]. The area required by APCM is about 4700 µm², equivalent to 0.14% of the L1 data cache area estimated by CACTI. It is also estimated that APCM adds 0.02% to the area of one SM (22 mm²), measured based on the die photo of GF100.

4.5 Evaluation

4.5.1 Methodology

We implemented APCM on GPGPU-Sim v3.2.3 [7]. The baseline configuration settings used for evaluation are listed in Table 4.4. Configuration 1 is similar to NVIDIA Fermi (GTX480) [69].
We use the first configuration as the baseline settings for most of the evaluations. Additionally, we also use Configuration 2, which is similar to NVIDIA Kepler (GTX780) [68], to test the performance of APCM on more recent architecture settings equipped with more compute cores, higher DRAM bandwidth, and the same L1 data cache size. Although APCM is evaluated on GPGPU-Sim with an NVIDIA GPU architecture, the AMD GCN architecture has a similar data fetch process for vectorized global loads, so we believe our APCM approach is also applicable to other GPGPU architectures.

  Parameter           Configuration 1            Configuration 2
  SMs                 15 SMs @ 1400 MHz          16 SMs @ 876 MHz
  SIMT width          32                         32
  Warps / SM          48                         64
  CTAs / SM           8                          16
  Scheduler           LRR, 2 per SM              LRR, 4 per SM
  CUDA cores          32 per SM                  192 per SM
  Register file       128 KB                     256 KB
  L1 data cache       16 KB, 128 B/line, 4-way, LRU, 64 MSHRs (both)
  L2 cache            768 KB, 8-way              1536 KB, 16-way
  DRAM                384 b @ 924 MHz            384 b @ 1750 MHz
  GDDR5 timing [37]   tRP=12, tRC=40, tRAS=28, tRCD=12, tRRD=5.5, tWR=12, tCL=12, tCL=4 (both)
Table 4.4: GPGPU-Sim baseline configurations

The basic configuration of APCM is shown in Table 4.5. The MTA is a direct-mapped tag array with 32 entries. The depth of both the LAT and the CAIT is set to 16; therefore, data requested by up to 16 different global loads is tracked and managed. We studied the impact of using just cache bypassing, just protection, or a combination of both with APCM, in order to isolate the performance impact of each cache management scheme.

  Parameter      Configuration
  MTA            32 entries, direct-mapped
  LAT & CAIT     16 entries
  Cache Mgmt.    bypass / protection
Table 4.5: Basic APCM configuration

We used the benchmarks introduced in Section 4.2 to evaluate APCM. We simulated all applications until the end of their execution or until the executed instruction count reached one billion. One exception is KMN, which was simulated until the completion of the first kernel execution because the trailing kernels have only texture loads that access the texture cache.

4.5.2 Performance

[Figure 4.11: Performance of APCM normalized to the baseline Configuration 1]

Figure 4.11 shows the performance of APCM normalized to the baseline Configuration 1. We tested the cache management methods by applying bypassing and cache line protection in isolation and then combining both. The protection-based approach works well for the cache sensitive (CS) applications, since intra-warp locality type data is dominant and effectively tracked by APCM for these applications. For cache moderate (CM) applications the protection mechanism alone is not effective; however, when combined with bypassing, APCM works well. This is because the CM applications have data of mixed locality types, so bypassing streaming data around the L1 cache helps keep other reused cache lines resident. In particular, MM exhibits a large ratio of intra-warp locality type data with the infinite cache model, as shown in Figure 4.2; however, the reuse distance of this intra-warp locality data is too long to track in the MTA, so APCM detects it as streaming data. Bypassing data of this type is better for performance, since the L1 cache cannot retain those cache lines even if only one warp is allowed to access the cache. For cache insensitive (CI) applications the performance is not degraded with APCM.
Overall, when both bypassing and protection are applied, the performance of CS applications increases by 34% on average. The performance of CM applications improves by up to 76% for MM and by 32% on average. The IPC of CI applications is slightly increased (4%) by APCM, since some data exhibits streaming patterns and is bypassed, and as a result some shared cache lines are kept longer before eviction. Overall, the average performance improvement achieved by APCM across all applications is 22%. Thus applications that are bottlenecked by the cache see significant performance improvements, while applications that do not rely on the cache do not suffer any performance degradation.

[Figure 4.12: Performance on different GPU configurations]

The performance of APCM was also evaluated using Configuration 2, which has more compute engines. Figure 4.12 shows the IPC of the two base configurations and their combinations with APCM, normalized to the baseline Configuration 2. With the Configuration 2 settings, which enable more massive TLP and higher DRAM bandwidth, the baseline performance increases by 69% on average compared to Configuration 1. Even with this much stronger baseline, APCM can still enhance performance through better utilization of the data cache. APCM improves the performance of Configuration 2 by 42%, which is better than the enhancement for Configuration 1 (33% for CS+CM). Due to space constraints we do not show the CI benchmarks, but they suffered no degradation in performance. This means APCM effectively resolves cache contention in more compute-intensive hardware like Kepler.

4.5.3 Cache Efficiency

[Figure 4.13: L1 cache miss rate]

Figure 4.13 shows the miss rate of the L1 data cache with APCM. The miss rate of CS applications is reduced by 15% on average compared to the baseline Configuration 1, since cache lines allocated to intra-warp locality type data are reused more effectively. As the fraction of other locality type data is very low for CS applications, the impact of cache bypassing is insignificant. On the other hand, cache bypassing of streaming data is effective for CM applications, since other shared cache lines can remain in the data cache without interference from the streaming data. For CM applications, the cache miss rate is reduced by 22% with APCM's cache bypassing compared to the baseline. Streaming data is dominant for CI applications, and memory requests of this type bypass the cache with APCM's bypassing scheme. However, the access pattern similarity, shown in Figure 4.4, is not high for some CI applications. When the access pattern similarity metric is low, as is the case for a few CI benchmarks, using a single monitored warp to detect load access patterns is not accurate, and as a result some loads may be misclassified in CI applications. Hence, even though the overall miss rate dropped, some of the misclassifications reduced the potential performance benefits for the CI applications.
4.5.4 Performance with Warp Throttling

[Figure 4.14: Performance with warp throttling methods]

[Figure 4.15: L1 cache miss rate with warp throttling methods]

Limiting the number of active warps is one of the ways proposed in the literature to alleviate cache thrashing and congestion in memory systems by blocking the issuance of memory requests from inactive warps. In addition, the state-of-the-art warp control schemes exploit cache bypassing from several warps in order to make use of under-utilized memory system resources when warp throttling is applied. APCM can also be applied on top of warp throttling methods to maximize utilization of cache resources by applying fine-grained per-load cache resource control. In order to investigate the performance impact of the per-load cache resource control by APCM, we implemented APCM on top of warp throttling and compared its performance to the state-of-the-art warp-level cache management methods.

Figure 4.14 compares the performance of warp throttling approaches and APCM. The IPC of each approach is normalized to the performance obtained by the best static warp throttling configuration (CCWS-SWL). The best static throttling scheme outperforms dynamic warp throttling methods [83]. PCAL is the priority-based warp-level cache allocation method presented in [54]. We compare the performance of the static PCAL since it shows better performance than other dynamic approaches. We implemented APCM on the baseline configuration (Configuration 1) allowing the maximum number of active concurrent warps (APCM-Baseline) and then applied APCM on top of the static warp limiting method (APCM-SWL). APCM shows better performance improvement than prior warp throttling schemes, and in fact APCM can be applied orthogonally on top of these warp throttling techniques.

On average APCM-SWL outperforms PCAL by 43% and CCWS-SWL by 49%. This is because the per-load cache management of APCM utilizes cache resources more efficiently by applying selective bypassing or pinning based on the detected locality types. Figure 4.15 compares the L1 cache miss rate of each approach, and it reveals lower cache miss rates for most applications when using APCM. In addition, APCM with warp throttling can allow more TLP since the fine-grained cache management by APCM alleviates the cache contention resulting from more concurrent warps. Hence more active warps are allowed with APCM-SWL compared to CCWS-SWL.

4.5.5 Comparison with Other Schemes

[Figure 4.16: Performance by bypassing methods]

Cache bypassing: We compared the performance of APCM with the adaptive GPU cache bypassing (ABYP) presented in [91]. ABYP monitors the evicted cache lines and collects this information in a per-load table. ABYP predicts that future requests from a load will not be reused if cache lines allocated by that load are evicted multiple times without re-reference. Even though ABYP exploits the L2 cache to avoid falsely classifying data as streaming, it can still mislabel data with strong locality as streaming data due to the limited L1 cache size.
Hence, using a monitored warp to track this information in MTA is critical. Figure 4.16 shows the performance comparison results. We first compare ABYP with APCM-bypass only. The performance of ABYP is comparable to APCM-bypass only for CS applications. In fact, for a few applications ABYP sees better performance because it may aggressively characterize some data as streaming, which will be bypassed, thereby enabling some of the intra-warp data to stay in the cache longer and improve the hit rate. Overall, the cache bypassing scheme used in APCM improves performance by 15%, which is 11% better than ABYP. However, when the full APCM scheme is enabled, APCM significantly outperforms ABYP by more than 28%.

[Figure 4.17: Performance by protection methods]

Cache line protection: APCM estimates the effective data lifetime in the cache based on load data dependency. We compared the performance with other cache line protection schemes as shown in Figure 4.17. PD denotes the static reuse distance based cache line protection method [24]. The number following PD represents the protection distance. With PD based cache protection, data in the cache line is kept until the access count for the cache exceeds the defined reuse distance after the cache line is allocated. We also tested a static protection distance scheme for APCM, where only the cache lines allocated by the load instructions marked with the protection bit in CAIT are protected (APCM-pN, where N is a protection distance). APCM-p applies our protection scheme based on load data dependency. In the figure, all IPC values are normalized to the baseline Configuration 1. The average performance improvement by our scheme is 22%, which is better than the static protection distance schemes. Since APCM protects cache data based on data dependency as described in the previous section, it can predict the lifetime of cache lines more effectively.

4.5.6 Sensitivity Studies

[Figure 4.18: Performance by warp schedulers]

Warp scheduler: Figure 4.18 shows the performance impact of using different base warp schedulers, loose round-robin (LRR) and greedy-then-oldest (GTO), with APCM. Normally it is known that the GTO warp scheduler is more effective for memory-intensive applications since it allows a single warp to continue to execute independent instructions to hide memory latency. Figure 4.18 reveals that GTO improves performance by 9% for CS and CM applications. But APCM is effective even on top of GTO and improves performance by 27% over the GTO scheduler baseline.

[Figure 4.19: Performance by L1 cache size]

Cache size: Modern GPU architectures provide flexible options for programmers to configure the size of the L1 data cache as 16KB, 32KB and 48KB [69, 68]. Figure 4.19 compares the performance improvement by APCM normalized to the different L1 cache size baselines (16KB, 32KB and 48KB per SM). For CS and CM applications APCM improves the performance by 21% and 18% on average with 32KB and 48KB L1 caches respectively. Note that the baseline performance is significantly enhanced with larger L1 cache configurations for the cache sensitive applications.
This evaluation result reveals that APCM is effective even when a larger L1 cache is applied.

In Figure 4.20 we compare the performance benefits of a larger L1 data cache and of APCM. The first three bars of each benchmark show the performance with different L1 data cache sizes (16KB, 32KB, and 48KB per SM). The last three bars of each application represent the performance of APCM applied to the different sizes of L1 cache. The performance metrics of each category are normalized to the same baseline configuration (16KB L1 cache without APCM). The average performance is improved by 21% when the L1 data cache size is doubled. On the other hand, when APCM is applied to the baseline L1 cache configuration, the performance is improved by 32%, which is higher than the performance benefit of a 2× larger data cache. Given that the hardware overhead of APCM is negligible compared to the L1 data cache (see Section 4.4.4), our evaluation reveals APCM is an effective solution to improve the performance without increasing the data cache size.

[Figure 4.20: Performance by L1 cache size (normalized to 16KB cache)]

[Figure 4.21: Performance by MTA configurations]

MTA configuration: As MTA is exploited to track the cache accesses of the monitored warp, a larger MTA may capture more access history. Figure 4.21 shows the performance change of APCM with various MTA configurations. The performance of all configurations is normalized to the performance of the basic MTA configuration with 32 entries shown in Table 4.5. "DM" means a direct-mapped tag array and "FA" means a fully-associative tag structure. "AS=N" represents an N-way-associative tag structure. Across a range of MTA sizes with varying degrees of associativity, the performance improvements are all within 5% of the baseline MTA configuration.

[Figure 4.22: Performance by LAT and CAIT depth]

LAT and CAIT depth: The depth of LAT and CAIT determines the number of different global load instructions that can be managed by APCM. We varied the LAT and CAIT sizes and the results are shown in Figure 4.22. The number following "ld" indicates the number of different loads managed by APCM. The performance metric (IPC) is normalized to the basic APCM configuration (ld16). The result reveals that the performance of APCM saturates when the depth is set at 16. Normally a GPU kernel has a small number of global loads, and hence tracking about 16 such loads seems to capture most of the benefits of APCM.

4.5.7 Energy Consumption

[Figure 4.23: Normalized energy consumption]

Figure 4.23 shows the energy consumption reduction by APCM normalized to the baseline Configuration 1. As discussed in Section 4.4.4, the power and energy consumption of APCM is estimated with CACTI and synthesized RTL models under the 45 nm technology node. The energy and power consumption of the other components of the GPU are estimated with GPUWattch [53]. For KMN, APCM consumes about 48% of the energy of the baseline machine.
Some energy savings are observed even for CI applications that only benefit from bypassing of streaming data; however, the impact of bypassing is not large because the energy consumption of the L1 data cache is a small portion of the whole GPU energy consumption [53].

4.6 Chapter Summary

General purpose applications running on GPUs suffer significant memory bottlenecks. The massive thread level parallelism in fact causes frequent cache thrashing since the size of the cache per warp is extremely small. Using detailed motivational analysis we reveal that global loads have various levels of sensitivity towards TLP. We categorize the global loads into four bins based on their cache access locality behavior. We then present a locality-specific cache data management policy for each load to improve cache efficiency. We propose APCM, a hardware-based mechanism that automatically bins loads into different categories based on the cache access history of individual loads in a predefined monitored warp. We make the observation that GPU applications exhibit strong access pattern similarity, and hence it is possible to observe the access patterns of a single monitored warp to determine the load behavior across the entire application. APCM uses information gathered from the monitored warp to determine whether a load instruction exhibits little temporal reuse, in which case that load instruction is marked for bypassing the cache. For a load instruction that exhibits significant temporal reuse, APCM protects the data fetched by the load instruction until the data reuse is complete. APCM gains a 34% performance improvement for cache sensitive applications and a 22% improvement on average across all types of applications. Our evaluation also reveals that load-specific cache management approaches are very effective even for the cache moderate applications that exhibit a mix of data locality types. Combined with previously proposed warp throttling methods, APCM significantly outperforms the state-of-the-art warp-based cache management schemes.

Chapter 5
In-Storage Indexing Mechanism

5.1 Introduction

Long data Input/Output (I/O) time is a critical performance bottleneck in data-intensive applications since these applications transfer huge volumes of data from storage devices. One approach that is routinely used to reduce unnecessary data movement is to use index structures. For instance, the basic filtering operation, frequently observed in query processing, scans the entire database and selects items that meet the requested conditions. However, the database scans can be significantly reduced if an index structure is already built on the filtering key. Then the filtering operation can be implemented using a more efficient hashing of the filtering key to access the index table, which will in turn point to the records that match the key.

While index structures reduce unnecessary data movement from storage devices, they are also fraught with several challenges. First, index maintenance generates heavy computation load and data traffic since updates to the data require updating the index data structures. Recent database studies revealed that even if 0.1% of the data is updated, index updates can take 0.3–5.2× of the original index build time using B+-tree based index structures [102]. This index maintenance cost increases exponentially as database size grows. Second, index structures themselves need I/O accesses before reaching the desired data.
Even with a multi-level index structure, such as a B+-tree, where the root nodes may be cached in the host DRAM, the remaining levels need to be accessed from storage. Therefore host systems incur additional latency to access the index structure.

On the storage front, SSDs are becoming affordable and popular [13, 74]. SSDs include an embedded processor core to run flash management functions such as garbage collection and wear-leveling. Since flash memory does not allow in-place updates, every update operation leads to a write operation to a new page. The embedded cores in an SSD use the flash translation layer (FTL) to map logical blocks to physical page locations. As such, every page access has to go through the FTL lookup in current SSDs. We propose to exploit this property to use the FTL as a distributed index structure where the data indexes for each page are maintained within the FTL entry of that page. Furthermore, we propose to marshal the embedded core to perform page-level index lookups directly while accessing the FTL entry of each page, thereby reducing the need to move large index structures to the host DRAM. The two proposals, collectively called FLash IndeXeR (FLIXR), form a novel in-SSD indexer architecture and programming model. To summarize, FLIXR utilizes the existing page-mapping tables in the FTL to automatically create and organize data indexes. FLIXR provides a programming model to create programmer-defined indexing rules, and a set of APIs to define index lookup operations to be offloaded to the SSD controller, thereby reducing index-related data movement between the host and the SSD.

FLIXR has several advantages compared to the conventional index management schemes. First, the storage processors can update the index structures on-the-fly while the data items are written or updated. Whenever a page is updated, the index structure that is associated with the FTL entry for that page is also updated. Since NAND flash memory, employed as the primary storage media on SSDs, exhibits much higher write latency compared to read, the index generation and update time can be easily hidden within the flash write process. Thus indexes are automatically updated for each page alongside its update. Second, the distributed page-level index structures are highly scalable. Since big data may grow with time and may be distributed across multiple storage devices, FLIXR's index structures also scale in a distributed manner. Finally, the data movement burden for reading index structures may also be reduced. In order to perform index filtering, the host processors typically fetch index data structures from the storage devices. As index structure size grows proportionally to database size (5–15% for B+-tree indexes), the performance cost for moving index structures from storage to host DRAM cannot be ignored.

5.2 Motivation: Criticality of Indexing

Data indexing is a critical function for accelerating big data analytics. In fact, significant research has been expended on index structures as database management systems (DBMS) have recognized the critical need for good indexing mechanisms [79, 72]. As such, we will use a DBMS as our driver for the FLIXR implementation and evaluation, although FLIXR's indexing functionality and programming model is useful for any general data indexing.

Modern database engines exploit index structures to quickly point to database entries that meet query filtering conditions. B+-Trees and multi-level index tables are popular index structures [25].
As data size grows, the cost for storing the tables and the computational load for updating index structures also escalate. Index structures themselves are stored alongside the data, and hence accessing an index requires additional storage accesses. To reduce the index I/O cost, index structures may be cached in host DRAM. But given the size of the structures, typically only a small fraction of the index tables may be cached. For instance, only the root node and a subset of the second-level nodes in a B+-Tree may be cached in the host DRAM.

The operation of the hierarchical index structures is illustrated in Figure 5.1. With the hierarchical index structures, when a query has a filtering condition (using the where clause in SQL, for example), the first-level index in the DRAM cache will point to the second (or third) level index pages that need to be accessed. The database engines read these lower-level indexes from the storage to finally identify the necessary target data pages that meet the filtering criteria. Consequently, even in the presence of a host DRAM to cache some index tables, an indexing operation typically requires additional I/O accesses (the lower-level indexes) before finally reaching the data page. We now explain how the multi-level index may be used to speed up some exemplar DBMS operations.

[Figure 5.1: Hierarchical indexing operation]

5.2.1 Scan and Filtering

Scan and filtering is a frequently observed operation in query processing. Figure 5.2 shows the SQL code of TPC-H Query 6 [94], which applies filtering conditions while scanning the database. The query defines the filtering conditions (2) on the lineitem database table. Note that the filtering conditions are defined for multiple items (l_shipdate, l_discount, and l_quantity). For the purpose of this discussion, let us optimistically assume that the database administrator has already created one primary index on the l_shipdate column and one secondary index on the l_discount column. Then the database engine first uses the primary index to winnow the records that fall within a shipping time window, and accessing the primary index itself may incur I/O accesses. The database engine then uses the secondary index to identify the intersecting records that also match the stated discount criteria. Finally, these records are then streamed to the host CPU, which further filters these records based on l_quantity before performing the computation (1).

[Figure 5.2: Scan and filtering in TPC-H Query 6]
  select sum(l_extendedprice * l_discount) as revenue
  from lineitem
  where l_shipdate >= date ':1'
    and l_shipdate < date ':1' + interval '1' year
    and l_discount between :2 - 0.01 and :2 + 0.01
    and l_quantity < :3;

In this example, there are multiple challenges faced by the database administrator. First, the administrator must identify the primary and secondary keys for index creation. Second, every time the lineitem table is updated, the two indices must be updated as well.

5.2.2 Join Processing

Join processing is one of the essential functions of database engines that allows queries on multiple database tables that share a key [88, 58]. Join processing usually requires repeated accesses to the index structures by multiple inter-related key values, and hence the I/O traffic to index structures is significantly higher than for simple filtering.
[Figure 5.3: Join processing in TPC-H Query 14]
  select sum(case when p_type like 'PROMO%'
                  then (l_extendedprice * l_discount) else 0 end)
         / sum(l_extendedprice * l_discount) as revenue
  from lineitem, part
  where l_partkey = p_partkey
    and l_shipdate >= date ':1'
    and l_shipdate < date ':1' + interval '1' month;

Figure 5.3 demonstrates an example of a query that uses join processing. TPC-H Query 14 requests the join operation with a common column (l_partkey = p_partkey) from two different database tables (lineitem and part). This query also has filtering conditions (2) for the items in lineitem. Note that filtering, as described in the prior section, is first executed to reduce the size of the intermediate table (lineitem filtered by the conditions 2 in this example) for the join processing. As the database engine performs the filtering operation for the lineitem table, the system also generates the hash table containing the filtered l_partkey values. Later, each of the unique part key items is used to index the part table. Depending on the number of unique keys, the I/O system is repeatedly accessed just to get the index values for the part table. Thus the join processing can require massive I/O requests to access the index structures repeatedly.

5.3 Flash Indexer (FLIXR)

As investigated in the previous section, the large index structures can be a significant performance burden in a DBMS even if the index structures are partially cached in the host memory. Furthermore, updating and managing these index structures is itself a computationally expensive task. To reduce these costs we present FLIXR – an efficient data indexing mechanism in SSDs.

5.3.1 Overview of FLIXR Model

FLIXR's in-storage indexing mechanism exploits the native page translation structure in the SSD's flash translation layer (FTL), as depicted in Figure 5.4. Every storage I/O command (read/write/erase) accesses the page mapping table in the FTL to get the PPA mapped to the target LBA. FLIXR implements a per-page index structure associated with each LBA. FLIXR's index structure can be implemented as an extended metadata field in each mapping table entry or as a separate data structure pointed to by an LBA. Thus when the LBA translation entry is searched, FLIXR can concurrently access the index. FLIXR supports two major indexing functions – an index creation and maintenance operation, and an index usage mechanism. We will describe these components in the following sections.

[Figure 5.4: Overview of FLIXR operation model]

5.3.2 Index Creation and Maintenance

FLIXR can create or update the page-level indexes when page data is written to the flash memory. For the data write process the host system transfers block data with the storage write commands. The transferred data is first buffered in the SSD DRAM and then written to the flash memory. For the index creation process FLIXR supports the index creation rules (1) that can be specified by database administrators, as shown in Figure 5.4.
Then the FLIXR firmware executes the index creation function, which reads the buffered page data (2) from the SSD DRAM during the flash write process. The created index data is stored in FLIXR's page-level index structures (3). The detailed FLIXR operations for index creation and maintenance are described as follows.

Table structure semantics: In order to extract indexes from a database table, FLIXR requires information about the table column structure and the index generation rules. A database administrator on the host side can use a set of FLIXR APIs to provide the FLIXR storage systems with the semantics of each database table. In particular, FLIXR provides a new NVMe command to transfer the column structure of a database table. The administrator first stores the structure of each database table, namely the number of columns, the data type and size of each column, and so on, in the host DRAM. Then the host system sends this database structure information with the database table identifier tblID using the SEND TBL COL command.

The FLIXR firmware on the SSD side processes the SEND TBL COL command as follows. It first copies the database column structure information from the host DRAM to the SSD DRAM. The transferred table structure information is stored in a dedicated array structure, called the table structure information array (TSIA). Each table's structure information is identified by the key tblID passed by the SEND TBL COL command. The above process is repeated for each table in the database, thus TSIA stores a collection of table structure information. Due to the limitation of SSD DRAM resources, FLIXR may return a success or failure indication to the host administrator after processing the SEND TBL COL command. Note that a failure to register a table in FLIXR is only a lost opportunity for performance, since the database administrator may still create a traditional host-side index structure for any table.

Offloading index creation: Next the administrator has to offload the index creation functions to the FLIXR storage systems. FLIXR provides another NVMe administration command, SEND IDX ENC, which passes the index generation function to the SSD DRAM space. Similar to the SEND TBL COL command, the SEND IDX ENC command also uses the database table id (tblID) as a parameter to associate the offloaded index creation function with the registered database table. The host administrator can define any index creation rules with the keys registered in TSIA as long as the created per-page indexes can be packed in the predefined index size. For instance, with a key that contains integer type data, the administrator may simply create a range-based index which indicates the min and the max values of the items located in a flash page. A bitmap type index can also be created from category type (for example, states, nations, and so on) or character type data [40, 98]. The per-page indexes may be created from multiple keys if the size of the created indexes is smaller than or equal to the allocated index structure space.

Once the host administrator issues the SEND IDX ENC command to the FLIXR storage devices, the index creation function in the host DRAM is transferred to the SSD. Then FLIXR stores the function pointer of the offloaded index creation function in the index encoding rule array (IERA). Similar to the table structure information, the registered index creation functions are identified by the tblID.
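As an illustration, the following C sketch shows what a simple range-based index creation function, of the kind offloaded through SEND_IDX_ENC, might look like when run over a page buffered in the SSD DRAM. The record layout (fixed-size rows with a 32-bit integer key at a known column offset), the struct names, and the function name are assumptions made only for this example; they are not the actual TSIA or firmware data structures.

  #include <stdint.h>
  #include <string.h>

  /* Assumed table-structure descriptor, standing in for a TSIA entry. */
  struct tbl_info {
      uint32_t row_size;      /* bytes per record in the page            */
      uint32_t key_offset;    /* byte offset of the indexed column       */
  };

  /* Range-based per-page index (FLIXR-I style): min/max of the key. */
  struct range_index { int32_t min, max; };

  static void flixr_build_range_index(const uint8_t *page_buf,
                                      uint32_t page_len,
                                      const struct tbl_info *tsia,
                                      struct range_index *idx)
  {
      idx->min = INT32_MAX;
      idx->max = INT32_MIN;
      for (uint32_t off = 0; off + tsia->row_size <= page_len;
           off += tsia->row_size) {
          int32_t key;
          memcpy(&key, page_buf + off + tsia->key_offset, sizeof(key));
          if (key < idx->min) idx->min = key;
          if (key > idx->max) idx->max = key;
      }
      /* The resulting {min, max} pair fits within the 8-byte per-page
       * index field and can be written into the FTL mapping entry while
       * the flash program operation is still in flight. */
  }

Because the scan touches only data already buffered for the write, a rule of this shape adds little work beyond the flash program latency it hides behind.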
Automatic index creation: The FLIXR firmware can automatically create the per-page indexes once the table key structures and the index creation functions are registered in TSIA and IERA. In order to support index creation concurrent with the storage write process, FLIXR provides a new NVMe I/O command called WRITE IDX that initiates the index creation when the database table data is transferred to the SSD and written to the flash memory. The WRITE IDX command requires the tblID as a parameter. Once the host database system issues the WRITE IDX command, the SSD firmware first buffers the transferred table data in the SSD DRAM. Then the firmware issues the page write commands. Concurrently, the FLIXR firmware initiates an index generation process by executing the offloaded index creation function identified by the tblID. While scanning the table data in the buffered page, it parses the column entries in the table using the registered table structure information identified by the tblID. The FLIXR firmware executes the index creation function pointed to by the tblID to generate the per-page indexes.

5.3.3 Exploiting Indexes for FLIXR Computations

In this section we describe how FLIXR's indexes are exploited to significantly improve data access speeds. We use the embedded FTL indexes to perform filtering operations on the data at the page granularity. Recall that database queries specify filtering conditions using the where clause in SQL. FLIXR can perform the where clause filtering directly at the page granularity without accessing separate index structures as is done traditionally in database systems. For the page-level filtering process FLIXR supports the index comparison rules (4) which specify the filtering rules based on FLIXR's created page-level indexes (see Figure 5.4). When the host database systems access the database table from the SSD, FLIXR can apply the light-weight index comparison operations (5) by comparing the per-page index values with the previously registered index comparison rules. If the page data does not include the items that meet the filtering conditions, FLIXR cancels the flash memory read to eliminate the unnecessary page accesses. The detailed implementation is described below.

Specifying filtering conditions: FLIXR extracts the filtering conditions used in query processing. FLIXR provides a set of APIs that can be used by the query optimizer (or directly by a programmer, as we have done in our implementation) to define the query filtering rules that must be applied when reading a flash page. FLIXR provides the new NVMe administration command SEND IDX ICR to transfer the per-page index comparison rules to the SSD processor. The SEND IDX ICR command uses the tblID and an index comparison rule id (icrID) as parameters. The tblID references the database table that the query processing will be applied to. The index comparison rule of each query is identified by the icrID separately, since multiple queries can access a single database table simultaneously.

The FLIXR firmware decodes this NVMe command and registers the index comparison rules in the index comparison rule array (ICRA) in FLIXR's process space. Typically the index comparison rules are relatively simple operations, such as less than, greater than, bitmap match and so on. Once the host database system issues the SEND IDX ICR command, the index comparison rule is registered in ICRA indexed by tblID and icrID.
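As a concrete illustration, the sketch below shows one possible way a comparison rule delivered by SEND_IDX_ICR could be represented in the ICRA and checked against a page's range or bitmap index. The enum values, struct layout, and function names are illustrative assumptions rather than the actual firmware definitions.

  #include <stdbool.h>
  #include <stdint.h>

  /* Assumed in-firmware form of a registered comparison rule (one ICRA entry). */
  enum icr_op { ICR_LESS_THAN, ICR_GREATER_THAN, ICR_BETWEEN, ICR_BITMAP_MATCH };

  struct icr_rule {
      enum icr_op op;
      int32_t     lo, hi;     /* bounds for range-style predicates        */
      uint64_t    mask;       /* query bit-vector for ICR_BITMAP_MATCH    */
  };

  /* Returns true when the page may contain qualifying rows and must be read;
   * false means the flash read for this page can be cancelled outright.     */
  static bool icr_page_may_match(const struct icr_rule *r,
                                 int32_t page_min, int32_t page_max,
                                 uint64_t page_bitmap)
  {
      switch (r->op) {
      case ICR_LESS_THAN:    return page_min <  r->hi;
      case ICR_GREATER_THAN: return page_max >  r->lo;
      case ICR_BETWEEN:      return page_max >= r->lo && page_min <= r->hi;
      case ICR_BITMAP_MATCH: return (page_bitmap & r->mask) != 0;
      }
      return true;            /* unknown rule type: fall back to a normal read */
  }

Note that the check is deliberately conservative: a page is only skipped when its index proves that no row can satisfy the predicate, so false positives cost an ordinary page read while false negatives cannot occur.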
Applying page-level filtering: While the host database systems access the database tables on the storage systems for query processing, the FLIXR firmware performs the page-level index comparison operations using the created page indexes associated with the page mapping entries. To enable this page-level filtering process FLIXR provides the READ IDX NVMe command. This new command requires the additional parameters (tblID and icrID) along with the LBA range information for page accesses. The tblID is an identifier of the database table specified by the from clause in SQL. For exploiting the page-level filtering by FLIXR the index comparison function identified by the icrId is set through the SEND IDX ICR command before accessing the table page data. FLIXR calls the index comparison function to check whether the indexes of the target page meet the registered filtering condition or not. During the page translation process for page read, the FLIXR firmware can easily access the page-level indexes without 125 additional index structure scanning since the indexes are associated with the LBAs of the page mapping table. The FLIXR firmware simply performs the light-weight comparison operations for the target page indexes. If the target page contains the items that meet the filtering condition, FLIXR allows the normal flash read process and returns the fetched page to the host system. Otherwise, the FLIXR firmware cancels the page read process to prevent the SSD from fetching the unnecessary page data. Unlike traditional B+-Tree index where each index may precisely point to a set of records, FLIXR index only shows what is the range of key values which are contained within the page. While this approach may appear to have reduced index precision we note that SSDs are page level access devices. Hence, even if a particular database entry is precisely identified by a B+-Tree index eventually the entire page must be read from the flash memory before one can access the record. By exploiting this fact we reduce the index precision but without any increase in the number of page reads that may be performed later during the reading process. 5.3.4 Supports for Join Processing For the join processing the key items of one table are compared against the key values in other tables. In the example of TPC-H Query 14 (see Figure 5.3) two database tables (lineitem and part) have the same type of key (partkey) in both tables. This query that requests the common partkey items in both tables requires the join processing. Since 126 this join processing requires multiple filtering conditions (multiple partkey items from lineitem), it requires massive accesses to the index structure. In order to reduce data traffic by the join processing FLIXR supports per-table in- dexing, which enables the page filtering based on the indexes generated by the scan and filtering operation for the primary table. We take an example of the per-table indexing operation by FLIXR in Figure 5.5. Let us assume the bitmap-based per-page indexes are created for two keys (Key 0 and Key 1) in each table and both tables have the Key 0 in common. The query first requests the filtering operation by the conditions for the Key 1. The index comparison rule created by SEND IDX ICR compares the per-page indexes for the Key 1. In this example, let us assume that page 0 includes the items that meet the filtering condition, thus the page 0 is fetched to the host. 
On the other hand, the Key 1 index in page 1 does not overlap with the comparison rule, thus page 1 is filtered out by FLIXR. As the join processing will be performed on the Key 0 of both tables, the FLIXR firmware creates the per-table index for the primary table using the Key 0 index from page 0. Note that the index in page 1 is not included in the per-table index since page 1 is filtered out. Once the filtering operation completes for the primary table, FLIXR executes the per-table index comparison process. In this example, the bitmap of the created per-table index does not overlap with the Key 0 index in page 2 of the secondary table. Thus page 2 of the secondary table is filtered out by the per-table index comparison. Now we will describe how the join processing is supported.

[Figure 5.5: An example of per-table indexing by FLIXR]

Offloading per-table index creation functions: Like the per-page index creation process, the database administrator can offload per-table index creation functions to the SSD using the new NVMe command SEND TIDX ENC. The offloaded index creation function is registered in the table index rule array (TIRA) indexed by tblID and tiaID. Thus the SEND TIDX ENC command requires these two identifiers as parameters.

Per-table index creation: The per-table indexes are created while the FLIXR firmware executes the page filtering operation for the primary table. In order to support the concurrent per-table index generation, FLIXR provides an NVMe page read command for the join processing. The READ IDX JOIN1 command enables the per-page filtering operation for the primary table and the per-table index generation for the join processing. Hence, this command requires the table id (tblID) of the primary table, the identifier for the per-page index comparison (icrID), and the per-table index creation id (tiaID) as parameters. The created per-table index is stored in the table index array (TIA) in the FLIXR firmware.

Exploiting per-table indexes: While the database systems read the secondary table's pages for the join processing, FLIXR can execute the page filtering process using the created per-table indexes. Similar to the per-page filtering, the administrator can register the page-level filtering rules if necessary, as well as the filtering rules that exploit the per-table indexes. Once the filtering rules are registered, the host system reads the secondary table with the READ IDX JOIN2 command. This command enables the page-level filtering for the secondary table by comparing the page-level indexes of the secondary table with the per-table index generated by the READ IDX JOIN1 command. For this operation, the READ IDX JOIN2 command needs the tblID1 and tiaID to point to the per-table index, and tblID2 and icrID to identify the comparison rules for the secondary table.
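The following sketch condenses the per-table indexing flow described above into two small helper routines, assuming a 64-bit bit-vector per page and per table as in the Figure 5.5 example. The structure and function names are placeholders for illustration and do not correspond to the actual TIA implementation.

  #include <stdbool.h>
  #include <stdint.h>

  /* Assumed per-table index entry (one TIA slot). */
  struct table_index { uint64_t join_key_bitmap; };

  /* READ_IDX_JOIN1 path: whenever a primary-table page passes the per-page
   * filter and is returned to the host, OR its join-key bitmap into the
   * per-table index so the TIA records every key range seen in valid pages. */
  static void tia_accumulate(struct table_index *tia, uint64_t page_join_bitmap)
  {
      tia->join_key_bitmap |= page_join_bitmap;
  }

  /* READ_IDX_JOIN2 path: a secondary-table page is fetched only if its
   * join-key bitmap intersects the per-table index built from the primary
   * table; otherwise its flash read is cancelled.                          */
  static bool tia_page_may_join(const struct table_index *tia,
                                uint64_t page_join_bitmap)
  {
      return (tia->join_key_bitmap & page_join_bitmap) != 0;
  }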
5.3.5 Extended NVMe Commands for FLIXR

We extend the NVMe commands to support FLIXR operations as listed in Table 5.1. The new NVMe commands for FLIXR support FLIXR resource control, offloading of index creation/comparison functions, and data I/O with page-level filtering. We exploit the NVMe administration command packets to provide the SSD processors with the required information for the FLIXR operations. In order to support multiple query processing requests, the FLIXR firmware manages the offloaded parameters in the array structures described in the previous section. Once the index creation or filtering rule sets are registered in the FLIXR firmware, the administrator accesses the database tables with FLIXR I/O commands. The required parameters for each FLIXR NVMe command are also listed in the table. We now briefly summarize the FLIXR operations using the extended NVMe commands.

Table 5.1: New NVMe commands for FLIXR
  Command            Type  Description                                              Parameters
  FLIXR resource control
  GET FLIXR STATUS   Adm   Get FLIXR resource status
  ALLOC FLIXR        Adm   Allocate FLIXR resources                                 tblID, icrID, tiaID
  FREE FLIXR         Adm   Free FLIXR resources                                     tblID, icrID, tiaID
  Sending indexing rules
  SEND TBL COL       Adm   Send table column item structure                         tblID
  SEND IDX ENC       Adm   Send per-page index generation rule                      tblID
  SEND IDX ICR       Adm   Send per-page index comparison rule                      tblID, icrID
  SEND TIDX ENC      Adm   Send per-table index generation rule (for join processing)  tblID, tiaID
  In-SSD indexing & filtering with I/O
  WRITE IDX          I/O   Write pages with index generation                        tblID
  READ IDX           I/O   Read pages with index comparison                         tblID, icrID
  READ IDX JOIN1     I/O   Read pages with index comparison and generate per-table indexes  tblID, icrID, tiaID
  READ IDX JOIN2     I/O   Read pages with index comparison using per-table indexes  tblID1, tblID2, icrID, tiaID

Resource control: The FLIXR firmware manages the multiple index generation/comparison rule sets in the array structures described in the previous section. Hence the database administrator needs to check whether the FLIXR firmware has available resources for the rule sets. After that, the administrator can allocate the array structure resources for new rules. To support FLIXR resource control we define new administration commands for checking, allocating, and freeing resources. The host database engine gets the available array resources in the FLIXR firmware with the GET FLIXR STATUS command. If the FLIXR firmware has enough resources, the administrator can allocate the arrays for indexing rule sets with the ALLOC FLIXR command. The administrator can release the allocated resources with the FREE FLIXR command if the allocated array resources are no longer necessary.

Write with indexing: For the data write process with index creation, the administrator first transfers the database table structure information (SEND TBL COL), and then registers the index creation functions in the FLIXR firmware with the SEND IDX ENC command. Then the database system writes the table data with the WRITE IDX command to enable the automatic page-level index generation.

Read with filtering: As described in the previous section, the administrator first offloads the index comparison rules extracted from the query requests (SEND IDX ICR). Then FLIXR enables the page-level filtering when the database engine accesses the tables with the READ IDX command.

Read for join processing: For the join processing, the administrator also registers the per-table index creation rules with the SEND TIDX ENC command.
Then, the database engine accesses the primary table by enabling the per-table index creation and the per-page filtering (READ IDX JOIN1). After that, the database engine reads the secondary table with the READ IDX JOIN2 command to enable the page-level filtering with the created per-table indexes.

Based on the described FLIXR operations supported by the extended NVMe commands, we design the FLIXR API that supports the FLIXR programming model for in-SSD indexing and filtering. The normal file read/write operations of database applications can be ported seamlessly to the FLIXR programming model with the FLIXR API listed in Table 5.2.

Table 5.2: FLIXR API
  API function   Description
  allocFlixr()   Allocate FLIXR array entries
  freeFlixr()    Free FLIXR array entries
  writeIndex()   Write a table with indexing
  readFilter()   Read a table with page filtering
  readJoin1()    Read a table for join processing (primary table)
  readJoin2()    Read a table for join processing (secondary table)

5.3.6 Cost Overhead

FLIXR creates the indexes per page and associates these per-page indexes with each logical block; thus FLIXR requires additional storage space for the per-page indexes. In this work we set the size of the index structures to 8 bytes per page by default. Note that FLIXR's per-page index structure can include multiple index keys as long as the indexing items are packed within the pre-assigned size. Hence more indexing items can be packed in the per-page indexes as larger space is assigned to the FLIXR indexes. Assuming the page size of the NAND flash memory is 16 KB, the storage overhead of FLIXR's per-page indexes is approximately 0.05% of the entire storage space with the default configuration. That means the per-page indexes occupy about 1 GB out of a 2 TB SSD. Although FLIXR's index structure size is very small compared to the large SSD storage space, the index structure may occupy SSD DRAM space that is shared with other functions such as FTL functions and a page buffer. The DRAM space consumed by the index structures may become a performance burden even though recent enterprise-level SSDs are equipped with larger DRAM buffers. Thus FLIXR exploits the assigned DRAM space as a cache for the page indexes; namely, the indexes of hot pages are cached in the DRAM space and those of cold pages are loaded into DRAM when requested. This demand-based approach has been widely researched for FTL mapping table management [56, 105].

5.4 Evaluation

[Figure 5.6: Evaluation platform. (a) SSD controller architecture on FPGA; (b) SSD development board]

We implemented FLIXR on the Cosmos+ OpenSSD board – the open-source SSD research and development platform [73, 89]. Figure 5.6a depicts the architecture of the SSD controller on the OpenSSD platform. The OpenSSD board contains a Xilinx Zynq-7000 programmable SoC with a dual-core ARM Cortex-A9 application processor [100]. The hardware platform on the programmable logic cells supports up to 8 NAND flash channels and 8 ways per channel. The OpenSSD board communicates with the host system via a PCIe Gen.2 ×8 interface. The host system uses an Intel i7 CPU running at a 4 GHz clock and 16 GB of DRAM.
Table 5.3 lists the specification of the SSD development platform and the host system.

Table 5.3: Evaluation platform configuration
  OpenSSD platform
    Processor        Dual-core ARM Cortex-A9 @ 1 GHz
    FPGA             Xilinx Zynq-7000 (350K logic cells)
    DRAM             On-board 1 GB DDR3-1066
    NAND flash       8 channels, 8 ways/channel, 2 TB MLC
    Interconnection  AXI-lite (command) and AXI (data) bus
    Protocol         NVM Express 1.1
    Host interface   PCIe Gen.2 ×8 (max. 4 GB/s)
  Host system
    CPU              Quad-core Intel i7-4790 @ 4 GHz
    DRAM             16 GB DDR3-1600

The SSD controller of the OpenSSD platform runs on the FPGA SoC, thus users can program the hardware logic as well as the controller firmware. The controller firmware consists of the NVMe command decoder, page buffer management, and flash translation layer functions running on the embedded processor. The NVMe host interface and the flash memory channels in the OpenSSD controller are designed with hardware logic to guarantee rapid data handling. Accelerated by these hardware engines, the OpenSSD supports up to 1.38 GB/s of sequential read data bandwidth. Although this performance metric is lower than that of commercial high-end NVMe SSDs, the read bandwidth supported by the OpenSSD is similar to mid-range NVMe SSDs and much higher than SATA SSDs [26]. Hence, the OpenSSD evaluation platform can reflect real SSD systems. In addition, it is also possible to emulate various performance ranges by adjusting the host system's performance or changing the configurations of the SSD controller. OpenSSD assigns 16 MB of DRAM space as page buffers, and this space is relatively small compared to the total SSD DRAM size. FLIXR's index structures do not sacrifice this page buffer space, thus the FLIXR implementation does not change the performance of normal page transfers.

We use two different types of database applications to evaluate FLIXR. In order to study FLIXR's index maintenance performance, we tested the NewOrder transactions provided by the TPC-C benchmark [93]. The selected transactions include frequent database writes and updates along with query processing. Thus, frequent index updates are also requested for precise indexing. We configured the database tables with the number of warehouses set to 50. The size of the created tables is 4.75 GB. We evaluated the page-level index filtering performance of FLIXR using the TPC-H benchmark suite [94]. We study a wide range of queries (queries 1, 4, 6, 12, 14, and 17) generated by the TPC-H query generation engine. Queries 1 and 6 perform the aggregation function with the filtering operations. The other queries include join processing with inter-table relation conditions. Additionally we composed simple filtering and join processing kernels similar to the example queries mentioned in Section 5.2. These basic kernels are frequently observed in database queries, thus performance for such kernels is critical for database systems.
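For reference, the snippet below sketches how the simple filtering kernel might be expressed against the FLIXR API of Table 5.2. Since the text lists only the API names, the argument lists, return values, and file name used here are assumptions made for illustration.

  #include <stddef.h>

  /* Assumed prototypes for the API calls named in Table 5.2; the actual
   * signatures are not given in the text, so these are placeholders.     */
  int  allocFlixr(int *tblID, int *icrID, int *tiaID);
  void freeFlixr(int tblID, int icrID, int tiaID);
  int  writeIndex(int tblID, const char *table_file);
  int  readFilter(int tblID, int icrID, void *out, size_t out_len);

  /* Hypothetical host-side flow for the simple scan-and-filtering kernel. */
  int run_scan_filter_kernel(void)
  {
      int tblID, icrID, tiaID;
      static char out_buf[1 << 20];       /* destination for qualifying pages */

      if (allocFlixr(&tblID, &icrID, &tiaID) != 0)
          return -1;                      /* no FLIXR resources: fall back    */

      writeIndex(tblID, "lineitem.tbl");  /* WRITE_IDX: write table with
                                             per-page index generation        */
      readFilter(tblID, icrID, out_buf, sizeof(out_buf));
                                          /* READ_IDX: only pages whose index
                                             satisfies the registered rule
                                             are fetched from flash           */
      freeFlixr(tblID, icrID, tiaID);
      return 0;
  }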
In order to compare the performance of FLIXR with the conventional host-oriented index structures, we implemented a B+-Tree index structure and its management schemes processed in the host side. For the conventional host-side indexing, the host processor performs all index-related operations, such as creation, updates and traversing of in- dexes. This host-side B+-Tree index operates as follows. Initially the database tables are generated by the table generation engines, and then the host engine scans the stored database tables to create index structure. The generated index structures are stored in the SSD. When the host database engine requests query processing, the host database system fetches the B+-Tree index structures from the SSD. The fetched indexes are buffered in the host DRAM for the future query processing. If the host engine updates the database tables, the host processor updates the existing index structures and writes to the SSD. If the updated indexes are already cached in the host DRAM, the updated parts are written back to the SSD. We call this mechanism as Host-Side Index in the evaluation section. 136 We implemented FLIXR with two types of indexing mechanisms – simple range- based indexing (called FLIXR-I) and bitmap-based indexing (called FLIXR-IB). For FLIXR- I the per-page indexes are generated only for the keys that contain integer-type items. FLIXR-I creates the per-page index fields by computing the minimum and the maximum values of the key items stored in a single flash page. With FLIXR-I the host database system receives the table pages where the ranges of key items overlap with the filtering conditions of query requests. Another indexing mechanism (FLIXR-IB) utilizes bitmap type indexes where each bit field can indicate a certain range of values (for numbers), a character, or an item of categories [40, 98]. Using the bitmap type indexes, FLIXR-IB also creates the per-table indexes to support join processing in the SSD. 5.5 Experimental Results 5.5.1 Index Maintenance Performance 0 0.5 1 1.5 2 2.5 3 Baseline Host-Side Index FLIXR Normalized Performance Figure 5.7: Index maintenance performance Figure 5.7 shows the index maintenance performance by FLIXR. As described in the previous section we used the TPC-C benchmarks that update database tables frequently. We compared FLIXR with the host-side B+-tree indexing and the baseline configuration 137 which does not utilize any indexing mechanism. Our evaluation shows FLIXR outper- forms the host-side indexing by 21.7%. Note that the host-side index maintenance for large database tables is a heavy burden for host-side computation and storage I/O as men- tioned in Section 5.2. On the other hand, FLIXR updates index structure on-the-fly while page data is written to the flash memory. Hence, FLIXR can manage index structures with lower performance overhead even if database tables are updated frequently. 5.5.2 Query Processing Performance 0 0.5 1 1.5 2 2.5 Scan Join Q1 Q6 Q4 Q12 Q14 Q17 GMEAN Speedup Baseline Host-Side Index FLIXR-I FLIXR-IB Figure 5.8: Query processing performance Figure 5.8 compares the performance of query processing by the baseline configu- ration (a database system without indexing), the host-side indexing and FLIXR. Among the evaluated query requests, queries 6, 12, and 14 include the filtering conditions for multiple keys. For those queries, we apply the index filtering using both the primary and the secondary keys for FLIXR and the host-side indexing. 
Note that FLIXR can generate indexes for multiple keys if the created indexes are packed within the reserved per-page index fields. We will show the performance sensitivity to the number of indexing keys in Section 5.5.4.

Overall, FLIXR-IB achieves the best performance, which is a 1.81× speedup compared to the baseline. Even without the accurate value distribution comparison, FLIXR-I achieves a 1.60× performance improvement compared to the baseline. The experimental results reveal that the FLIXR framework effectively improves the system's performance for a wide range of query processing applications. The host-side index achieved a 1.42× performance improvement compared to the baseline. FLIXR-IB and FLIXR-I show 27.2% and 14.0% better performance than the host-side index, respectively. The performance gain becomes more significant with FLIXR-IB since it supports more index generation capability with its bit-vector structures. Hence the performance gain of FLIXR-IB is 12.6% over FLIXR-I.

As page data traffic is winnowed efficiently by the FLIXR framework, the I/O time of the applications decreases and even the CPU processing time is reduced. This is because FLIXR largely reduces the processing overhead in the host CPU through the filtering operation in the SSD. Thus in our experiments, FLIXR-IB reduces the I/O time and the CPU processing time by 28.8% and 17.0%, respectively.

As FLIXR-I supports page indexes, it works well for query computations that require only the simple scan and filtering operation. Thus the FLIXR-I architecture is effective for queries 1 and 6, which rely on the filtering operation. However, the join processing kernel, which we composed for these experiments, and queries 4, 12, 14, and 17 require join processing using multiple database tables. In these queries, FLIXR-I filters a large portion of pages in the first table, but the second table requires the data-related conditions from the first table. Thus reading the second table does not benefit from the simple page-level filtering, as described in Section 5.2.

FLIXR-IB supports more index ranges with the bitmap type indexing and the per-table indexes for join processing. Hence it improves the performance of the query processing more compared to FLIXR-I. In particular, FLIXR-IB enables the join-filtering for the second tables, thus it improves the performance of the join processing kernel and queries 4, 12, 14, and 17 by 72.9% over the baseline and by 14.8% over FLIXR-I.

Query 17 additionally includes filtering conditions that require string comparisons, unlike the remaining queries. FLIXR can also generate the per-page index table with those string values and efficiently fetch only the valid pages. As a result, FLIXR-IB reduces the amount of data fetched from the SSD by 84.6% compared to the baseline, as shown in Figure 5.9 (details of the overall data traffic are in Section 5.5.3). However, the string comparisons require more computation than the numeric value comparisons. Given the limited performance of the processor in the SSD, the latency of index comparison and I/O command processing increases. So FLIXR can achieve a 2.25× speedup, which is less than the ratio of data movement reduction. Note that the host-side index achieved a performance gain similar to FLIXR (a 2.23× speedup) for query 17, due to the index access and I/O command overhead in the host processors.
140 0 0.2 0.4 0.6 0.8 1 Scan Join Q1 Q6 Q4 Q12 Q14 Q17 AVG Normalized Fetched Data Size Baseline Host-Side Index FLIXR-I FLIXR-IB Figure 5.9: Data traffic from storage 5.5.3 Storage I/O In order to study FLIXR’s impact on the data traffic in the SSD platform, we also measure the amount of page data fetched from the SSD as shown in Figure 5.9. The fetched data size is normalized to the baseline setting. Since the SSD transfers the page data that meets the query conditions, the amount of transfer data size is decided by the filtering conditions and the indexed fields. The data traffic from the SSD is reduced more with FLIXR-IB since the bit-vector logging structure of FLIXR-IB can store more indexes in limited metadata filed. Additionally TIA of FLIXR-IB enables filtering out more table data for join processing observed in queries 4, 12, 14, and 17, thus it reduces 42.8% of data traffic from the SSD more compared to FLIXR-I. Overall, the average data traffic reduction is 39.6% and 45.9% by FLIXR-I and FLIXR-IB respectively. In case of the host-side index, it reduces the date traffics by 48.0% compared to the baseline. With these results, we observe that FLIXR and the ex- isting host-side index show a similar impact on data traffics reduction. However, FLIXR achieves more performance gain by reducing the overhead of index accesses and compu- tations in the host processors. 141 5.5.4 Performance Effect of Secondary Key Indexing 1.0 1.2 1.4 1.6 1.8 2.0 Host-side FLIXR-I FLIXR-IB Speed up Primary only Primary + Secondary (a) Performance 0.0 0.2 0.4 0.6 Host-side FLIXR-I FLIXR-IB Normalized storage I/O Primary only Primary + Secondary (b) Storage I/O Figure 5.10: Performance and data traffic with the secondary key In this section, we analyze the performance impact secondary key indexing on FLIXR. Among the TPC-H queries used in our evaluation, query 6, 12, and 17 include the filter- ing conditions using multiple key values. Therefore, the secondary key indexing can be applied for these queries. For example, TPC-H Query 6 reads three columns in lineitem table as mentioned in Section 5.2.1 and the indexes of those columns are available. We can use the values of l shipdate column, which has the widest value range among the referenced columns, as the primary key and the values of l discount and l quantity as the secondary keys. We compared the performance of the host-side index and FLIXR using only primary keys and those using primary keys and secondary keys. Figure 5.10 shows the experimental results. Query processing with secondary key indexing achieved performance improvement in all configurations. FLIXR-IB achieved the best performance, which is 96.6% performance improvement compared to the base- line without an index. Compared to the host-side index, FLIXR-IB shows 28.2% better performance. It is also 23.2% and 7.1% better performance than FLIXR-IB with primary key indexing and FLIXR-I with secondary key indexing, respectively. 142 In fact, FLIXR-IB and the host-side index fetch the similar amount of pages in SSD as shown in Figure 5.10b. Compared to the data fetched in the baseline, FLIXR-IB and the host-side index reduce the amount of data fetched from SSD by 53.2% and 54.7%, respectively. However, FLIXR-IB could achieve better performance because it reduces the overhead of accesses of the index structures. As mentioned in Section 5.5.1, the host- side index has more overhead of computations and I/O commands than FLIXR. 
We compare the performance of the host-side index and FLIXR using only primary keys against the same configurations using both primary and secondary keys; Figure 5.10 shows the experimental results. Query processing with secondary key indexing improves performance in all configurations. FLIXR-IB achieves the best result, a 96.6% performance improvement over the baseline without an index. Compared to the host-side index, FLIXR-IB performs 28.2% better. It also performs 23.2% and 7.1% better than FLIXR-IB with primary key indexing only and FLIXR-I with secondary key indexing, respectively.

In fact, FLIXR-IB and the host-side index fetch a similar amount of page data from the SSD, as shown in Figure 5.10b. Compared to the data fetched in the baseline, FLIXR-IB and the host-side index reduce the amount of data fetched from the SSD by 53.2% and 54.7%, respectively. However, FLIXR-IB achieves better performance because it reduces the overhead of accessing the index structures. As mentioned in Section 5.5.1, the host-side index incurs more computation and I/O command overhead than FLIXR. Even though secondary key indexing reduces such overhead in the host system, the host-side index still limits the performance improvement. In contrast, FLIXR accesses the index structure and fetches only the pages that contain valid data without such overhead, and therefore achieves better performance than the host-side index.

5.5.5 Performance by Computation Power of the Host CPU

Figure 5.11: Performance by throttling the host processor (speedup of Baseline, Host-Side Index, FLIXR-I, and FLIXR-IB at relative host CPU performance from 100% down to 50%)

By offloading the filtering and join-processing load to the existing embedded processor in the SSD, FLIXR can be a cost-effective way to improve the performance of an overall system equipped with a less powerful host processor. To investigate this potential, we study the performance uplift provided by FLIXR under slower clock frequency configurations of the host CPU.

Figure 5.11 shows how average performance changes with the clock frequency of the host CPU. The baseline clock frequency is set to 4 GHz, and we measure the performance of the benchmark applications while scaling the host CPU clock frequency down in 10% steps to 50%. All results are normalized to the performance of the baseline setting running at 4 GHz. When the CPU clock is set to 2.0 GHz (50% of the default configuration), FLIXR-I and FLIXR-IB achieve 107.5% and 124.4% of the baseline performance, respectively. In other words, a FLIXR-equipped system can deliver better performance for database processing applications even with an approximately 50% slower CPU.

The host-side index achieves only 93.7% of the baseline performance with a 50% slower CPU. With the host-side index, the host processors perform both the index accesses and the filtering computations, and this overhead causes significant overall performance degradation on slow host processors.

Our experimental results imply that datacenter systems can be configured more cost-efficiently with FLIXR. For instance, the Intel Xeon E5-2660 and E5-2650 server processors share the same Broadwell architecture, and the CPU benchmark score of the E5-2660 is 10% higher [77]. However, the reported market price of the E5-2660 is 44% higher than that of the E5-2650, which means a 10% performance uplift costs hundreds of dollars for server processors. On the other hand, the FLIXR framework can be implemented on existing SSD platforms that already include general-purpose embedded processors and DRAM, so FLIXR can boost the performance of data analytical applications without a significant hardware cost increase. Our results reveal that FLIXR with economical host processors can achieve similar or even better performance than systems equipped with expensive high-end CPUs.
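As a back-of-the-envelope illustration of this cost argument, using only the relative figures quoted above (a 10% higher benchmark score at a 44% higher price) and no other assumptions, the performance obtained per unit price of the two parts compares as

\[
\frac{\text{perf}/\text{price (E5-2660)}}{\text{perf}/\text{price (E5-2650)}} \;=\; \frac{1.10}{1.44} \;\approx\; 0.76 ,
\]

that is, the faster part delivers roughly 24% less performance per dollar, whereas the 1.60× to 1.81× speedups reported for FLIXR come from embedded hardware that is already present in the SSD.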
5.5.6 State Space Exploration: Internal Bandwidth

Figure 5.12: Performance by SSD's internal bandwidth (speedup of Baseline, Host-Side Index, FLIXR-I, and FLIXR-IB at 0.25×, 0.5×, 0.75×, and 1× of the internal bandwidth)

The SSD's internal data bandwidth, namely the maximum data bandwidth supported inside the SSD, saturates at the internal flash channel capacity if the host system requests a massive number of pages within a short time window. We can emulate internal bandwidth saturation simply by adjusting the internal data bandwidth on the OpenSSD platform; to do so, we change the number of active channels or ways of the NAND flash [73]. Figure 5.12 exhibits the performance changes under different internal bandwidth settings. The 1× point on the X-axis represents our default setting for the NAND flash memory (8 channels and 8 ways per channel). We measure the execution time of the benchmarks while changing the number of flash channels from 2 to 8. The performance of FLIXR is normalized to the corresponding baseline with the default SSD controller under each channel configuration. We observe that average performance improves by 65.5% when only 2 flash channels are enabled on the OpenSSD platform. FLIXR improves system performance under a saturated internal bandwidth environment because it trims unnecessary flash read requests whenever the indexes of a page do not meet the query conditions. Consequently, FLIXR alleviates data congestion on the flash channels more effectively when the internal bandwidth is saturated.

5.5.7 Energy Consumption

Figure 5.13: Energy consumption normalized to the baseline (Scan, Join, Q1, Q6, Q4, Q12, Q14, Q17, and AVG for Baseline, Host-Side Index, FLIXR-I, and FLIXR-IB)

Figure 5.13 shows how the system's energy consumption changes when FLIXR is applied. The power dissipation of the host system was measured with a power-load data logger while running the benchmarks. The results show that FLIXR-IB reduces system energy consumption by 43.8% on average, while the host-side index reduces it by 30.4% compared to the baseline; FLIXR-IB thus saves 13.4% more energy than the host-side index.

The existing host-side index scheme performs the computations for accessing and managing the index structure on the host CPUs. In contrast, FLIXR uses the embedded processors in the SSD for the simple per-page index accesses and updates. In general, host CPUs such as x86 desktop processors require a larger power budget than the embedded processors in SSDs; for example, Intel desktop processors consume 20× more power than the ARM Cortex-A9 processor equipped in the OpenSSD [9]. FLIXR lets the processors in the SSD access and manage the index structure, which reduces the total amount of computation on the host processors, and it also shortens the execution time of query processing. As a result, FLIXR achieves better energy efficiency than the existing host-side index scheme.

5.6 Related Work

In-storage processing: Modern storage systems have the potential to act as active compute nodes near the data, since modern NVM-based storage devices include general-purpose embedded processors and memory systems comprising DRAM and NVM packages. Thus lightweight data processing functions can be performed inside the storage system before the data requested by the host is transferred to the host compute nodes. Several researchers have proposed in-storage processing approaches that reduce the data movement cost from storage systems and improve the performance of the overall computer system.

Acharya et al. proposed an active disk storage architecture, which deploys processors and memory in storage devices to support computation on the storage side [3]. Riedel et al. also presented an active disk platform for large-scale data mining applications [82]. These early in-storage processing proposals assumed storage device architectures that include general-purpose processors as powerful as the host processors.
Using a theoretical in-storage processing model, these studies showed that the performance of data-intensive applications benefits from offloading user functions to the storage-side processors. These in-storage processing ideas have since been applied to modern storage systems, as in-storage computation becomes more feasible and beneficial in modern SSDs that contain general-purpose embedded processors and high-density NAND flash packages [10, 92, 42, 20, 8]. Even though advances in CMOS scaling allow more powerful and energy-efficient embedded processors, the computation power of storage embedded processors is still lower than that of host CPUs, so commodity SSDs provide limited computation resources for in-storage processing. FLIXR minimizes the in-storage computation overhead of index creation and index comparison by exploiting the SSD's native page I/O procedure, so FLIXR's in-storage indexing model works well on commodity SSDs.

Several studies proposed in-storage processing approaches based on the Smart SSD model. Kang et al. explored the potential of the Smart SSD model by offloading the I/O tasks of MapReduce frameworks to storage devices [43]. Wang et al. proposed accelerating list intersection processing using the embedded storage processors of Smart SSD devices [95]. In-storage computation of SQL query processing functions has also been implemented using the Smart SSD model [23, 76]. In these papers the authors expected that the Smart SSD model could exploit the wide internal bandwidth offered by the multi-channel topology of NAND flash, whereas improvement of the external storage interface is limited; they showed that in-storage processing could be beneficial if future SSDs support higher internal bandwidth than the external data traffic capacity. However, most commodity SSDs configure the internal bandwidth to be similar to the external storage interface bandwidth for cost reasons, since data traffic from the SSD is restricted by the external interconnect even if the SSD provides higher internal bandwidth. We evaluate the performance of FLIXR on a real SSD development platform whose internal bandwidth (around 1.4 GB/s) is lower than the external PCIe interface (up to 4 GB/s). Under such a configuration, which reflects commodity SSDs better, offloading heavy functions to the storage processors may be less effective or even harmful to overall system performance. Our evaluation reveals that FLIXR's lightweight in-storage indexing mechanism effectively improves system performance on a mid-range SSD.

Seshadri et al. proposed a user-programmable SSD platform called Willow [86]. Instead of using fixed storage interfaces such as SATA and NVMe, Willow allows programmers to write general applications that run on compute cores with access to flash memory. In contrast, FLIXR extends the NVMe protocol to support its index creation and comparison functions. Koo et al. proposed the Summarizer framework, which enables in-storage computation opportunistically based on the computation resources that are dynamically available in the SSD [48]. FLIXR provides lightweight and optimized indexing functions that fit the limited computation resources of SSD embedded cores.

Several researchers exploit dedicated hardware logic to accelerate specific data processing operations for in-storage computation.
Jun et al. presented an FPGA-based hardware solution to perform big data analysis in NAND flash storage [41]. The Biscuit framework exploits hardware pattern matchers implemented on the flash channel paths to accelerate query processing [34]. Kim et al. presented a hardware engine that executes scan and join functions in the SSD [46]. In contrast, FLIXR modifies the SSD firmware architecture and the NVMe protocol to support in-storage indexing without additional hardware support in the SSD.

Index structures: Most modern DBMSs deploy index structures to reduce the significant overhead of accessing large database tables resident on storage devices. As huge volumes of data are managed by modern database systems on datacenter or cloud computing infrastructure, maintaining large-scale index structures is becoming more critical. Yu et al. argue that classical B+-tree indexes occupy significant storage space and require heavy computation to maintain; they proposed the Hippo indexing approach, which logs storage page ranges and abstracted histograms of the data distribution at each storage location [102]. To reduce query processing latency on large-scale cloud computing infrastructure, Wu et al. proposed hierarchical index structures composed of a local B+-tree index on each compute node and a structured overlay of global cloud indexes [99]. Dehne et al. presented distributed PDCR index structures for cloud-based real-time OLAP systems [22]. Finis et al. proposed an efficient indexing mechanism, called Order Indexes, for highly dynamic hierarchical databases [28]; this work studied several tree-based index structures that can update their structures quickly for dynamic use cases.

As large database tables are maintained across multiple compute nodes and storage systems, optimizing query processing on distributed database systems is a critical challenge [75]. Thus several studies have focused on efficient distributed database system architectures and query processing approaches. Sahli et al. introduced a distributed database system called StarDB [85]; StarDB supports an index management technique based on the well-known Boyer-Moore algorithm, which enables fast index build-up for string datasets. Mackert et al. validated the query optimizer and evaluated the performance of a distributed relational database management system [57], and also studied various join processing approaches, such as Semijoin and Bloomjoin, on distributed database systems. Ramesh et al. proposed enhanced distributed join processing approaches based on Bloomjoin [80].

FLIXR builds index structures inside the storage system using the storage embedded processors. Since FLIXR maintains index structures only for the datasets resident on its own local SSD, its index structures scale easily when a single host system attaches multiple FLIXR SSDs over a PCIe interconnection network. Note that FLIXR's in-storage indexes are maintained efficiently by exploiting the SSD's native architecture and I/O processes. In addition, the performance of conventional distributed database systems may suffer from communication overhead across multiple compute nodes. Hence we expect that in-storage indexes distributed across multiple SSDs in a single host system can be managed efficiently within each local FLIXR SSD, with a negligible computation burden on the host processor.
5.7 Chapter Summary

Recent large-scale database processing requires heavy data transfer from storage devices, and this data transfer has become a significant performance bottleneck for data analytical applications. In-storage processing is one of the emerging solutions to overcome this storage wall by offloading part of the query computation onto the embedded processor of the storage device. In this chapter we present an efficient in-SSD indexing mechanism, FLIXR, which exploits the SSD's unique FTL page-mapping architecture to organize page-level indexes. Using extended NVMe commands, FLIXR can execute user-defined index generation functions whenever page data is written or updated in flash memory. These page-level indexes, logged in the FTL table, are then used for filtering and join processing inside the SSD. FLIXR's page-level indexing mechanism enables efficient data filtering even with the wimpy embedded processor of the storage platform. We also present the FLIXR programming model, which can be easily applied to existing query processing applications. Our evaluation on the open-source SSD development platform shows that overall system performance improves by 37.8% across a wide range of query processing workloads.
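The extended NVMe commands mentioned above are not spelled out in this summary; as a rough, non-authoritative sketch of how such a command could be carried, the layout below reuses the standard 64-byte NVMe submission-queue entry and places hypothetical FLIXR parameters in the command-specific doublewords. The opcode value and the meanings assigned to CDW13 and CDW14 are assumptions for illustration only, not the encoding used by FLIXR.

```c
/*
 * Hedged sketch: an "indexed write" expressed as an NVMe vendor-specific I/O
 * command. The 64-byte submission-queue entry layout below follows the NVMe
 * specification; the opcode value and the FLIXR-specific use of CDW13/CDW14
 * are illustrative assumptions.
 */
#include <stdint.h>

#define OPC_FLIXR_WRITE_INDEXED 0x81u  /* hypothetical opcode in the
                                          vendor-specific I/O range (80h-FFh) */

struct nvme_sqe {
    uint32_t cdw0;       /* opcode[7:0], fused-operation bits, command ID    */
    uint32_t nsid;       /* namespace identifier                             */
    uint32_t cdw2;       /* reserved                                         */
    uint32_t cdw3;       /* reserved                                         */
    uint64_t mptr;       /* metadata pointer                                 */
    uint64_t prp1;       /* data buffer pointer (PRP entry 1)                */
    uint64_t prp2;       /* data buffer pointer (PRP entry 2)                */
    uint32_t cdw10;      /* starting LBA, low 32 bits (as in a normal write) */
    uint32_t cdw11;      /* starting LBA, high 32 bits                       */
    uint32_t cdw12;      /* number of logical blocks                         */
    uint32_t cdw13;      /* hypothetical: index-generation function ID       */
    uint32_t cdw14;      /* hypothetical: key/column selector for indexing   */
    uint32_t cdw15;      /* unused in this sketch                            */
};
```

On receiving such a write, the firmware would run the selected index-generation function over the incoming page and record the result in the per-page index fields next to the FTL mapping entry, so index maintenance rides along with the normal write path rather than requiring separate host-issued commands.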
Chapter 6

Conclusion

The rapid growth of data set sizes, and the demand for applications that process such huge volumes of data, have created compelling needs for high-performance data processing systems. Modern datacenter server platforms therefore harness GPUs as high-performance compute nodes that support massively parallel processing of vast amounts of data. However, the performance of the GPU memory system is a critical bottleneck, especially for modern data analytics applications ported to the GPU domain. These applications tend to create massive numbers of irregular memory requests that lead to significant congestion and resource contention in the GPU memory hierarchy. Consequently, GPU performance suffers from long data fetch latency and inefficient data utilization in the memory system.

Data access latency to storage devices is another critical performance hurdle for data processing systems. Modern big data applications rely on large data sets residing in non-volatile storage, and these data sets must be transferred to the memory systems of the compute nodes for computation. For big data analytics, data I/O time from storage occupies a large fraction of the entire execution time because the demanded data sets do not fit in the DRAM space of the compute nodes. In this dissertation we present approaches to tackle these performance challenges in the GPU memory hierarchy and in storage systems.

In Chapter 2 we reveal the characteristics of GPU memory system behavior under general-purpose GPU applications. To analyze the divergent performance impacts of different types of load instructions, we categorize global load instructions into deterministic and non-deterministic loads. Deterministic loads use indexes calculated from static parameter values, so they tend to generate coalesced memory requests within a single warp. In contrast, the addresses of non-deterministic loads are computed from non-static values. We found that the non-deterministic load instructions frequently observed in data analytics applications are the primary performance bottleneck in GPU memory systems.

The long latency of data fetches in the GPU memory system is one of the critical performance challenges of GPUs. Moreover, the bursts of memory requests observed in general-purpose applications dramatically increase queuing delays in the GPU memory hierarchy, so warps stall for hundreds of cycles until the data demanded by the preceding global loads reaches the local data cache. Prefetching is a well-explored approach to mitigating the performance overhead of long-latency memory operations, but there are challenges that limit its benefits in the GPU domain. In Chapter 3 we propose a CTA-aware prefetcher and scheduler that can guarantee the issuance of accurate prefetch requests with better timeliness. Our prefetch approach quickly detects the base address of each CTA and the stride within a single CTA by rescheduling warp priority.

In Chapter 4 we propose the access pattern-aware cache management (APCM) scheme to improve the low utilization of the GPU data cache. For general-purpose applications, the GPU cache suffers from a high miss ratio because a data cache containing only several tens of wide cache lines must be shared by dozens of concurrent warps; cache lines are therefore frequently evicted before they are re-referenced even when the loaded data has strong locality. APCM first detects the locality type of each load at runtime, and then applies a specific cache management scheme (bypassing, normal LRU, or protection) based on the detected locality type of each load. APCM manages the cache per load because individual loads exhibit divergent cache reuse distances and lifetimes depending on their locality types. APCM improves the performance of memory-intensive applications because the data cache space is utilized more efficiently.

Modern big data analytics demands huge volumes of data resident on storage devices, so transferring such large data sets from storage to compute nodes is critical to the performance of data computing systems. Conventional computer systems rely only on the computation ability of the host processors to manage data accesses to storage devices and to execute data analytics applications; hence the entire data set in storage needs to be transferred to the memory systems of the host compute nodes for computation. In contrast, our approach utilizes the computation power of the embedded processor enclosed in the SSD to create and manage index structures in the storage system. Our solution, called FLIXR, creates indexes alongside the native address translation structure, and then performs page filtering and join processing by comparing the created indexes against the database query conditions. Consequently, FLIXR effectively reduces the data traffic from the storage system. We present this in-storage indexing mechanism in Chapter 5.

Reference List
[1] FreePDK process design kit. www.eda.ncsu.edu/wiki/FreePDK. [2] GPGPU-sim manual. [3] Anurag Acharya, Mustafa Uysal, and Joel Saltz. Active Disks: Programming Model, Algorithms and Evaluation. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '98, pages 81–91, New York, NY, USA, 1998. ACM. [4] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Pearson Education, 2nd edition, 2006. [5] AMD. AMD Graphics Cores Next (GCN) Architecture. [6] Jean-Loup Baer and Tien-Fu Chen. An Effective On-chip Preloading Scheme to Reduce Data Access Penalty. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, Supercomputing '91, pages 176–186, 1991.
[7] Ali Bakhoda, George Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA Workloads Using a Detailed GPU Simulator. In IEEE Interna- tional Symposium on Performance Analysis of Systems and Software, ISPASS ’09, pages 163–174, April 2009. [8] Rajeev Balasubramonian, Jichuan Chang, Troy Manning, Jaime H. Moreno, Richard Murphy, Ravi Nair, and Swanson Swanson. Near-Data Processing: In- sights from a MICRO-46 Workshop. IEEE Micro, 34(4):36–42, July 2014. [9] E. Blem, J. Menon, and K. Sankaralingam. Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures. In 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pages 1–12, Feb 2013. [10] Simona Boboila, Youngjae Kim, Sudharshan S. Vazhkudai, Peter Desnoyers, and Galen M. Shipman. Active Flash: Out-of-Core Data Analytics on Flash Storage. In IEEE 28th Symposium on Mass Storage Systems and Technologies, MSST ’12, pages 1–12, April 2012. 157 [11] Martin Burtscher, Rupesh Nasre, and Keshav Pingali. A Quantitative Study of Irregular Programs on GPUs. In Proceedings of the IEEE International Sym- posium on Workload Characterization, IISWC ’12, pages 141–151, Washington, DC, USA, 2012. IEEE Computer Society. [12] Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollow, Rajesh K. Gupta, and Steven Swanson. Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, pages 385–395, Washington, DC, USA, 2010. IEEE Computer Society. [13] Adrian M. Caulfield, Laura M. Grupp, and Steven Swanson. Gordon: Using Flash Memory to Build Fast, Power-efficient Clusters for Data-intensive Applications. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’09, pages 217–228, New York, NY , USA, 2009. ACM. [14] Niladrish Chatterjee, Mike O’Connor, Gabriel H. Loh, Nuwan Jayasena, and Rajeev Balasubramonian. Managing DRAM Latency Divergence in Irregular GPGPU Applications. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, pages 128– 139, 2014. [15] Shuai Che, Bradford M. Beckmann, Steven K. Reinhardt, and Kevin Skadron. Pannotia: Understanding Irregular GPGPU Graph Applications. In Proceedings of IEEE International Symposium on Workload Characterization, pages 185–195, 2013. [16] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A Benchmark Suite for Heteroge- neous Computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization, IISWC ’09, pages 44–54, Washington, DC, USA, 2009. IEEE Computer Society. [17] Shuai Che, Jeremy W. Sheaffer, Michael Boyer, Lukasz G. Szafaryn, Liang Wang, and Kevin Skadron. A Characterization of the Rodinia Benchmark Suite with Comparison to Contemporary CMP Workloads. In Proceedings of IEEE Interna- tional Symposium on Workload Characterization, pages 1–11, 2010. [18] Shuai Che and Kevin Skadron. Benchfriend: Correlating the performance of gpu benchmarks. Int. J. High Perform. Comput. Appl., 28(2):238–250, May 2014. 158 [19] Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, and Wen-Mei Hwu. Adaptive Cache Management for Energy-Efficient GPU Comput- ing. 
In Proceedings of the IEEE/ACM International Symposium on Microarchitec- ture, MICRO-47, pages 343–355, Washington, DC, USA, 2014. IEEE Computer Society. [20] Sangyeun Cho, Chanik Park, Hyunok Oh, Sungchan Kim, Youngmin Yi, and Gre- gory R. Ganger. Active Disk Meets Flash: A Case for Intelligent SSDs. In Pro- ceedings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, pages 91–102, New York, NY , USA, 2013. ACM. [21] Anthony Danalis, Gabriel Marin, Collin McCurdy, Jeremy S. Meredith, Philip C. Roth, Kyle Spafford, Vinod Tipparaju, and Jeffrey S. Vetter. The Scalable Het- erogeneous Computing (SHOC) Benchmark Suite. In Proceedings of the Work- shop on General-Purpose Computation on Graphics Processing Units, GPGPU-3, pages 63–74, New York, NY , USA, 2010. ACM. [22] F. Dehne, Q. Kong, A. Rau-Chaplin, H. Zaboli, and R. Zhou. Scalable Real-time OLAP on Cloud Architectures. Journal of Parallel and Distributed Computing, 79(C):31–41, May 2015. [23] Jaeyoung Do, Yang-Suk Kee, Jignesh M. Patel, Chanik Park, Kwanghyun Park, and David J. DeWitt. Query Processing on Smart SSDs: Opportunities and Chal- lenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 1221–1230, New York, NY , USA, 2013. ACM. [24] Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero, and Alexander V . Veidenbaum. Improving Cache Management Policies Using Dy- namic Reuse Distances. In Proceedings of the 45th Annual IEEE/ACM Interna- tional Symposium on Microarchitecture, MICRO-45, pages 389–400, Washington, DC, USA, 2012. IEEE Computer Society. [25] Ramez Elmasri and Shamkant Navathe. Fundamentals of Database Systems. Addison-Wesley Publishing Company, USA, 6th edition, 2010. [26] FastestSSD. SSD Ranking: The Fastest SSDs. http://www.fastestssd.com/ featured/ssd-rankings-the-fastest-solid-state-drives. [27] Kayvon Fatahalian and Mike Houston. A Closer Look at GPUs. Communication of the ACM, 51(10):50–57, October 2008. [28] Jan Finis, Robert Brunel, Alfons Kemper, Thomas Neumann, Norman May, and Franz Faerber. Order Indexes: Supporting Highly Dynamic Hierarchical Data in Relational Main-memory Database Systems. The VLDB Journal, 26(1):55–80, February 2017. 159 [29] Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-40, pages 407–420, Washington, DC, USA, 2007. IEEE Computer Society. [30] Mark Gebhart, Daniel R. Johnson, David Tarjan, Stephen W. Keckler, William J. Dally, Erik Lindholm, and Kevin Skadron. Energy-efficient Mechanisms for Man- aging Thread Context in Throughput Processors. In Proceedings of the 38th An- nual International Symposium on Computer Architecture, ISCA ’11, pages 235– 246, 2011. [31] Paul Gratz, Boris Grot, and Stephen W. Keckler. Regional Congestion Awareness for Load Balance in Networks-on-Chip. In Proceedings of IEEE International Symposium on High Performance Computer Architecture, pages 203–214, 2008. [32] Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. Auto-tuning a high-level language targeted to gpu codes. In Proceed- ings of Innovative Parallel Computing, InPar ’12, pages 1–10, 2012. [33] Laura M. Grupp, John D. Davis, and Steven Swanson. The Bleak Future of NAND Flash Memory. 
In Proceedings of the 10th USENIX Conference on File and Stor- age Technologies, FAST ’12, pages 2–2, Berkeley, CA, USA, 2012. USENIX Association. [34] Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, Jaeheon Jeong, and Duckhyun Chang. Biscuit: A Framework for Near-data Processing of Big Data Workloads. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pages 153–165, Piscataway, NJ, USA, 2016. IEEE Press. [35] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. Mars: A MapReduce Framework on Graphics Processors. In Proceed- ings of the International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, pages 260–269, New York, NY , USA, 2008. ACM. [36] Joel Hestness, Stephen W. Keckler, and David A. Wood. A comparative analysis of microarchitecture effects on cpu and gpu memory system behavior. In Pro- ceedings of IEEE International Symposium on Workload Characterization, pages 150–160, 2014. [37] Hynix. 1Gb GDDR5 SGRAM H5GQ1H24AFR Specification. http://www. hynix.com/datasheet/pdf/graphics/H5GQ1H24AFR(Rev1.0).pdf. 160 [38] Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. OWL: Cooperative Thread Array Aware Scheduling Techniques for Improving GPGPU Performance. In Proceedings of the 18th International Conference on Architectural Support for Programming Languages and Operating Systems, ASP- LOS ’13, pages 395–406, 2013. [39] Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. Orchestrated Scheduling and Prefetching for GPGPUs. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, pages 332–343, 2013. [40] Theodore Johnson. Performance Measurements of Compressed Bitmap Indices. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB ’99, pages 278–289, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. [41] Sang-Woo Jun, Ming Liu, Sungjin Lee, Jamey Hicks, John Ankcorn, Myron King, Shuotao Xu, and Arvind. BlueDBM: An Appliance for Big Data Analytics. In Proceedings of the 42Nd Annual International Symposium on Computer Architec- ture, ISCA ’15, pages 1–13, New York, NY , USA, 2015. ACM. [42] Myoungsoo Jung and Mahmut Kandemir. Middleware - Firmware Cooperation for High-speed Solid State Drives. In Proceedings of the Posters and Demo Track, Middleware ’12, pages 5:1–5:2, New York, NY , USA, 2012. ACM. [43] Yangwook Kang, Yang suk Kee, Ethan L. Miller, and Chanik Park. Enabling cost-effective data processing with smart SSD. In IEEE 29th Symposium on Mass Storage Systems and Technologies, MSST ’14, pages 1–12, May 2013. [44] Onur Kayıran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. Neither More nor Less: Optimizing Thread-level Parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, pages 157–166, Piscataway, NJ, USA, 2013. IEEE Press. [45] Keunsoo Kim, Sangpil Lee, Myung Kuk Yoon, Gunjae Koo, Won Woo Ro, and Murali Annavaram. Warped-Preexecution: A GPU Pre-Execution Approach for Improving Latency Hiding. In IEEE International Symposium on High Perfor- mance Computer Architecture, HPCA ’16, pages 163–175, 2016. [46] Sungchan Kim, Hyunok Oh, Chanik Park, Sangyeun Cho, Sang-Won Lee, and Bongki Moon. 
In-storage processing of database scans and joins. Information Sciences, 327:183 – 200, 2016. 161 [47] Gunjae Koo, Hyeran Jeon, and Murali Annavaram. Revealing Critical Loads and Hidden Data Locality in GPGPU Applications. In Proceedings of the IEEE Inter- national Symposium on Workload Characterization, IISWC ’15, pages 120–129, Washington, DC, USA, 2015. IEEE Computer Society. [48] Gunjae Koo, Kiran Kumar Matam, Te I, H. V . Krishna Giri Narra, Jing Li, Hung- Wei Tseng, Steven Swanson, and Murali Annavaram. Summarizer: Trading Communication with Computing Near Storage. In Proceedings of the 50th An- nual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pages 219–231, New York, NY , USA, 2017. ACM. [49] Gunjae Koo, Yunho Oh, Won Woo Ro, and Murali Annavaram. Access Pattern- Aware Cache Management for Improving Data Utilization in GPU. In Proceed- ings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, pages 307–319, 2017. [50] Nagesh B Lakshminarayana and Hyesoon Kim. Spare Register Aware Prefetching for Graph Algorithms on GPUs. In Proceedings of 2014 IEEE 20th International Symposium on High Performance Computer Architecture, pages 614–625, 2014. [51] Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, and Richard Vuduc. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. In Pro- ceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Mi- croarchitecture, MICRO-43, pages 213–224, Washington, DC, USA, 2010. IEEE Computer Society. [52] Minseok Lee, Seokwoo Song, Joosik Moon, John Kim, Woong Seo, Yeongon Cho, and Soojung Ryu. Improving GPGPU resource utilization through alterna- tive thread block scheduling. In IEEE International Symposium on High Perfor- mance Computer Architecture, HPCA ’14, pages 260–271, 2014. [53] Jingwen Leng, Tayler Hetherington, Ahmed ElTantawy, Syed Gilani, Nam Sung Kim, Tor M. Aamodt, and Vijay Janapa Reddi. GPUWattch: Enabling Energy Optimizations in GPGPUs. In Proceedings of the 40th Annual International Sym- posium on Computer Architecture, ISCA ’13, pages 487–498, New York, NY , USA, 2013. ACM. [54] Dong Li, Minsoo Rhu, Daniel R. Johnson, Mike O’Connor, Mattan Erez, Doug Burger, Donald S. Fussell, and Stephen W. Keckler. Priority-Based Cache Al- location in Throughput Processors. In IEEE International Symposium on High Performance Computer Architecture, HPCA ’15, pages 89–100, 2015. [55] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE, 28(2):39–55, 2008. 162 [56] Dongzhe Ma, Jianhua Feng, and Guoliang Li. LazyFTL: A Page-level Flash Translation Layer Optimized for NAND Flash Memory. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIG- MOD ’11, pages 1–12, New York, NY , USA, 2011. ACM. [57] Lothar F. Mackert and Guy M. Lohman. R* Optimizer Validation and Perfor- mance Evaluation for Local Queries. In Proceedings of the 1986 ACM SIGMOD International Conference on Management of Data, SIGMOD ’86, pages 84–95, New York, NY , USA, 1986. ACM. [58] Priti Mishra and Margaret H. Eich. Join Processing in Relational Databases. ACM Computing Surveys (CSUR), 24(1):63–113, March 1992. [59] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P. Jouppi. Cacti 6.0: A tool to model large caches. Technical report, HP Laboratories, 2009. [60] Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 
Improving GPU Performance via Large Warps and Two-level Warp Scheduling. In Proceedings of the 44th Annual IEEE/ACM Inter- national Symposium on Microarchitecture, MICRO-44, pages 308–317, 2011. [61] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable Parallel Programming with CUDA. Queue - GPU Computing, 6(2):45–53, Apr/Mar 2008. [62] Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal, and Henri Bal. A Detailed GPU Cache Model Based on Reuse Distance Theory. In Proceedings of IEEE International Symposium on High Performance Computer Architecture, pages 37–48, 2014. [63] NVIDIA. Cuda toolkit 4.0. [64] NVIDIA. NVIDIA CUDA C Programming Guide. [65] NVIDIA. NVIDIA CUDA Profiler User’s Guide. [66] NVIDIA. NVIDIA CUDA SDK 2.3. http://developer.nvidia.com/ cuda-toolkit-23-downloads. [67] NVIDIA. NVIDIA GeForce GTX 980 Featuring Maxwell, The Most Advanced GPU Ever Made. [68] NVIDIA. NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110. http://www.nvidia.com/content/PDF/kepler/ NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf. 163 [69] NVIDIA. NVIDIAs Next Generation CUDA Compute Architecture: Fermi. http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_ Fermi_Compute_Architecture_Whitepaper.pdf. [70] NVM Express. NVM Express Revision 1.1. http://www.nvmexpress.org/ wp-content/uploads/NVM-Express-1_1.pdf. [71] Molly A. O’Neil and Martin Burtscher. Microarchitectural Performance Charac- terization of Irregular GPU Kernels. In Proceedings of IEEE International Sym- posium on Workload Characterization, pages 130–139, 2014. [72] Patrick O’Neil and Dallan Quass. Improved Query Performance with Variant Indexes. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, SIGMOD ’97, pages 38–49, New York, NY , USA, 1997. ACM. [73] OpenSSD. Open-Source Solid-State Drive Project for Research and Education. http://openssd.io. [74] Jian Ouyang, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. SDF: Software-defined Flash for Web-scale Internet Storage Systems. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pages 471–484, New York, NY , USA, 2014. ACM. [75] M. Tamer ¨ Ozsu and Patrick Valduriez. Principles of Distributed Database Sys- tems. Springer, New York, NY , USA, 3 edition, 2016. [76] Kwanghyun Park, Yang-Suk Kee, Jignesh M. Patel, Jaeyoung Do, Chanik Park, and David J. DeWitt. Query Processing on Smart SSDs. Bulletin of IEEE Techni- cal Committee on Data Engineering, 37(2):29–26, June 2014. [77] PassMark. PassMark CPU benchmark. http://www.cpubenchmark.net. [78] PCI-SIG. PCI Express Specifications. http://pcisig.com/specifications. [79] Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. McGraw-Hill, Inc., New York, NY , USA, 3 edition, 2003. [80] Sukriti Ramesh, Odysseas Papapetrou, and Wolf Siberski. Optimizing Distributed Joins with Bloom Filters. In Proceedings of the 5th International Conference on Distributed Computing and Internet Technology, ICDCIT ’08, pages 145–156, Berlin, Heidelberg, 2009. Springer-Verlag. 164 [81] Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. A Locality- aware Memory Hierarchy for Energy-efficient GPU Architectures. In Proceed- ings of the IEEE/ACM International Symposium on Microarchitecture, MICRO- 46, pages 86–98, New York, NY , USA, 2013. ACM. [82] Erik Riedel, Christos Faloutsos, Garth A. Gibson, and David Nagle. Active Disks for Large-Scale Data Processing. 
Computer, 34(6):68–74, June 2001. [83] Timothy G. Rogers, Mike O’Connor, and Tor M. Aamodt. Cache-Conscious Wavefront Scheduling. In Proceedings of the IEEE/ACM International Sympo- sium on Microarchitecture, MICRO-45, pages 72–83, Washington, DC, USA, 2012. IEEE Computer Society. [84] Timothy G. Rogers, Mike O’Connor, and Tor M. Aamodt. Divergence-aware Warp Scheduling. In Proceedings of the IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 99–110, New York, NY , USA, 2013. ACM. [85] Majed Sahli, Essam Mansour, and Panos Kalnis. StarDB: A Large-scale DBMS for Strings. Proc. VLDB Endow., 8(12):1844–1847, August 2015. [86] Sudharsan Seshadri, Mark Gahagan, Sundaram Bhaskaran, Trevor Bunker, Arup De, Yanqin Jin, Yang Liu, and Steven Swanson. Willow: A User-programmable SSD. In Proceedings of the 11th USENIX Conference on Operating Systems De- sign and Implementation, OSDI ’14, pages 67–80, Berkeley, CA, USA, 2014. USENIX Association. [87] Ankit Sethia, Ganesh Dasika, Mehrzad Samadi, and Scott Mahlke. Apogee: Adaptive prefetching on gpus for energy efficiency. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, PACT ’13, pages 73–82, Piscataway, NJ, USA, 2013. IEEE Press. [88] Leonard D. Shapiro. Join Processing in Database Systems with Large Main Mem- ories. ACM Transactions on Database Systems (TODS), 11(3):239–264, August 1986. [89] Yong Ho Song. Cosmos+ OpenSSD: A NVMe-based Open Source SSD Platform. In Flash Memory Summit 2016, Santa Clara, CA, USA, 2016. [90] John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, and Wen-mei W. Hwe. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing. Tech- nical report, Center for Reliable and High-Performance Computing, University of Illinois at Urbana-Champaign, 2012. 165 [91] Yingying Tian, Sooraj Puthoor, Joseph L. Greathouse, Bradford M. Beckmann, and Daniel A. Jimenez. Adaptive GPU Cache Bypassing Categories and Subject Descriptors. In Proceedings of the Workshop on General Purpose Processing Using GPUs, GPGPU-8, pages 36–47, 2015. [92] Devesh Tiwari, Sudharshan S. Vazhkudai, Youngjae Kim, Xiaosong Ma, Simona Boboila, and Peter J. Desnoyers. Reducing Data Movement Costs Using Energy Efficient, Active Computation on SSD. In Proceedings of the 2012 USENIX Con- ference on Power-Aware Computing and Systems, HotPower ’12, Berkeley, CA, USA, 2012. USENIX Association. [93] TPC. TPC Benchmark C (TPC-C) Standard Specification Revision 5.11. http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-c_v5. 11.0.pdf. [94] TPC. TPC Benchmark H (TPC-H). http://www.tpc.org/tpch/. [95] Jianguo Wang, Dongchul Park, Yang-Suk Kee, Yannis Papakonstantinou, and Steven Swanson. SSD In-storage Computing for List Intersection. In Proceed- ings of the 12th International Workshop on Data Management on New Hardware, DaMoN ’16, pages 4:1–4:7, New York, NY , USA, 2016. ACM. [96] Kai Wang and Calvin Lin. Decoupled Affine Computation for SIMT GPUs. In Proceedings of the 44th Annual International Symposium on Computer Architec- ture, ISCA ’17, pages 295–306, 2017. [97] Steven J. E. Wilton and Norman P. Jouppi. CACTI: An Enhanced Cache Access and Cycle Time Model. IEEE Journal of Solid-State Circuits, 31(5):677–688, May 1996. [98] Kesheng Wu, Ekow J. Otoo, and Arie Shoshani. A Performance Comparison of Bitmap Indexes. 
In Proceedings of the Tenth International Conference on Infor- mation and Knowledge Management, CIKM ’01, pages 559–561, New York, NY , USA, 2001. ACM. [99] Sai Wu, Dawei Jiang, Beng Chin Ooi, and Kun-Lung Wu. Efficient B-tree Based Indexing for Cloud Data Processing. Proc. VLDB Endow., 3(1-2):1207–1218, September 2010. [100] Xilinx. Zynq-7000 All Programmable SoC Data Sheet. https://www.xilinx. com/support/documentation/data_sheets/ds190-Zynq-7000-Overview. pdf. [101] Qiumin Xu, Hyeran Jeon, and Murali Annavaram. Graph Processing on GPUs: Where are the Bottlenecks? In Proceedings of the IEEE International Symposium on Workload Characterization, IISWC ’14, pages 140–149, 2014. 166 [102] Jia Yu and Mohamed Sarwat. Two Birds, One Stone: A Fast, Yet Lightweight, Indexing Scheme for Modern Database Systems. Proceedings of the VLDB En- dowment, 10(4):385–396, November 2016. [103] George L. Yuan, Ali Bakhoda, and Tor M. Aamodt. Complexity Effective Mem- ory Access Scheduling for Many-core Accelerator Architectures. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-42, pages 34–44, New York, NY , USA, 2009. ACM. [104] Mohammed Zackriya V and Harish M. Kittur. Precharge-Free, Low-Power Content-Addressable Memory. IEEE Transactions on Very Large Scale Integra- tion (VLSI) Systems, PP(99):1–8, 2016. [105] You Zhou, Fei Wu, Ping Huang, Xubin He, Changsheng Xie, and Jian Zhou. An Efficient Page-level FTL to Optimize Address Translation in Flash Memory. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys ’15, pages 12:1–12:16, New York, NY , USA, 2015. ACM. 167
Abstract
Recent exponential growth in data set sizes, demanded by modern big data applications, requires innovative computer systems architecture design. To accelerate the processing of huge volumes of data, modern datacenter server systems harness graphics processing units (GPUs) as high-performance compute nodes that support massively parallel processing of data-intensive applications. The datacenter server systems also deploy non-volatile memory (NVM)-based storage devices to provide lower data transfer latency from storage to compute nodes. Despite employing these advanced parallel processors and storage systems, server systems still suffer from the overhead of transferring massive amounts of data. For instance, long data fetch latency is one of the critical performance bottlenecks in GPUs, and big data applications that rely on storage-resident data spend a significant fraction of their execution time on data input/output (I/O).

In this dissertation we present architectural approaches for improving the performance of GPUs and storage systems, which are critical hardware components in modern datacenter server systems. First we analyze the characteristics and performance impacts of global load instructions on the GPU memory hierarchy. Based on the index generation rule of global load instructions, we categorize load instructions into two different types, deterministic and non-deterministic. We reveal that the non-deterministic loads create bursts of irregular memory requests, which drain the resources of the GPU memory hierarchy.

The long latency of global memory instructions is one of the critical performance bottlenecks in GPUs, since the long data fetch latency cannot be hidden by the GPU's quick context switching among tens of concurrent warps. To mitigate the performance overhead of load instructions, we propose an efficient prefetch mechanism combined with a prefetch-aware warp scheduler. The proposed prefetcher estimates accurate prefetch addresses by detecting the base address of each thread block and the common stride observed between neighbor warps within a single thread block. To provide better timeliness for prefetch requests, the prefetch-aware warp scheduler reorganizes the warp execution priority so that the information required for prefetch address estimation is detected as quickly as possible. Hence the proposed prefetcher is able to issue accurate prefetch requests sufficiently ahead of the target demand fetch requests.

GPUs employ a local data cache to mitigate the performance overhead of long-latency load instructions. However, the GPU data cache is not utilized efficiently, since the small data cache shared by dozens of warps suffers from significant cache contention and premature data eviction. To improve data cache utilization we propose Access Pattern-Aware Cache Management (APCM), which provides a fine-grained, per-load cache management scheme. We discover that individual global load instructions exhibit different types of warp-based locality behavior, and data allocated by loads of different locality types has divergent reuse intervals and lifetimes in the data cache. Furthermore, load instructions that share the same program counter exhibit consistent locality properties across all warps originating from the same kernel. Based on this discovery, APCM applies the per-load cache management scheme to all warps once the optimal cache management scheme is determined for each global load in a single warp.
To detect the locality type of each load, APCM first tracks the cache access history at runtime using a dedicated tag array structure for a single monitored warp. Then APCM selectively applies cache bypassing or cache line protection for all warps based on the detected locality types of individual loads. Our evaluation shows that APCM improves performance by 34% for data-intensive GPU applications.

Storage I/O time is becoming more critical in datacenter server systems, since modern big data applications demand huge volumes of data resident on storage devices and these large data sets cannot fit in the external memory (such as DRAM) of the compute nodes. Thus modern database systems exploit index structures to reduce data accesses to storage. However, as database size grows exponentially, the storage space overhead of the index structures also increases significantly, and scanning indexes creates frequent accesses to storage. Moreover, updating large indexes is a heavy computation burden for host systems. In this dissertation we propose an in-storage indexing mechanism, called FLIXR, that uses the embedded processor and memory systems enclosed in modern storage devices. FLIXR builds and maintains page-level indexes while database table data is written or updated in the SSD. Exploiting the native address translation and page I/O processes of the SSD, FLIXR efficiently performs index comparison and join processing functions to filter out unneeded page data fetches from the SSD. FLIXR shows a 21.7% performance improvement for index maintenance workloads and a 1.81× performance uplift for a wide range of query processing benchmarks.

To summarize, in this dissertation we present several architectural approaches that mitigate critical data movement costs in GPUs and storage systems, which are widely deployed in datacenter server systems and cloud infrastructure. This dissertation thus contributes to improving the performance of high-performance computer systems for modern big data applications.