Efficient Graph Processing with Graph Semantics Aware Intelligent Storage

by

Kiran Kumar Matam

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

May 2020

Copyright 2020 Kiran Kumar Matam

Dedication

To my mother, Gavi Vemavathi, and father, Matam Chandra Mouli, whose unconditional love, courage, wisdom, and devotion have been the strength of my striving.

Acknowledgements

I would like to express my heartfelt gratitude to my advisor, Professor Murali Annavaram, for the continuous support during my Ph.D. study. He has been a tremendous advisor who led me into this exciting research area. He is always enthusiastic, kind, and patient, and was always there during the times I needed him the most. I especially appreciate that he let me work at my own pace, gave me freedom to explore my ideas, and was willing to accommodate my mistakes. This thesis greatly benefited from all the insightful discussions that we had, and I greatly benefited from his vast research experience. It would not have been possible to finish my Ph.D. without his constant support.

In addition to my advisor, I would particularly like to thank my thesis/qualifying committee members: Professor Antonio Ortega, Professor Shahram Ghandeharizadeh, Professor Michel Dubois, Professor Ramesh Govindan, and Professor Viktor Prasanna, for always being supportive and providing insightful suggestions to improve my thesis. I would also like to thank Prof. Viktor Prasanna, Prof. Michel Dubois, and Prof. Xuehai for their valuable guidance during my Ph.D. years.

I am grateful to my fellow lab mate Gunjae Koo for all the stimulating discussions that we had while working on several chapters in this thesis. Special thanks to my collaborator Prof. Hung-Wei from the University of California, Riverside, whose support was invaluable in getting started with the thesis topic. Thanks to my other collaborators from our lab: Mohammad Abdel-Majeed, Haipeng Zha, Hanieh Hashemi, and H. V. Krishna Giri Narra, whose support was very valuable in finishing the thesis related projects. Special thanks to Prof. Shariar Shamsian, Prof. Michael Shindler, and Prof. Sheila Tejada, under whom I had the opportunity to serve as a teaching assistant. I would also like to thank the CS department adviser, Lizsl Deleon, for helping me to navigate through all the administrative processes; you are the best!

I thank my other fellow labmates in the SCIP group: Zhifeng Lin, Ruisheng Wang, Qiumin Xu, Daniel Wong, Abdulaziz Tabbakh, Yunho Oh, Mehrtash Manoochehri, Hyeran Jeon, Tingting Tang, Abdulla Alshabanah, Yongqin Wang, and others for all the valuable discussions that we had. You guys made coming to the lab more fun.

I also thank my friends at USC: Krishna giri narra, Rama reddy, Karishma sharma, Aravind bhimarasetty, Malati chavva, Sundar aditya, Kiran kumar lekkala, Sanjay purushotham, Jitin singla, Sanmukh rao, Abhinav bharadwaj, Abhijit kashyap, Vinod kristem, Ravi teja sanampudi, Sashi kiran, Varsha edukulla, Ankith sharma, Sulagna mukherjee, Ren chen, Lucia sun, Andrea sanny, Charith wickramaarachchi, Om patri, Smita gaikwad, Depaali bhola, and many others, for all the fun times we shared together and also for the much needed emotional support to survive the Ph.D. journey; without you guys I can't imagine finishing the Ph.D.
Last but not the least, I owe my deep gratitude to my beloved family: my parents and brother Sudheer, for their unconditional love and support throughout writing this thesis and my life.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
  1.1 SSD Overview
    1.1.1 Architecture of modern data center SSDs
    1.1.2 NVMe
  1.2 Graph Processing Overview
    1.2.1 Graph Computational model
    1.2.2 Graph formats
    1.2.3 CSR format in the era of SSDs
    1.2.4 Challenges for graph processing with a CSR format
    1.2.5 A case for graph semantic awareness in SSDs
  1.3 Thesis statement
  1.4 Thesis Overview
    1.4.1 #1: GraphSSD: Graph semantics aware SSD
    1.4.2 #2: MultiLogVC: Efficient out-of-core graph processing framework
    1.4.3 #3: Summarizer: Generalizing near storage processing
  1.5 Dissertation Organization
2 GraphSSD: Graph Semantics Aware SSD
  2.1 Chapter Overview
  2.2 Background
    2.2.1 Out-of-core graph processing
    2.2.2 Graph updates
  2.3 Architecture
    2.3.1 Graph command decoder
    2.3.2 Graph translation layer
    2.3.3 Supporting graph updates
    2.3.4 Handling Graph Updates Efficiently With Caching and Delta Graph
    2.3.5 Consistency considerations
    2.3.6 GraphSSD cache manager
    2.3.7 Graph command handling examples
  2.4 Workloads and implementation details
    2.4.1 Workloads
    2.4.2 Baseline system
    2.4.3 Caching and Multi-threading
  2.5 Evaluation
  2.6 Experimental results
    2.6.1 Performance of basic APIs
    2.6.2 Application performance
    2.6.3 Comparison with GraphChi
    2.6.4 GraphSSD Overheads
    2.6.5 Graph updates
    2.6.6 Wear-levelling analysis
  2.7 Related work
    2.7.1 Storage system
    2.7.2 Graph processing
  2.8 Chapter Summary
3 MultiLogVC: Efficient Out-of-Core Graph Processing Framework on Flash Storage
  3.1 Chapter Overview
  3.2 Background and Motivation
    3.2.1 Out-of-core graph processing
    3.2.2 Shrinking size of active vertices
  3.3 CSR format in the era of SSDs
    3.3.1 Challenges for graph processing with a CSR format
  3.4 MultiLogVC: Three Key Insights
    3.4.1 Avoid random write overhead with logging
    3.4.2 Eliminate sorting overhead with a multi-log
    3.4.3 Reduce read overhead with an edge log
  3.5 Multi-log Architecture
    3.5.1 Multi-Log Update Unit
      3.5.1.1 Generating vertex interval
      3.5.1.2 Fusing vertex intervals
      3.5.1.3 SSD-centric log optimizations
    3.5.2 Sort and Group Unit
      3.5.2.1 SSD pre-sorting capability
      3.5.2.2 Active vertex extraction
      3.5.2.3 Graph Loader Unit
    3.5.3 Edge-log optimizer
    3.5.4 Supporting the generality of vertex-centric programming
    3.5.5 Graph structural updates
    3.5.6 Programming model
  3.6 System design and Implementation
  3.7 Applications
  3.8 Experimental evaluation
  3.9 Related work
  3.10 Chapter Summary
4 Summarizer: Trading Communication with Computing Near Storage
  4.1 Chapter Overview
  4.2 Background and motivation
    4.2.1 Potential of in-SSD computing
  4.3 Summarizer
  4.4 SSD Controller architecture
    4.4.1 Summarizer architecture and operations
    4.4.2 Composing Summarizer applications
  4.5 Case studies
    4.5.1 Data analytics
    4.5.2 Data integration
  4.6 Methodology and implementation details
  4.7 Evaluation
    4.7.1 Evaluation platform
    4.7.2 Calibration based on workload measurements
    4.7.3 Summarizer Performance
    4.7.4 Design space exploration: Internal/external bandwidth ratio
    4.7.5 Design space exploration: In-SSD computing power
    4.7.6 Cost effectiveness
  4.8 Related work
  4.9 Chapter Summary
5 Conclusion
Reference List

List of Tables

2.1 GraphSSD APIs
2.2 Graph dataset
3.1 Graph dataset
4.1 New NVMe commands to support Summarizer
4.2 Processing by I/O ratio on data center SSDs

List of Figures

1.1 The architecture of a modern SSD
1.2 Graph storage formats
2.1 GraphSSD architecture overview
2.2 Graph translation table
2.3 Page layout
2.4 An example of GTT and page layout (storing out-going neighbors) for the shown example graph
2.5 An example for updating GTT with extended bit
2.6 Flow chart of add edge operation
2.7 Relative performance of basic GraphSSD API
2.8 Bandwidth for basic GraphSSD API (MB/second)
2.9 Random-walk performance relative to baseline (X-axis is the number of random-walks performed)
2.10 Application speedup relative to baseline
2.11 NAND page access counts relative to baseline
2.12 Bandwidth utilization relative to baseline
2.13 Speedups relative to GraphChi
2.14 Graph update performance relative to a baseline where graphs are not updated
2.15 Performance with varying delta graph sizes
3.1 Graph storage formats
3.2 Active vertices and edges over supersteps
3.3 Accessed graph pages with less than 10% utilization
3.4 Layout of the MultiLogVC framework
3.5 BFS application performance
3.6 Application performance relative to GraphChi
3.7 Application performance comparisons over supersteps
3.8 Performance comparison
3.9 Percentage of inefficient pages predicted
3.10 Memory scalability
4.1 The overall architecture of NVMe controller and Summarizer
4.2 Detailed in-SSD task control used by Summarizer
4.3 The SSD development platform
4.4 Execution time by the ratio of in-SSD computation
4.5 Performance improvement by internal/external bandwidth ratio
4.6 Performance improvement by SSD controller's computation power
4.7 Price by processor performance

Abstract

Graph analytics play a key role in a number of applications such as social networks, drug discovery, and recommendation systems. As the size of graphs continues to grow, storage access latency is a critical performance hurdle for accelerating graph analytics. Solid state drives (SSDs), with lower latency and higher bandwidth when compared to hard drives, along with the computing infrastructure embedded in SSDs, have the potential to accelerate large-scale graph applications. However, to exploit the full potential of SSDs in graph applications there is a need to rethink how graphs are stored on SSDs, as well as design new protocols for how SSDs access and manipulate graph data.
This thesis presents two approaches to enable such a rethinking for efficient graph analytics on SSDs. First, this thesis makes a case for making SSDs semantically aware of the structure of graph data to improve their access efficiency. This thesis presents the design and implementation of a graph semantics aware SSD, called GraphSSD. GraphSSD exploits the unique page access indirection mechanisms used in SSDs to store graph data in a compact form. This storage format reduces unnecessary page reads by enabling direct access to a vertex's metadata through the enhanced page access indirection mechanism. Further, GraphSSD supports efficient graph modifications by enabling incremental updates to the graph. These graph-centric capabilities of the SSD are exposed to the user through a set of basic graph access APIs, which are robust enough to support complex, large-scale graph processing. The second part of the thesis presents MultiLogVC, a novel approach for reducing SSD page accesses while executing out-of-core graph algorithms that use the vertex-centric programming model. This approach is based on the observation that nearly all graph algorithms have a dynamically varying number of active vertices that must be processed in each iteration. To keep storage accesses proportional to the number of active vertices, MultiLogVC proposes a multi-log update mechanism that logs updates separately rather than applying them directly to the graph's edges. The proposed multi-log system maintains a separate log for each vertex interval, thereby enabling each superstep in a vertex-centric programming model to efficiently load and process each sub-log of updates. The last part of the thesis presents a generalized near-storage computing model, for algorithms beyond graphs, that exploits the built-in computing capabilities of SSDs. The proposed computing model, called Summarizer, presents a set of APIs and data access protocols to offload the data access intensive parts of a computation to the embedded storage controller on SSDs. Through these three innovations, this thesis makes a strong case for how to design and deploy next generation storage systems for graph analytics and beyond.

Chapter 1
Introduction

Graphs have been used to represent the relationship between different entities and are the representation of choice in diverse domains, such as web page ranking, social networks, drug interactions, and communicable disease spreading [6, 11, 13, 76, 111]. Due to the sheer size of graphs in these important domains (billions of vertices with tens of billions of edges, or even more), graph processing is a data-intensive task. Acceleration of graph processing can speed up a plethora of scientific discoveries and enable significant societal benefits.

It is well known that accessing the necessary data on which to perform computations is a more significant bottleneck than the computing demand itself in graph applications [3, 20, 33, 80]. One way to improve the access latency is to make the graph resident in main memory. But, as the size of the graphs grows, one has to contend with increasing the size of the main memory. Increasing the amount of DRAM is expensive in terms of both energy and cost. Large graphs can occupy terabytes (TBs) of space, and a single TB of non-ECC DDR4 memory costs nearly $5,000 (about $40 per 8GB DIMM) in 2019. Higher density DRAM DIMMs with ECC protection cost nearly an order of magnitude more.
Another way to handle large-scale graph processing is by distributing the computation. While distributed systems are readily available in the form of cloud systems, large-scale distributed graph processing suffers from several challenges. As vertices and edges dictate graph computations, it is difficult to obtain a balanced partition of computation across a distributed set of computing nodes. As the computation moves across various edges in the graph, this translates into significant communication hurdles across the distributed set of nodes. Many graph algorithms are iterative and have a low computation to data access ratio, which further exacerbates the communication overhead. Moreover, distributed systems in general have overheads in cluster management and fault tolerance, and suffer from unpredictable performance [37].

Given these constraints, we explore an alternative approach of processing near storage, where the graph data naturally resides. In particular, this thesis relies on solid state drives (SSDs) as the data storage medium. SSDs, with lower latency and higher bandwidth when compared to hard drives, along with the computing infrastructure embedded in SSDs, have the potential to accelerate large-scale graph applications. However, to exploit the full potential of SSDs in graph applications there is a need to rethink how graphs are stored on SSDs, as well as design new protocols for how SSDs access and manipulate graph data.

To this end, this thesis first presents a semantically aware storage architecture that performs graph analytics near storage. This approach focuses on embedding the knowledge of graph structure within the solid-state drive (SSD) controllers. After embedding the knowledge of graph structure within the SSD, the thesis then designs programming abstractions to exploit the semantic knowledge at the application layer. These innovations enable us to perform graph analytics directly on SSDs, overcoming the cost barriers of DRAM-based graph processing. Given that SSDs are orders of magnitude denser than DRAM, much larger graphs can be processed on storage without needing to distribute graph analytics across DRAM spread over multiple compute nodes. Based on the knowledge gained from semantic storage computing, in the last part of the thesis we generalize the near storage computing paradigm to execute any arbitrary general-purpose computing function on the storage controller to reduce unnecessary data movement between the SSD and the CPU.

We start with a basic overview of current-day SSD architectures, followed by graph analytics background, to better place the thesis contribution in context.

1.1 SSD Overview

Over the past decade, the cost of solid-state drives (SSDs) has fallen dramatically. NAND flash SSDs cost about $100 per 1TB as of late 2019. With the advent of the non-volatile memory express (NVMe) interface, SSDs can offer dramatic improvements in bandwidth and enable tighter integration of compute with storage. A 970 EVO Plus NVMe SSD, for instance, is stated to provide 3,500 MB/s sequential read speed and 600K random read IOPS (I/O operations per second) [92]. Compared to hard disks, SSDs provide 100 times lower latency and 25 times higher bandwidth [49, 67]. The advent of affordable SSD-based storage provides new opportunities to improve the performance of large-scale graph applications.
[Figure 1.1: The architecture of a modern SSD. The SSD controller integrates embedded cores, a DMA engine, a DRAM controller, ECC/accelerator blocks, a host I/O interface, and multiple flash interfaces over an on-chip network, and connects to SSD DRAM and to NAND flash packages containing multiple flash dies.]

This section provides a brief overview of the SSD architecture and the NVMe protocol that the host uses to communicate with the SSD.

1.1.1 Architecture of modern data center SSDs

Figure 1.1 depicts the hardware architecture of a modern SSD supporting the NVMe protocol via the PCI Express (PCIe) bus. Most SSDs use NAND flash memory packages as the primary non-volatile storage elements. Each SSD may contain dozens of flash chips, which are organized into multiple channels. Each flash chip can be accessed in parallel, and all chips that share a single channel may be multiplexed on that channel. Furthermore, multiple channels can be accessed in parallel. Both chip- and channel-level parallelism provide significant internal bandwidth in SSDs. Data center SSDs can provide massive internal parallelism with a multi-channel topology of flash memory packages and die-stacked fabrication per package [15]. For instance, a commercial NVMe SSD in 2018 can support 32 channels of MLC NAND flash memory packages [74] and achieve up to 4.5 GB/s total internal bandwidth [73]. Recent advances in 3D NAND flash technology achieve even higher data bandwidth per package, and are thus capable of providing much higher internal bandwidth [73].

Increasing the internal bandwidth beyond the current level is not very advantageous in SSD platforms, since the external bandwidth of even high-end NVMe SSDs currently saturates at less than 4 GB/s due to the bandwidth limitation of the PCIe lanes that connect the SSD to the host computing fabric. Rather than increasing the internal bandwidth, SSD designs use fewer packages of 3D NAND to match the internal bandwidth to the external bandwidth limitations. To summarize, we note that although increasing the internal bandwidth of an SSD can be achieved by equipping multiple flash memory channels and an advanced fabrication process, extending the external bandwidth between host CPUs and SSDs would require additional PCIe lanes from the processor. In fact, the recent move from SATA SSDs to NVMe SSDs in data centers was triggered by the fact that an SSD's internal parallelism far exceeds the maximum bandwidth supported by SATA, even though the NVMe interface is more expensive than SATA [84].

To effectively manage the channel parallelism and internal bandwidth, modern SSDs integrate embedded multi-core processors as SSD controllers. These processors handle I/O request scheduling, data mapping through the flash translation layer (FTL), wear-leveling, and garbage collection. The controllers connect to flash memory chips through channels and issue flash memory commands to perform I/O operations in each channel in parallel. SSDs also provision a DRAM controller to interface with DRAM, which acts as temporary storage for flash data and also holds the controller's data structures. In addition to the embedded cores, each SSD may also contain several hardware accelerators to provide efficient error correction code (ECC) processing or data encryption. The NVMe SSD attaches to the host computer system's PCIe interconnect through a standard PCIe slot.
The NVMe interface fetches I/O commands such as read and write operations from the host CPU and performs direct memory access (DMA) operations to move data from the SSD to the host. For instance, the host CPU may specify the memory address into which the data must be placed after reading the data from the SSD. The NVMe interface then enables a DMA transfer from the SSD controller to the host memory. Current generation NVMe SSDs support dozens of host requests to be queued within the SSD controller's command queues. The controller may schedule these requests to maximally utilize the available chip and channel level parallelism.

The write behavior of flash memory is significantly different from that of magnetic memory technologies. Every page in flash memory becomes immutable after being written once. The page cannot be updated again without the block containing it being fully erased. One block can contain between 64 and 512 pages. Each block can only be erased a limited number of times during its lifetime, and block erase operations are significantly slower than page reads and writes.

To address these limitations and provide a longer lifetime, SSDs implement the Flash Translation Layer (FTL) in the SSD controller. The SSD uses general-purpose embedded processors to run the firmware necessary for FTL operations. The FTL firmware maps the logical block address (LBA) requested by host applications to the physical page address (PPA) in the flash memory chips. During a page modification, SSDs write the updated page contents to a new empty physical page rather than erase-and-write an entire block. Consequently, the logical block address (LBA) of a flash page is mapped to a new physical page address (PPA) in the flash memory space whenever the page data is updated (and the old page is invalidated). Application developers can simply use the LBA and need not be aware of the change in the physical page address. The FTL automatically manages this LBA-to-PPA mapping table. The FTL mapping tables are cached in the DRAM on the SSD for faster access. Furthermore, to guarantee a long lifetime for the entire flash memory in the SSD, a wear-leveling algorithm is used to efficiently map logical pages to physical pages. Wear-leveling algorithms may pick a physical page to minimize page failures and to allow all physical pages to be equally used over the SSD lifetime. In addition, the FTL firmware periodically executes garbage collection (GC) to reclaim space in blocks with invalid pages.

1.1.2 NVMe

NVM Express (NVMe) is a protocol for SSDs attached to the PCIe bus or M.2 interface [7]. NVMe avoids the disk-centric legacy of the SATA and SCSI interfaces and leverages PCIe to provide scalable bandwidth. For example, the 4-lane Generation 3 PCIe used in data center NVMe SSDs supports up to 3.9 GB/sec full-duplex data transfer, while SATA can typically only achieve 600 MB/sec. NVMe also supports more concurrent I/O requests than SATA or SCSI by maintaining a software command queue that may hold up to 64K entries, and its command set includes scatter-gather data transfer operations with out-of-order completion, further improving performance. NVMe is highly scalable and capable of servicing multiple I/O requests in parallel.
This capability makes NVMe a good candidate for modern data processing applications, such as graph applications and databases, where the application needs to pull enormous amounts of data from secondary storage and feed it into parallel computing devices like multi-cores and GPUs. NVMe supports a streamlined yet powerful set of commands that can initiate and complete I/O operations. Each command has a fixed length of 64 bytes containing information including a command identifier, the logical block address, and the length of the requested data. An NVMe command can also contain a list of Physical Region Page (PRP) entries, which enables scatter-gather data transfers between the SSD and other devices. The PRP entries can specify a list of pairs of base address and offset in host memory corresponding to multiple sub-transfers that the device can execute out of order.

1.2 Graph Processing Overview

This section provides a brief overview of the graph computational model, out-of-core graph processing, and graph formats.

1.2.1 Graph Computational model

Many popular graph processing systems use the vertex-centric programming paradigm [8]. In this programming model, the input to a graph computation is a directed graph, G = (V, E). Each vertex in the graph has an id, for instance between 1 and |V|, and a modifiable user-defined value associated with it. For a directed edge e = (u, v), we refer to e as an out-edge of u and an in-edge of v. Also, for e, we refer to u as the source vertex and v as the target vertex, and e may be associated with a modifiable, user-defined value.

A typical vertex-centric computation consists of input, where the graph is initialized, followed by a sequence of supersteps separated by global synchronization points until the algorithm converges, and finishes with output. Within each superstep, vertices compute in parallel, each executing the same user-defined function that expresses the logic of a given algorithm. A vertex can modify its state or that of its neighboring edges, generate updates to another vertex, or even mutate the topology of the graph (e.g., K-Core [86]). Edges are not first-class citizens in this model, having no associated computation.

In the starting superstep, each vertex is active and scheduled for processing. A vertex can deactivate itself in a superstep, and it will not be scheduled for processing in later supersteps unless it is reactivated. If a deactivated vertex receives an update, then it is reactivated again. When there are no more active vertices or updates in transit, the algorithm is said to have converged.

Depending on when the updates are visible to the target vertices, the computational model can be either synchronous or asynchronous. In the synchronous computational model, the updates generated for a target vertex are available to it in the next superstep. Graph systems such as Pregel [66] and Apache Giraph [9] use this approach. In an asynchronous computational model, an update to a target vertex is visible to that vertex in the current superstep. So if the target vertex is scheduled after the source vertex in a superstep, then the current superstep's update is available to the target vertex. GraphChi [54] and GraphLab [61] use this approach. An asynchronous computational model is shown to be useful for accelerating the convergence of some numerical algorithms [54].
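To make the vertex-centric model above concrete, the following minimal sketch shows one possible shape of the per-vertex user function and its activation behavior under the synchronous model. The type and function names (Update, compute_vertex) are illustrative assumptions only and do not correspond to the interface of Pregel, Giraph, or any framework discussed in this thesis.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One update (message) emitted along an out-edge. Under the synchronous model,
// updates produced in superstep s are delivered to their targets in superstep s+1.
struct Update { uint32_t target; double value; };

// User-defined per-vertex logic for a shortest-path-style algorithm: combine the
// updates received in the previous superstep, modify the vertex value, and emit
// updates to the out-neighbors. Returning false deactivates the vertex; it stays
// inactive until a new update arrives for it.
// (First-superstep initialization of the source vertex is omitted in this sketch.)
bool compute_vertex(uint32_t v, std::vector<double>& value,
                    const std::vector<Update>& incoming,
                    const std::vector<uint32_t>& out_neighbors,
                    std::vector<Update>& outgoing) {
    double best = value[v];
    for (const Update& u : incoming) best = std::min(best, u.value);
    if (best >= value[v]) return false;           // no improvement: deactivate
    value[v] = best;
    for (uint32_t n : out_neighbors)              // propagate along out-edges
        outgoing.push_back({n, best + 1.0});      // unit edge weights assumed
    return true;                                  // remain active next superstep
}
```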
However, as the separation between when an update is generated and when it is consumed is not clear in the asynchronous computation model, it may be harder to reason about the convergence and correctness properties.

Graph algorithms can also be broadly classified into two categories, based on how the updates are handled. A certain class of graph algorithms exhibits the associative and commutative property: updates to a target vertex can be combined into a single value, and they can be combined in any order before processing the vertex. For example, in the pagerank algorithm, all the delta updates from the source vertices can be added into a single value. Algorithms such as pagerank, BFS, and single-source shortest path fall in this category. There are many other graph algorithms that require all the update messages to a target vertex to be preserved before they are processed at that vertex. For example, in the frequent label propagation algorithm for community detection [87], a target vertex has to identify the most frequently occurring label among its source vertices, so when processing a target vertex, one may need access to all the label updates from the source vertices. Algorithms such as community detection [87], graph coloring [30], and maximal independent set [66] fall in this category.

1.2.2 Graph formats

We will now describe the popular graph formats used in graph processing and discuss several challenges when using these formats for processing. Note that some of these graph formats have been designed specifically for use within an out-of-core graph processing framework. Such frameworks must deal with graph sizes that are orders of magnitude larger than the main memory size, hence necessitating the design of new graph storage formats.

GraphChi graph format: GraphChi [54] is an out-of-core vertex-centric programming system. The graph format used by GraphChi is inspired by hard disk drive limitations, where random access to sectors is significantly slower than sequential access. GraphChi partitions the graph into several vertex intervals, and stores all the incoming edges to a vertex interval as a shard. Figure 1.2b shows the shard structure for an illustrative graph shown at the top of Figure 1.2a. For instance, shard1 stores all the incoming edges of vertex interval V1, shard2 stores V2's incoming edges, and shard3 stores the incoming edges of all the vertices in the interval V3-V6.

[Figure 1.2: Graph storage formats. (a) CSR format representation for the example graph: val = (4, 8, 4, 3, 5, 3, 2, 1), colIdx = (2, 1, 2, 1, 2, 3, 4, 5), rowPtr = (1, 2, 2, 4, 4, 4, 9). (b) GraphChi shard structure for the example graph: Shard1 holds V1's in-edges, Shard2 holds V2's in-edges, and Shard3 holds the in-edges of V3-V6.]

While incoming edges are closely packed in a shard, the out-going edges of a vertex are dispersed across other shards. In this example, the out-going edges of V6 are dispersed across shard1, shard2, and shard3. Another unique property of the shard organization is that each shard stores all its in-edges sorted by source vertex. GraphChi relies on this shard organization to process vertices in intervals.
It first loads into memory a shard corresponding to one vertex interval, as well as all the out-going edges of those vertices, which may be stored across multiple shards. As all the out-going edges for a vertex interval are at contiguous locations within each of the other shards, these out-going edges can be accessed with mostly sequential page reads from disk drives. Updates generated while processing a vertex interval are asynchronously and directly passed to the target vertices through the out-going edges, which are already fetched from other shards into memory. Once the processing for a vertex interval in a superstep is finished, its corresponding shard and its out-going edges in other shards are written back to the disk. Then processing of a new vertex interval starts.

Using the above approach, GraphChi primarily relies on sequential accesses to disk data and minimizes random accesses. However, as graph supersteps progress, many vertices may become inactive; a vertex becomes inactive if it does not receive any messages on its in-edges during a prior superstep. The in-edges to a vertex are stored in a shard, and they can come from any source vertex. Since the in-edges in a shard are sorted by source vertex id, even if a single vertex is active within a vertex interval, the entire shard must be loaded, because the in-edges for that vertex may be dispersed throughout the shard. So to access the in-edges of active vertices within a shard, one has to load the entire shard, even though there may be few active vertices in that shard. For instance, if any of V3, V4, V5, or V6 is active, the entire shard3 must be loaded. Loading a shard may be avoided only if none of the vertices in the associated vertex interval are active. However, in real-world graphs, the vertex intervals typically span tens of thousands of vertices, and during each superstep, the probability of at least one vertex being active in a given interval is very high. As a result, GraphChi, in practice, ends up loading all the shards in every superstep, independent of the number of active vertices in that superstep.

1.2.3 CSR format in the era of SSDs

An alternative to the GraphChi format is a compressed graph representation, such as the compressed sparse row (CSR) format. Large graphs tend to be sparsely connected. The CSR format takes the adjacency matrix representation of a graph and compresses it using three vectors [52, 69, 72]. The value vector, val, stores all the non-zero values from each row sequentially. The column index vector, colIdx, stores the column index of each element in the val vector. The row pointer vector, rowPtr, stores the starting index of each row in the val vector.

The CSR representation for the example graph is shown in Figure 1.2a. The edge weights of the graph are stored in the val vector, and the adjacent outgoing vertices are stored in the colIdx vector. To access the adjacent outgoing vertices associated with a vertex in the CSR graph storage format, we first access the rowPtr vector to get the starting index in the colIdx vector, where the adjacent vertices of the vertex are stored in a contiguous fashion. To get an edge weight, we need to get the adjacent vertices for the source vertex and find the index of the destination vertex in the colIdx vector; with that index, we can access the corresponding location in the val vector, where the edge weights are stored.
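To make these two access patterns concrete, the short program below performs both lookups on the example vectors from Figure 1.2a (kept 1-indexed as in the figure, with index 0 as padding). It is only a sketch of the in-memory access pattern, not of how GraphSSD lays these vectors out on flash.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Example graph from Figure 1.2a.
    std::vector<int> rowPtr = {0, 1, 2, 2, 4, 4, 4, 9};     // rowPtr[v] .. rowPtr[v+1]-1
    std::vector<int> colIdx = {0, 2, 1, 2, 1, 2, 3, 4, 5};  // out-neighbor ids
    std::vector<int> val    = {0, 4, 8, 4, 3, 5, 3, 2, 1};  // edge weights

    int src = 6;
    // 1) Adjacency lookup: one rowPtr access yields a contiguous colIdx range.
    printf("out-neighbors of V%d:", src);
    for (int i = rowPtr[src]; i < rowPtr[src + 1]; ++i)
        printf(" V%d", colIdx[i]);
    printf("\n");

    // 2) Edge-weight lookup: scan the same range for the destination vertex,
    //    then use that index into val.
    int dst = 3, weight = -1;                                // -1 means "edge not present"
    for (int i = rowPtr[src]; i < rowPtr[src + 1]; ++i)
        if (colIdx[i] == dst) { weight = val[i]; break; }
    printf("weight(V%d -> V%d) = %d\n", src, dst, weight);
    return 0;
}
```

For the example graph this prints the five out-neighbors of V6 and the weight 3 for the edge V6 to V3, matching the matrix in Figure 1.2a.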
The CSR format has the desirable property that one may load just the active vertices' information, which is more efficient than loading an entire shard. Furthermore, as all the outgoing edges connected to a vertex are stored in a contiguous location in the colIdx vector, accessing the adjacency information for the active vertices in a CSR format leads to sequential accesses. In this thesis, we use the CSR representation of a graph for performing graph analytics because it provides more efficient mechanisms to access a subset of active vertices, unlike a shard-based graph structure that requires the entire shard to be loaded. We will quantitatively demonstrate the benefits of our graph format choice throughout the thesis.

1.2.4 Challenges for graph processing with a CSR format

While the CSR format looks appealing, it suffers from one fundamental challenge. One can maintain either an in-edge list or an out-edge list in the colIdx vector, but not both (due to coherency issues with having the same information in two different vectors). Consider the case where the adjacency list stores only in-edges; then, during a superstep, all the updates on the out-edges generate many random accesses to the adjacency lists to extract the out-edge information.

1.2.5 A case for graph semantic awareness in SSDs

Traditionally, SSD controllers treat the data that they store as a collection of blocks and pages without any semantic knowledge of the structure of the stored data. The host processor provides a logical block address to the flash translation layer (FTL), and the FTL converts the logical block address to a physical block address to access the flash memory. For instance, in graph applications, the SSD controller is unaware of the fact that the underlying data is a graph representation. This lack of semantic knowledge leads to a significant waste of storage bandwidth and DRAM cache space, as it leads to fetching and caching unnecessary data, as we demonstrate in later chapters of this thesis. SSDs also suffer from a write amplification problem, where even a small update may lead to a whole-page read-modify-write cycle. Write amplification wastes power and reduces reliability, as SSDs have limited write endurance. SSDs that are semantically aware of the graph structure and representation of the stored data can become essential building blocks for efficient graph processing architectures. The application layer can query the SSD using graph-oriented access requests, such as finding all the neighbors of a given vertex. Semantic awareness also reduces write amplification through fine-grain vertex- and edge-oriented data storage. Finally, a semantically aware SSD can also perform better graph caching within its internal DRAM by keeping track of hot vertices and hot edges that are frequently traversed.

1.3 Thesis statement

In this thesis, we make a case for semantically aware storage systems, where SSDs are made aware of the structure of the data that is being stored and processed.
In particular, we expose graph semantics to an SSD controller to create a new graph storage architecture called GraphSSD. We present the design and implementation of the graph semantic aware SSD, GraphSSD, which manages graphs on the SSD platform. For faster accesses, GraphSSD uses a storage organization that matches the structure of the graph. Further, GraphSSD supports efficient graph modifications. It exposes this graph-specific capability to the application developer through a set of basic graph APIs, which are robust enough to support complex, large-scale graph processing.

Second, we propose an efficient vertex-centric out-of-core graph processing framework on SSDs, namely MultiLogVC. We make the observation that nearly all graph algorithms have a dynamically varying number of active vertices that must be processed in each iteration. To keep storage accesses proportional to the number of active vertices, we propose a multi-log update mechanism that logs updates separately rather than applying them directly to the graph's edges. Our proposed multi-log system maintains a separate log for each vertex interval. This separation enables us to efficiently process each vertex interval by just loading the corresponding log. Further, page-granular SSD accesses can cause read amplification, where pages are read unnecessarily for data belonging to inactive vertices. We reduce this read amplification by logging the active vertex data in the current iteration and reading that log efficiently in the next iteration, which avoids reading SSD pages that contain little active-vertex data.

Third, we generalize the GraphSSD framework to enable a programmer to run any arbitrary computation on large blocks of data by bringing computing closer to the data, rather than moving data closer to the compute. We propose our framework Summarizer, which enables the programmer to offload data-intensive computations with limited compute intensity to the storage processor.

1.4 Thesis Overview

In the following sections, we present an overview of each of the major contributions of this thesis.

1.4.1 #1: GraphSSD: Graph semantics aware SSD

Graph analytics play a key role in a number of applications such as social networks, drug discovery, and recommendation systems. Given the large size of graphs that may exceed the capacity of the main memory, application performance is bounded by storage access time. Out-of-core graph processing frameworks try to tackle this storage access bottleneck through techniques such as graph sharding and sub-graph partitioning. Even with these techniques, the need to access data across different graph shards or sub-graphs causes storage systems to become a significant performance hurdle. In the first part of this thesis, we propose a graph semantic aware solid-state drive (SSD) framework, called GraphSSD, which is a full system solution for storing, accessing, and performing graph analytics on SSDs. Rather than treating storage as a collection of blocks, GraphSSD considers graph structure while deciding on graph layout, access, and update mechanisms. GraphSSD replaces the conventional logical to physical page mapping mechanism in an SSD with a novel vertex-to-page mapping scheme and exploits detailed knowledge of the flash properties to minimize page accesses. GraphSSD also supports efficient graph updates (vertex and edge modifications) by minimizing unnecessary page movement overheads.

GraphSSD provides a simple programming interface that enables application developers to access graphs as native data in their applications, thereby simplifying code development. It also augments the NVMe (non-volatile memory express) interface with a minimal set of changes to map the graph access APIs to appropriate storage access mechanisms.
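As an illustration of how an application might consume such a graph-native interface, the sketch below writes breadth-first search directly against a hypothetical host-side wrapper for the neighbor-fetch command (GetAdjacencyList, listed later in Table 2.1). The signature shown is an assumption for illustration and is not the exact GraphSSD library API.

```cpp
#include <cstdint>
#include <queue>
#include <unordered_set>
#include <vector>

// Hypothetical host-side wrapper for the GraphSSD GetAdjacencyList command;
// the real library's signature may differ.
std::vector<uint64_t> GetAdjacencyList(uint64_t vertex_id);

// Breadth-first search written against the graph API: the application issues
// graph-level requests and never parses pages or logical block addresses.
std::vector<uint64_t> bfs(uint64_t root) {
    std::vector<uint64_t> order;
    std::unordered_set<uint64_t> visited{root};
    std::queue<uint64_t> frontier;
    frontier.push(root);
    while (!frontier.empty()) {
        uint64_t v = frontier.front();
        frontier.pop();
        order.push_back(v);
        for (uint64_t n : GetAdjacencyList(v))   // one graph command per vertex
            if (visited.insert(n).second)
                frontier.push(n);
    }
    return order;
}
```

The point of the sketch is that the vertex-to-page mapping and page fetches happen inside the SSD, behind the graph command interface described in Chapter 2.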
Our evaluation results show that the GraphSSD framework improves the performance by up to 1.85x for the basic graph data fetch functions and on average 1.40x, 1.42x, 1.60x, 1.56x, and 1.29x for the widely used breadth-first search, connected components, random-walk, maximal independent set, and page rank applications, respectively.

1.4.2 #2: MultiLogVC: Efficient out-of-core graph processing framework

The vertex-centric programming model described in subsection 1.2.1 is the popular programming model for large-scale graph analytics. In the vertex-centric programming model, many vertices may become inactive over time as supersteps (iterations) progress. For inactive vertices, there is no need to load vertex-related data such as their neighbor lists and vertex values. We demonstrate that, over a wide range of applications, the fraction of active vertices and active edges shrinks dramatically as supersteps progress. However, at the granularity of a shard (in the GraphChi graph format described in subsection 1.2.2), even a few active vertices lead to loading many shards, since the active vertices are spread across the shards.

In the second part of this thesis, we build on our compressed sparse row (CSR) based graph storage, which is more amenable to selectively loading only a few active vertices in each iteration. But CSR-based graph processing suffers from random update propagation to many target vertices. To solve this challenge, we propose a multi-log update mechanism that logs updates separately, rather than applying them directly to the graph's edges. Our proposed multi-log system maintains a separate log for each vertex interval. This separation enables us to efficiently process each vertex interval by just loading the corresponding log. Further, when accessing SSD pages that contain only a small amount of active vertex data, we reduce the read amplification caused by page-granular SSD accesses by logging the active vertex data in the current iteration and efficiently reading that log in the next iteration. Compared to the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the MultiLogVC framework improves the performance by up to 17.84x, 1.19x, 1.65x, 1.38x, 3.15x, and 6.00x for the widely used breadth-first search, pagerank, community detection, graph coloring, maximal independent set, and random-walk applications, respectively.

1.4.3 #3: Summarizer: Generalizing near storage processing

Processing large volumes of data is the backbone of many application domains beyond graph processing. The cost of transferring data from storage to compute nodes starts to dominate the overall application performance in many domains, such as databases. Applications can spend more than half of the execution time bringing data from storage to the CPU [105].

In the last part of the thesis, we generalize our GraphSSD approach to enable near storage computing for a broad range of applications; we call this approach Summarizer. Summarizer is an architecture and computing model that allows applications to make use of wimpy SSD processors, which may have highly dynamic utilization behavior, for filtering and summarizing data stored in the SSD before transferring the data to the host. Summarizer reduces the amount of data moved to the host processor and also allows the host processor to compute on the filtered/summarized results, thereby improving the overall system performance.
A prototype Summarizer system is implemented on a custom-built flash storage system that resembles existing SSD architectures but also enables fine-grain computational offloading between the storage processor and the host. Using this prototype, we demonstrate the benefits of collaborative computing between the host and the embedded storage processor on the SSD board. Summarizer dynamically monitors the amount of work at the SSD processor and selects the appropriate work division strategy between the host processor and the SSD processor. Summarizer's work division approach quantifies the potential of using both the host processor and SSD processor in tandem for improved performance. We evaluate Summarizer using applications from two important domains, namely data analytics and data integration. We perform a design space exploration of our proposed approach by varying the bandwidth and computation capabilities of the SSD processor. We evaluate static and dynamic approaches for dividing the work between the host and SSD processor, and show that our design may improve the performance by up to 20% when compared to processing at the host processor only, and by up to 6x when compared to processing at the SSD processor only.

1.5 Dissertation Organization

The rest of the thesis is organized as follows. One way to improve SSD access performance in graph processing is by making the SSD aware of application semantics. Chapter 2 proposes the graph semantics aware SSD, GraphSSD. We present the design and implementation of GraphSSD, which manages graphs on the SSD platform. Chapter 3 proposes an efficient vertex-centric out-of-core graph processing framework for flash-storage-based systems, namely MultiLogVC. MultiLogVC relies on logging updates, rather than directly updating the graph, thereby overcoming the bottleneck of using CSR formatted graphs for graph processing. Chapter 4 improves storage access performance by helping the programmer offload the appropriate amount of work onto the storage processor and reduce the amount of data transferred to the host system. Chapter 5 concludes the thesis with retrospective thoughts.

Chapter 2
GraphSSD: Graph Semantics Aware SSD

2.1 Chapter Overview

Graph analytics are at the heart of a broad range of applications such as social network analysis, drug discovery, page ranking, transportation systems, and recommendation systems. The size of the graphs in many of these domains exceeds the size of main memory seen in commodity computing systems. There is a need to consider efficient storage-centric graph processing, at least in a subset of application scenarios where the size of the graph far exceeds the size of the main memory. It is well known that data input/output (I/O) time to access large graphs consumes a significant fraction of the total execution time compared to the CPU and memory access time [2, 33, 81].

On the storage front, the cost of solid-state drives (SSDs) has fallen dramatically. NAND flash SSDs cost about $100 per 1TB as of late 2019, and the price is expected to fall further. With the advent of the non-volatile memory express (NVMe) [38] interface, SSDs can offer significant improvements in bandwidth and enable tighter integration of computing with storage. Furthermore, SSDs are equipped with reasonably capable compute fabric to handle flash management tasks. The advent of such affordable SSDs provides new opportunities to improve the performance of graph analytics by making storage systems semantically aware of the graph data being stored.
In particular, we make a case for treating graphs as a native format supported on storage, rather than treating graphs as a collection of pages that are accessed using a standard block I/O interface.

This chapter presents the design and implementation of a graph semantic aware SSD (GraphSSD) to manage graphs on an SSD platform. GraphSSD supports the compressed sparse row (CSR) format for graph layout and further customizes this format to enable fast mapping of a vertex id to the physical page location that contains the adjacency information of that vertex. GraphSSD provides a set of programming APIs to application developers to access graph vertex and edge information, similar to recent graph frameworks such as GraphChi [54] and Pregel [66]. But the novelty of this work is that the SSD controller is made aware of the graph data structures stored on the SSD. Thus the controller can automatically translate the graph access APIs into a set of low-level physical page accesses to fetch the requested data. These APIs accept basic graph-related queries, such as fetching the adjacent vertices of a target vertex or fetching the edge weights of connected edges. GraphSSD APIs are robust enough to enable developers to write complex graph analytics on top of GraphSSD. GraphSSD provides solutions to the following challenges:

1. Graph as native objects: GraphSSD allows the embedded controller in SSDs to treat graphs as native storage objects, and provides a set of APIs that can be used to access the graph objects on storage.

2. NAND flash aware graph layout: NAND flash memory can only be accessed in fixed-size pages, and pages can be accessed concurrently only across the parallel units. The widely varying number of adjacent neighbors across different vertices requires a graph storage mechanism that accommodates this diversity within the constraints of the flash memories. GraphSSD tackles this challenge by relying on a compressed sparse row (CSR) representation for a graph, and embedding metadata in NAND pages to store edges from one or more vertices.

3. Efficient indexing mechanism: GraphSSD presents an innovative graph translation layer (GTL), which translates a vertex id to a physical page address on the flash memory media directly, thereby reducing unnecessary indirect page accesses to reach the metadata associated with a given vertex.

4. Indexing compaction: GraphSSD reduces the GTL mapping overhead by co-locating multiple vertices with few edges in the same physical page.

5. Support for graph updates: GraphSSD relies on Delta graphs and Delta merging mechanisms that allow GraphSSD to modify only a small subset of pages containing the updated sub-graph instead of re-shuffling the entire graph.

6. We implement the GraphSSD framework on an industrial-strength SSD development platform to show the performance improvement of GraphSSD over a conventional graph storage architecture. Our evaluation results show that the GraphSSD framework improves the performance by up to 1.85x for the basic graph data fetch functions and on average 1.40x, 1.42x, 1.60x, 1.56x, and 1.29x for the widely used breadth-first search, connected components, random-walk, maximal independent set, and page rank applications, respectively.

The remainder of this chapter is organized as follows: Section 2.2 provides a brief background for GraphSSD. The detailed architecture and functions of GraphSSD are described in Section 2.3.
The implementation methodology, evaluation platform, and experimental results are presented in Sections 2.4, 2.5, and 2.6, respectively. Related work is discussed in Section 2.7, and we conclude this chapter in Section 2.8.

2.2 Background

2.2.1 Out-of-core graph processing

In many domains, graphs are large enough to require out-of-core graph processing. Namely, graphs are processed in smaller chunks, where each chunk is read from storage into DRAM. Large graphs also tend to be sparsely connected; hence, to reduce the I/O bottleneck, large graphs are stored in a compressed format, such as the compressed sparse row (CSR) format. Alternative approaches have also been proposed to access graphs in smaller chunks, such as shards [54]. Irrespective of the choice of graph storage format, all prior techniques treat the storage as a block device. In this work, the storage controller understands the semantics of graphs and the system provides a set of APIs to program the storage controller to access graph data. In this context we use the CSR format in our implementation. The CSR format is widely used for general purpose large-scale graph processing [77, 123] and has been shown to be efficient for out-of-core graph processing systems [44, 55, 64, 77]. A description of the CSR format can be found in subsection 1.2.3.

2.2.2 Graph updates

Many of the existing graph frameworks assume static graphs, but there is a need to support dynamic graphs whose edge and vertex related information may be updated [77]. GraphSSD supports efficient graph updates by building on delta graphs introduced in LLAMA [64]. GraphSSD maintains graph updates as a series of snapshots. Initially, all the vertices store their adjacency lists as one contiguous vector. When graph updates are performed, multiple updates are grouped together into a single snapshot. At regular intervals a snapshot is written back to storage and a new snapshot is created for the incoming updates. Multiple snapshots are chained together alongside the initial adjacency list as a linked list of pointers. To retrieve the entire adjacency list of a vertex, one has to traverse this chain of pointers from the newest snapshot to the oldest. This pointer chasing may increase the latency of accessing adjacent vertices. To reduce this inefficiency GraphSSD also merges the snapshots at regular intervals to create a single contiguous adjacency list for each vertex. When using a CSR based graph representation, merging updates requires reading the entire graph, merging the updates, and writing the result to a new location [64, 77]. To have a consistent view, one can read a vertex's adjacency list from the new location only after the entire merge operation has been performed. We describe how GraphSSD handles these potential inefficiencies in later parts of this chapter.

[Figure 2.1: GraphSSD architecture overview]

2.3 Architecture

Figure 2.1 provides an overview of the GraphSSD architecture. GraphSSD has the following components: a graph translation layer (GTL) and a graph command decoder on the SSD, and a graph caching layer and a graph update logger on the host side.
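Before describing each component, the following sketch illustrates the kind of host-to-SSD interaction that ties them together: graph queries are carried in otherwise unused bytes of standard NVMe read/write commands, as Section 2.3.1 explains. This is only an illustration; the byte offsets, the opcode values, and the PackGraphQuery helper below are assumptions made for exposition, not the actual GraphSSD driver interface.

#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical layout: GraphSSD reuses spare bytes of a standard 64-byte
// NVMe command to carry a graph opcode and the query operands.
enum GraphOpcode : uint8_t { GET_ADJACENCY_LIST = 0x01, GET_EDGE_WEIGHT = 0x02 };

using NvmeCommand = std::array<uint8_t, 64>;

// Pack a graph query into the (assumed) unused bytes of an NVMe command.
NvmeCommand PackGraphQuery(GraphOpcode op, uint64_t v1, uint64_t v2 = 0) {
  NvmeCommand cmd{};                       // remaining fields are left to the NVMe driver
  cmd[40] = op;                            // assumed offset of the graph opcode
  std::memcpy(&cmd[41], &v1, sizeof(v1));  // first vertex id
  std::memcpy(&cmd[49], &v2, sizeof(v2));  // second vertex id (edge queries only)
  return cmd;
}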
2.3.1 Graph command decoder

The capabilities of GraphSSD are exposed to the application programmer through a set of graph access APIs. These APIs are implemented within the GraphSSD library that can be linked into any graph application. Each of these APIs in turn activates the SSD microcontroller to perform some of the storage access tasks. These activation commands are transferred as NVMe commands issued from the host to the SSD. The current NVMe protocol has several unused bytes in the read/write command encoding, which can be easily adapted to implement the GraphSSD APIs if it is deemed necessary to minimize protocol changes. All NVMe commands from the host are first intercepted by the GraphSSD command decoder. The command decoder routes all GraphSSD APIs to the graph command processing path, while regular NVMe commands (page read and write) are sent to the default SSD processing sequence. Using this approach GraphSSD can co-exist with a traditional block based storage interface within a single SSD.

Table 2.1: GraphSSD APIs
Category | Commands
Graph read commands | GetAdjacencyList, GetEdgeWeight
Graph update commands | AddEdge, AddVertex, DeleteEdge, DeleteVertex, UpdateEdge, UpdateVertex
Graph initialization | GraphInitialize

The list of APIs currently supported by GraphSSD is shown in Table 2.1. All GraphSSD APIs provide a vertex id to start any type of graph access. As such, the first step in GraphSSD processing is to access the vertex id and its associated edges in the graph from the flash storage. To enable fast access to the physical pages containing the vertex related data, we propose a novel graph translation layer as a substitute for the traditional FTL used in SSDs.

2.3.2 Graph translation layer

In order to understand the workflow of GraphSSD it is important to understand how graphs are laid out in storage by our design. As described earlier, we assume that graphs are represented in CSR format using three vectors, named rowPtr, colIdx, and val. Each entry in the rowPtr vector is essentially the starting index into the colIdx (and val) vectors where the neighbors of that vertex are located. GraphSSD preserves this indexing mechanism while laying out these vectors in the flash pages. The colIdx and val vectors are proportional to the number of edges in the graph and hence they are significantly larger than the rowPtr vector. GraphSSD stores only the colIdx and val vectors in the NAND flash pages, and uses the rowPtr vector as an indexing table. GraphSSD provides a translation layer for this indexing purpose, called the graph translation layer (GTL). GTL replaces the more traditional LBA-to-PPN (logical block address to physical page number) page mapping used in commodity SSDs. GraphSSD maps a given vertex id (Vi) to the physical NAND page number where the neighbor vertices (colIdx values) are stored.

[Figure 2.2: Graph translation table — each entry maps a vertex id to a PPN along with dirty, extension, and valid flags]

GTL architecture: Figure 2.2 shows the structure of the graph translation table (GTT). Each entry in the GTT includes the mapping from a vertex id to the physical page number (PPN), and a tuple of status flags (dirty, extension, and valid). We will later describe how the GTT status flags are utilized. Conceptually, each vertex maps through the GTT to the physical pages storing all its neighbors. However, most real-world graphs have sparse connectivity. Hence, many vertices have only a few edges.
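For exposition, a GTT entry as just described can be pictured as a small record. This is a minimal sketch; the field names, widths, and packing are illustrative assumptions rather than the firmware's actual layout.

#include <cstdint>

// Illustrative GTT entry (cf. Figure 2.2); widths and packing are assumptions.
struct GttEntry {
  uint64_t vertex_id;      // vertex id indexing this entry (the GTT is kept sorted on this field)
  uint32_t ppn;            // physical NAND page number (or pointer to extended entries)
  uint8_t  dirty     : 1;  // set when delta updates exist for vertices mapped by this entry
  uint8_t  extension : 1;  // set when the entry points to extended GTT entries instead of a PPN
  uint8_t  valid     : 1;  // entry holds a live mapping
};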
Because most vertices have only a few edges, it is possible to co-locate the neighbors of multiple vertices in a single physical page. In this scenario, it is wasteful to allocate one GTT entry per vertex. To reduce this wastage the GTT stores only one vertex id per physical page: the smallest vertex id among all the vertices whose neighbors are stored in that page. Namely, for each vertex Vi there is a GTT entry indexed with a vertex id Vj that is smaller than or equal to Vi, while the next entry in the GTT has a vertex id Vk that is greater than Vi. To make this search process efficient, GTL stores all the vertex ids in the GTT in sorted order. For instance, in Figure 2.2 the first GTT entry stores the vertex to physical page mapping for a starting vertex id Vi and the second GTT entry stores the mapping for vertex id Vi+N. Hence, any vertex id between Vi and Vi+N-1 will find all its associated edges in the physical page mapped by the first GTT entry. If the graph is directed, GraphSSD stores the incoming and outgoing edge information in separate pages, and keeps a separate GTT for each of the incoming and outgoing edge information.

Since each physical page may store the neighbors of multiple vertices, we need to identify the offset of the neighbor list for each vertex id. For this purpose, each physical page includes additional fields to store layout information as shown in Figure 2.3.

[Figure 2.3: Page layout — neighbor lists are stored from the start of the page; {vertex id, offset} tuples, an {End, offset} terminator, and the vertex count are stored from the end]

We describe the fields in the page starting from the last field and moving toward the front. The last field in the physical page stores the number of vertices whose neighbor lists are stored in that page. Preceding this count there are N+1 <vertex, offset> tuples, corresponding to the N vertices stored in that page. Each tuple stores the vertex id and the starting byte offset within the page where the neighbor list for that vertex is stored. Since the last adjacency list stored in a page may fill the page only partially, a special tuple (labeled {End, Ne} in the figure) is used to indicate the ending offset of the last neighbor list. This ending offset is necessary to mark where the valid data in a page ends. The offset information and metadata described above are stored from the end of the page, and the actual vertex neighbor lists are stored from the start of the page. Storing location pointers along with the adjacent vertices in a page helps us 1) in reducing the size of the GTT, which enables keeping a large chunk of the GTT in DRAM, and 2) in making no extra NAND page accesses to reach the adjacent vertices associated with a vertex id. We now discuss the different graph layout scenarios in the GTT.

1. When the neighbor vertices of a vertex Vi are stored entirely in a page: In this case, all neighbors are stored contiguously in the page and the starting offset of the neighbor list is stored in the location tuple associated with Vi.

2. When the neighbor vertices of a vertex Vi are stored across multiple pages: Vi's neighbors span multiple pages for two reasons. First, Vi may have many neighbors which will not fit in a single page. In this case, at least one page stores only the neighbors of Vi.
That page will store just a single location pointer tuple, and the last field on the page indicates that only a single vertex's neighbors are located on that page. After filling multiple full pages for a long neighbor list, there may be at most one partial page to store the last remaining neighbors. That page may also store the neighbors of other vertices; in this case, the location pointer tuples of all vertices, including Vi, are stored just as in the first case above. GTL handles these dense vertices by storing the Vi to physical page mapping in the GTT for each page that stores the neighbors of Vi. Thus Vi may have more than one GTT entry.

3. There is a third case where the neighbors of Vi may not fit in the existing free space of a page, and hence may span two different pages, even though they would fit within a single page. We explored different options for packing the page but in the end, for simplicity of design, we decided to avoid spanning neighbors across two pages. Hence, if the neighbors do not fit in the leftover space of a page we simply allocate the neighbors to a new page.

[Figure 2.4: An example of GTT and page layout (storing out-going neighbors) for an example graph]

An example of page layout: Figure 2.4 shows an example of the GTT and the corresponding physical page layout. In this example, the GTL entry corresponding to V1 stores P1, indicating that the neighbors of V1 are stored in physical page P1. Since the next entry of the GTL corresponds to vertex V3, it implies that the previous GTL entry also stores the neighbors of V2. Similarly, the second GTL entry shows that the neighbors of V3, V4, and V5 are stored in physical page P2. Finally, the neighbors of V6 span two physical pages, P3 and P4.

The physical page organization is shown on the right half of Figure 2.4. For instance, physical page P2 stores the neighbors of three vertices and hence the last field (labeled No. V) shows the count to be 3. To the left of this field are the tuples that show each vertex and its starting byte offset in the page. The tuple {V3, 1} in physical page P2 shows that vertex V3's neighbors are located starting at byte offset 1 within the page. A custom tuple {End, 3} shows that the last valid byte on this page is byte 2. Hence GraphSSD can extract the neighbors of each vertex by decoding the graph page layout as described. The GTL also shows that V6 has two page entries since it is a dense vertex with many neighbors that span more than one page.

Translation using GTT: Now we discuss how GTL accesses the neighbors of a vertex id Vi using the GTT. For the sake of simplicity, assume that the graph is laid out initially on the NAND pages as described in the example above and no updates have been made. GTL does a binary search on the vertex id column of the GTT; recall that the GTT entries are sorted and hence binary search is efficient. For a vertex Vi, GTL identifies consecutive GTT entries with vertex ids Vj and Vk such that Vj <= Vi < Vk. After that, GTL fetches the page pointed to by Vj's entry, say PPNj. It then does a binary search on the location tuples in the page to match Vi in a tuple. If a match is found, then the offset associated with Vi is used to access the neighbor list. If Vi is a dense vertex that spans multiple pages, then there will be multiple GTT entries for Vi. Hence GTL will access each of these pages to construct the neighbor list.

An example translation using GTT: Here we discuss an example translation for the example graph shown in Figure 2.4. We consider the process of locating V3's neighbors.
First, to identify the pages in which V3's adjacent vertices are stored, GTL does a binary search on the vertex ids in the GTT. As V3 <= V3 < V6, V3's adjacent vertices are stored in the page pointed to by the GTT entry with vertex id V3, i.e., P2. Then page P2 is fetched. The last field in P2 indicates that there are 3 vertices whose neighbors are stored in this page. Then GTL searches the location tuples to identify V3's offset, which is {V3, 1}. This location pointer indicates that the adjacent vertices of V3 can be found in page P2 starting at offset 1. Based on the next tuple's {V4, 3} offset, GTL identifies the size of V3's neighbor list as 2.

2.3.3 Supporting graph updates

In this section we briefly discuss GTT support for operations that modify the graph: namely AddEdge(), which adds an edge between two vertices along with its edge weight, and AddVertex(), which adds a new vertex together with a list of its adjacent vertices and edge weights. Here we describe how each of these operations updates the graph data on the NAND pages and the GTT.

[Figure 2.5: An example of updating the GTT with the extended bit]

AddEdge(Vid1, Vid2): The add edge function adds vertex Vid2 to the neighbor list of Vid1. It also adds the corresponding edge weight for the added edge. As Vid1's neighbors may be stored in a single page or may span multiple pages, we describe the operation of adding an edge for these two cases. Figure 2.6 shows the flowchart for the AddEdge function.

[Figure 2.6: Flow chart of the add edge operation]

When the neighbors of Vid1 are stored in a single page: In this case, Vid2 is added at the end of Vid1's neighbors. The subsequent neighbors of other vertices that are stored after the neighbors of Vid1 are shifted to higher page offsets. The location pointers of these vertices, stored at the end of the page, are also updated to reflect the new location of their neighbors. This shifting of neighbors may cause a page overflow. In this case, when all the vertices no longer fit in the page, the existing page data is split across two pages. The vertices in the existing page are divided between the two pages such that both pages have a roughly equal amount of empty space while maintaining the page structure described before. The reason for leaving some gaps within a page is to allow for future updates of the page without causing additional overflow.

The newly added page must be indexed using the GTT and hence it is necessary to update the GTT.
But if the GTT entry for this page is added in the sorted order required by the GTT organization, then this addition may shift other entries in the GTT. In the worst case, if a new page entry is added at the top of the GTT it may cause a ripple effect of shifting all the GTT entries. To avoid this scenario an extended bit is kept in each GTT entry. When the extended bit is set, the GTT entry does not point to a NAND page number; instead it points to a location where GTT entries for the newly added page and the old page that was split are stored contiguously. Figure 2.5 shows how the GTT is updated with the extended bit when an edge is added between V4 and V2 and page P2 overflows, for the example graph shown in Figure 2.4.

When the neighbors of Vid1 are stored in multiple pages: In this case there are multiple GTT entries with vertex id Vid1. Among them, Vid2 is added to the page pointed to by the GTT entry at the highest index, and after that the update is handled similarly to the single-page case above.

An example edge addition: Here we discuss an example of adding an edge from V4 to V1 for the graph shown in Figure 2.4. First, the page containing V4's neighbor list, P2, is loaded into DRAM. Then the starting offset for V4's neighbors is identified as 3 and V1 is inserted at word 3. After the insertion, as the neighbors and location pointers can no longer fit in one NAND page, they are stored in two NAND pages, P5 and P6. To have roughly equal available space in P5 and P6, the neighbors of V3 are stored in P5, and the neighbors of V4 and V5 are stored in P6. The GTT entry for V3 no longer stores a physical page number; instead it stores the location pointer to where the extended GTT information is stored, and the extended bit of V3's GTT entry is set, as shown in Figure 2.5.

Similar to adding an edge, we also implemented adding a vertex, which inserts the vertex into the GTT and then adds multiple neighbors while maintaining the previously described page layout.

2.3.4 Handling Graph Updates Efficiently With Caching and Delta Graph

The graph update process described above leads to many unnecessary page writes. Since SSDs cannot do in-place updates, it is not possible to simply update a NAND page with new data, even if the update is as simple as adjusting an edge weight. Each page update triggers a read-modify-write sequence for the entire page, which leads to significant write amplification, in the worst case by a factor of 1000x. For instance, a single edge weight update leads to reading the full page into a DRAM buffer, modifying the weight in the page, and then writing the new 16KB page (the GraphSSD page size) to a new location. To reduce the write amplification, GraphSSD implements an optimization that relies on a multi-stage update process. First, all updates are logged in host side DRAM until a sufficient number of updates have accumulated (one page worth of updates in our current implementation) or a timer event is triggered (a default value of 100 milliseconds is used in this work). A host side GraphSSD log manager is implemented to handle the logging functionality.

Host side logger: Certain updates, such as deletions or updating an edge or vertex weight, need to check whether the edge or vertex being modified exists in the graph in the first place.
Hence, the host side logger sends a request to the SSD itself to verify the presence of the vertex or edge. The logger also concurrently launches another thread to check for the edge/vertex information in the DRAM log itself (by walking the log backwards in time), since the edge/vertex being updated may still be resident in the DRAM log from a recent update request that is not yet reflected in the SSD. If such an edge/vertex exists neither in the SSD nor in the DRAM log, the API returns a FALSE condition back to the application. Note that the presence check on the SSD is simply a read operation and does not trigger any page updates.

Delta graphs: When the DRAM log is full or when the timer expires, the host side logger initiates a bulk update sequence. As described earlier, graph insertions may trigger a page overflow, and in the worst case each insertion may trigger multiple page writes. While DRAM buffering on the host side helps with this concern, GraphSSD adopts the concept of a delta graph [64] to further minimize write amplification. We implement delta graphs using two vectors in the SSD, namely deltaPointer and deltaUpdates. All updates for an adjacency list are appended to the deltaUpdates vector. The newly added updates for a vertex point to the previously added delta update for that vertex, and the deltaPointer entry for that vertex points to the index in the deltaUpdates vector that contains the latest delta updates for that vertex.

Graph accesses with DRAM logs and delta graphs: The graph access mechanisms must know of the presence of a delta graph for a given vertex to properly reconstruct the full graph. To mark the presence of a delta graph, GraphSSD sets a dirty bit in the GTL entry corresponding to that vertex. Thus when a graph access request is received, it first accesses the GTL entry to identify the physical page containing the original graph, and if the dirty bit is set in the GTL entry, the access mechanism then uses the deltaPointer to reach all subsequent updates to that vertex.

Merging the delta graph with the initial graph: While delta graphs allow graph updates to be handled gracefully in terms of write amplification, they do slow down graph accesses. As such, it is preferable to periodically merge delta graphs into the original graph. For merging the delta graph into the initial graph, we loop over the GTT and identify the GTT entries whose dirty bit is set. For these vertices, we access the delta modifications and merge them with the existing neighbor list data stored in the original graph. After all the vertex modifications have been merged into the graph, all the dirty bits in the GTT entries are cleared.

2.3.5 Consistency considerations

Any time a graph update is logged in DRAM, there is a risk of losing that state during a power failure. As is the case with file buffers in the OS that cache file content, during a power loss some of the data may be lost. What is important is that a consistent view is preserved after a reboot. To create a consistent view of the graph, every update must be performed atomically. For instance, during an edge weight update, a new page needs to be created with the updated edge weight. Even in the presence of a delta graph, at some point in the future a page update is initiated when merging delta graphs with the original graph. In this scenario we first create a redo-log entry before initiating the update process.
The redo-log stores the GTL entry (physical page number), the deltaPointer, and the deltaUpdate information for the vertex being updated. The update process then writes the new page first, then resets the GTL dirty bit, changes the GTL entry to point to the new page, and finally invalidates the old page and resets the deltaPointer entry. If there is a power failure either during the new page write or after the new page is written but before the GTL dirty bit is reset, on reboot the redo-log restarts the entire update process by selecting another page to write (the page that was being written before the power failure will be garbage collected just like other invalid pages, using the default SSD policies for handling write failures). If the power failure occurs after the GTL dirty bit is reset but before the GTL entry is updated with the new page entry, the redo-log sets the GTL dirty bit again and restarts the update process. If the power failure occurs after the GTL entry has been updated but before the old page is invalidated, the redo-log simply invalidates the old page and the deltaPointers. Note that several optimizations can be made to improve the performance of redo-logging. We currently focus on functionality and leave optimization for future work.

Apart from maintaining a consistent view of the delta graphs, we also need to maintain a consistent view of the GTT. The GTT is modified during processes such as garbage collection, when writing a delta graph to the SSD (to set a dirty bit), when merging the delta graph into the original graph, and so on. To reduce the write amplification due to these modifications, we first modify the cached GTT in SSD DRAM and eventually update the GTT stored in flash pages for persistency. A similar procedure is commonly used when updating the FTL in SSDs [99]. To avoid the loss of newer FTL updates during a sudden power loss, many SSDs use backup power circuitry [99]. This backup power is used by the SSD controller to flush the updated data in DRAM to flash pages. Similarly, in our work, we assume backup power circuitry to flush the newer updates to the GTT. Some SSDs use log-based journaling mechanisms to avoid relying on backup power circuitry for partial protection against power loss [99]. Similar log-based journaling mechanisms can also be adopted when updating the GTT if one does not want to rely on backup power circuitry.

Garbage collection: We use a single bit for every page in the storage to indicate whether that page is used by GraphSSD. If a page has been used for storing graph data, then while moving that NAND page during garbage collection, the garbage collector informs the GraphSSD runtime, which in turn updates the GTT entry. The page being moved already contains the smallest vertex id whose neighbor lists are stored in that page. We update the GTT entry for that vertex id with the new NAND page location where the data is moved.

2.3.6 GraphSSD cache manager

To improve the performance of graph applications when accessing storage, graph data is cached on the host side. For algorithms that request sequential vertex ids, such as page rank and graph filtering, there may be many requests to storage for nearby vertex data. Handling these requests at the storage adds considerable overhead and can dominate the application time. To reduce this overhead we implemented a cache manager, which caches the GTT on the host side and performs graph command handling.
The GraphSSD cache manager issues requests to fetch NAND pages on a cache miss. As many vertices may reside in a NAND page, a single NAND page fetched into the host side cache may serve many requests on the host side itself, thereby filtering requests to storage. The host side GTT is read-only; all update requests invalidate the cached GTT entry and the update is handled on the SSD itself. During garbage collection on the SSD, data in a NAND page may be written to another NAND page. If a NAND page storing graph data is moved, then the new NAND page number should be updated in the host GTL cache. For updating this NAND page number in the host side cache, the SSD controller automatically initiates a host cache invalidation request, which is handled by the GraphSSD cache manager.

2.3.7 Graph command handling examples

We summarize our system implementation using two example graph access APIs: the GetAdjacentVertices and GetEdgeWeight commands. For these commands we describe how we retrieve the required data from the NAND pages.

GetAdjacentVertices(vertexID, EdgeList): Using the requested vertex id, the GTT is accessed to get the NAND page numbers storing its adjacent vertices. If the GTT's extended bit is set, then we may have to search through the extended GTT entries to find the physical page. Once a physical page location is identified, if the dirty bit in the GTT entry is set then some of the neighbor information in that page has been modified. Hence, GraphSSD accesses the original graph page and the deltaUpdates vector. Concurrently, the host side logger searches the host side DRAM logs to find any cached or updated edge information for the given vertex. Finally, the information from the original graph page, the delta update page, and the host side DRAM logs is combined to create the EdgeList buffer, which is returned to the application.

GetEdgeWeight(vertexID1, vertexID2, EdgeWeight): Using vertexID1 we access the GTT and fetch all the pages containing the neighbors of vertexID1. Using vertexID1 we access the NAND page containing the neighbor lists via the host cache to find whether vertexID2 is a neighbor. As briefly mentioned earlier, a separate GTT structure is used to map a vertex id to the corresponding edge weight information, using the same graph layout structure as the edge connectivity information. We use the edge connectivity information to find the index location of the edge and then access the edge weight. Concurrently, the host side logger searches the host side DRAM logs to find any updated edge weight for the given edge. The information returned from the original graph pages is again reconciled with any updated edge weight information found in the DRAM logs to produce the most recent edge weight, which is returned to the application.

Algorithm 1 Code snippet of a BFS program using GraphSSD
 1: /* BFS request thread */
 2: Queue.push(root)
 3: while Queue not empty do
 4:   top_element = Queue top element
 5:   if top_element == required element then
 6:     Element found
 7:     Exit
 8:   Wait until an empty slot is available in the GraphSSD response queue
 9:   Wait until an empty slot is available in the GraphSSD request queue
10:   GraphSSD.GetAdjacentVertices(top_element, EdgeList)
11:
12: /* BFS response thread */
13: Wait until an EdgeList is available in the GraphSSD response queue
14: for i = 0; i < EdgeList.size(); i++ do
15:   if EdgeList[i] not already visited then
16:     Add EdgeList[i] to Queue

2.4 Workloads and implementation details

GraphSSD essentially provides semantic awareness to SSDs.
Instead of accessing SSDs with logical block addresses, it allows users to query graph related information. For this purpose, it implements the basic graph access commands listed in Table 2.1. Users and libraries can use these basic commands to build higher-level functions. We evaluated several traditional graph applications that stress the following features: 1) accessing the adjacent vertices of a requested vertex, 2) accessing the edge weight of a requested edge, and 3) updating edge weights.

2.4.1 Workloads

The applications evaluated include:

BFS: BFS identifies whether a given target node is reachable from a given source node. We implemented the BFS application as shown in Algorithm 1. This algorithm fetches the adjacent vertices of a given vertex id repeatedly. For evaluating BFS, we select one source vertex and then vary the distance at which the destination vertex may be found, at several levels along the longest path, at 3 equal intervals from the first level to the last level.

Connected components: This graph application counts the number of connected components. The approach uses BFS but performs the operations on all vertices.

Random walk: This application performs many random-walks starting from several source nodes. Starting from each source node, the application performs random-walks for several iterations, and in each iteration it walks a maximum number of steps. We implemented the efficient parallel random-walk algorithm described in [53]. This algorithm simulates random-walks in parallel, possibly from a large number of source vertices, and processes one vertex at a time. In a step, at each vertex, all walks currently visiting that vertex are processed and moved forward. We evaluated with 100K parallel random walks, with a maximum of 10 steps from the source, and considered several iteration values: 10, 50, 100, and 1000.

Maximal independent set: We implemented the maximal independent set algorithm as described in [90]. It is an iterative algorithm based on Luby's classic parallel algorithm [62].

Page rank: Page rank [82] is a classic graph algorithm, and our implementation sends edge updates only if they are greater than a certain threshold (0.4). We set the maximum number of iterations to 5.

Graph update benchmark: Since GraphSSD provides significant support for graph updates, we also implemented a graph update kernel that adds edge and vertex information. The updates are maintained as delta graphs and are finally merged into the initial graph. We intersperse the graph updates with a total of 1000 get-adjacent queries on vertices selected using the latest-read model. In the latest-read model, the newly added updates are accessed the most. This model represents widely used news feed and social media workloads, where newly posted data is accessed the most [19]. We consider 95% of the get-adjacent queries to be on vertices that are being updated or newly added to the graph.

Non-intrusive NVMe implementation: All the GraphSSD APIs are implemented by extending existing NVMe read/write commands. We used the unused bytes in read and write NVMe commands to specify the GraphSSD commands: these bytes carry the vertex id, the end vertices of an edge, and an opcode encoding that indicates the API operation being requested.

2.4.2 Baseline system

For the baseline system, we considered normal SSDs where the graph is stored in CSR format (described in 2.2.1). The baseline uses block based accesses to reach the rowPtr, colIdx, and val vectors.
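To make this two-step access pattern concrete, the following generic CSR lookup is shown; the vector names follow the chapter's rowPtr/colIdx/val convention, but the code is an illustrative in-memory sketch rather than the baseline's actual out-of-core implementation.

#include <cstdint>
#include <vector>

// Generic CSR neighbor lookup (illustrative only).
// rowPtr[v] .. rowPtr[v+1] delimit v's neighbors in colIdx (and weights in val).
struct CsrGraph {
  std::vector<uint64_t> rowPtr;   // size = numVertices + 1
  std::vector<uint64_t> colIdx;   // size = numEdges
  std::vector<float>    val;      // size = numEdges (edge weights)
};

std::vector<uint64_t> Neighbors(const CsrGraph& g, uint64_t v) {
  // First access touches rowPtr page(s); the second touches colIdx page(s).
  // On a block device these are two dependent reads, which is the serialization
  // the baseline must hide by issuing many queries in parallel.
  uint64_t begin = g.rowPtr[v];
  uint64_t end   = g.rowPtr[v + 1];
  return std::vector<uint64_t>(g.colIdx.begin() + begin, g.colIdx.begin() + end);
}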
Each access to the rowPtr, colIdx, or val vector is first translated to a physical page number using the traditional FTL. The baseline system also uses file caching on the host to cache multiple pages; the size of the host side file cache is 1GB in our implementation. As we show later in the results section, host side caching is critical for implementing a robust baseline that eliminates many NAND page accesses. We also compare results with the popular out-of-core GraphChi framework. When comparing with GraphChi we use the same host side memory cache size as GraphSSD. For all the applications, when executing over GraphSSD, the baseline, and GraphChi, application data other than the graph data, such as the visited vector in the BFS application and the value vector in the page rank application, is allocated in main memory.

2.4.3 Caching and Multi-threading

We implemented host side caching using an LRU policy for both GraphSSD and the baseline. A doubly linked list and a hashmap are used to implement the LRU policy efficiently. To simulate out-of-core graph algorithms, we consider a host cache size of 1GB as the default size.

To generate I/O request parallelism, both our baseline and GraphSSD implementations provide a non-blocking request interface to graph application threads. To support non-blocking calls, we implement a request queue and response queues. Graph data requests from the application are posted in the request queue, and graph data responses to those requests are pushed into the response queues. The cache manager is also parallelized using multi-threading to maximize throughput, and locks are used sparingly to synchronize between the threads as necessary.

From storage, GraphSSD loads NAND page granularity chunks into the host cache, as location pointers are stored at the end of the NAND page and vertex data is stored at the start of the NAND page. In our baseline we also load NAND page granularity chunks into the file buffer host cache.

2.5 Evaluation

We evaluated GraphSSD using the open-source SSD (OpenSSD) development platform [79, 97]. The OpenSSD development platform is equipped with a Xilinx Zynq-7000 programmable SoC that embeds a dual-core ARM Cortex-A9 processor [116]. Hence the FPGA-based programmable chip works as the SSD controller SoC on the SSD platform. The PCIe interface and NAND flash channels are implemented as hardware logic on the programmable gate arrays, and the embedded ARM core runs the SSD firmware implementing the command handling, page buffer management, and FTL functions. We implemented GraphSSD on the existing SSD firmware, modifying the FTL part, the command handling, and the host side library which manages the cache on the host side. The OpenSSD platform encloses 1 GB DDR DRAM and 2 TB of Hynix H27Q1T8YEB9R NAND flash DIMMs connected to the programmable SoC. The SSD board communicates with the host system via a PCIe Gen2 x8 interface, which supports up to 4 GB/s bandwidth. The NAND page size in the OpenSSD platform is 16KB. The host system uses a logical sector size of 4KB.
These graphs are all undirected graphs and for an edge, each of its end vertices appears in the neighboring list of the other end vertex. The datasets are listed in Table 2.2. For updates, we used a real-world YouTube dataset [75]. The graph has 1M vertices, 4.4M edges at the start. During the update process 38K new vertices and 550K new edges are added to the graph. Based on the timestamps in the dataset we divided these updates into 10 equal snapshots to test delta graph generation and merging functionality. Dataset name Number of vertices Number of edges com-friendster (CF) 124,836,180 3,612,134,270 YahooWebScope (YWS) 1,413,511,394 12,869,122,070 Table 2.2: Graph dataset 50 Figure 2.7: Relative performance of basic GraphSSD API Figure 2.8: Bandwidth for basic GraphSSD API (MB/second) 2.6 Experimental results 2.6.1 Performance of basic APIs We first present performance results of the basic graph access APIs that are provided by GraphSSD, namely GetAdjacentVertices and GetEdgeWeight APIs, before present- ing the application-level performance. We ran each command 1 million times using a random vertex or edge as the starting point for the query. GetAdjacentVertices API: The left bars in Figure 2.7 shows the relative performance of GraphSSD over the baseline for the GetAdjacentVertices API. GraphSSD outperforms 51 baseline (labeled as CSR GetAdjacentVertices by 1.85. There are two potential sources for performance improvement with GraphSSD. First, the baseline system accesses the rowPtr and colIdx vectors using block interfaces. GraphSSD on the other hand uses GTL to find the adjacent vertex list for a given vertex. When accessing SSD the baseline has to access two different NAND pages for the rowPtr and colIdx vectors but GraphSSD uses semantic knowledge through GTL to reduce NAND page access count. The number above each bar in the graph shows the number of NAND pages accessed to the nearest thousand. Even though there were million queries to random vertices, host side caching helps in reducing the number of NAND pages accesses. The performance of GraphSSD when compared to the baseline is not doubled, as rowPtr vector is compact when com- pared to colIdx vector, and baseline system caches that vector on the host side effectively to reduce the rowPtr related NAND page accesses. We also considered the performance degradation in the baseline due to the serializa- tion bottleneck. The baseline has to first access the rowPtr to find the index to access the colIdx. Hence, the two accesses may be sequentialized. Such a serialization will essentially reduce the SSD request rates which will result in lower data bandwidth be- tween SSD and the host. However, the delay due to the sequential nature of the two accesses can be hidden effectively if multiple parallel requests can be issued; then the rowPtr access and colIdx vectors of different requests can be interleaved. In fact as we described earlier, our baseline is highly multithreaded and is able to issue multiple read requests. Since we tested the GetAdjacentVertices API with million randomly selected vertex queries these queries can be issued in massively parallel manner, which in fact reduces the serialization bottleneck. This fact can be easily verified by looking at the 52 bandwidth utilization between SSD and the host as shown in Figure 2.8. The Y-axis shows the total number of MBs transferred per second between the SSD and the host for both the baseline and GraphSSD. Both approaches transfer roughly the same amount of data per second. 
Hence, the serialization bottleneck is not a critical performance limiter.

GetEdgeWeight API: The right side bars in Figure 2.7 compare the performance of loading the weight of an edge using the GetEdgeWeight API. GraphSSD outperforms the baseline by 1.42x. Unlike accessing the adjacent vertex lists, which may span many pages for densely connected vertices, accessing an edge weight requires accessing at most one more NAND page in the baseline where the weight is stored (and only on a host cache miss). Hence, the performance gap is narrower. Again, the parallelization of queries reduces any serialization bottleneck, as seen from the bandwidth utilization graphs in Figure 2.8 (right three sets of bars).

2.6.2 Application performance

We first use the random-walk application to present detailed results and analysis, followed by the performance results for all applications.

Random-walk: Figure 2.9a plots the GraphSSD performance (Y-axis) of random-walk normalized over the baseline system. The X-axis shows the number of iterations of random-walks done starting from each source node. The performance improvement of GraphSSD is on average about 1.6x when compared to the baseline system, over a wide range of the number of random-walks from a source node.

[Figure 2.9: Random-walk performance relative to baseline — (a) GraphSSD speedup, (b) NAND page access counts, (c) SSD bandwidth utilization; the X-axis is the number of random-walks performed]

[Figure 2.10: Application speedup relative to baseline — (a) BFS (X-axis is traversal depth), (b) connected components, (c) maximal independent set, (d) page rank]

In the case of the random-walk application, GraphSSD's performance improvements come from the two reasons we discussed earlier: NAND page access counts and bandwidth utilization. In this application, as the next vertex in the walk is visited randomly, the baseline is less effective in caching rowPtr. Using the compact representation of the GTL, GraphSSD can reduce the number of NAND page accesses. This observation is quantified in Figure 2.9b, which shows the NAND page accesses relative to the baseline. Performance benefits also come from the better bandwidth utilization of GraphSSD relative to the baseline. In the baseline, the serialization problem of accessing rowPtr before colIdx causes underutilization of the SSD bandwidth, but GraphSSD is able to better cache the GTL and access the adjacent vertex information in parallel, leading to improved bandwidth utilization as seen in Figure 2.9c.

[Figure 2.11: NAND page access counts relative to baseline — (a) BFS (X-axis is traversal depth), (b) connected components, (c) maximal independent set, (d) page rank]

[Figure 2.12: Bandwidth utilization relative to baseline — (a) BFS (X-axis is traversal depth), (b) connected components, (c) maximal independent set, (d) page rank]

All other applications: Figure 2.10 shows the performance improvements of GraphSSD over the baseline system for the other applications. For BFS, we break down the performance along the X-axis based on the fraction of the total number of levels in the graph that must be traversed to find a connection from the source node to the destination node. For instance, a value of 0.1 on the X-axis means the destination node was found after traversing one tenth of the longest path in the graph. We generated queries at four different depths, 0.1, 0.33, 0.66, and 1, and the results are plotted in the figure.
When compared to the baseline system, GraphSSD improves performance on average by 1.40x, 1.42x, 1.56x, and 1.29x for the BFS, connected components, maximal independent set, and page rank applications, respectively. Just as discussed earlier, the reason for the improved GraphSSD performance in these applications is a combination of fewer NAND page accesses and more effective utilization of bandwidth. Figure 2.11 and Figure 2.12 show the NAND page access counts and bandwidth utilization of the applications relative to the baseline, respectively.

2.6.3 Comparison with GraphChi

Figure 2.13 compares the performance of GraphSSD over GraphChi for two benchmarks, BFS and page rank. GraphChi is designed for the vertex-centric programming model, and applications such as BFS and connected components are ill-suited for this model: in most iterations, all the shards (partitioned graph data) are fetched from storage repeatedly even though only a few active vertices are actually used in the computation.

[Figure 2.13: Speedups relative to GraphChi — (a) BFS for the CF dataset, (b) BFS for the YWS dataset, (c) page rank]

Hence, even though we collected results, it would be unfair to GraphChi to compare its performance when the applications are not suited to the programming model it supports. The only purpose of plotting the GraphChi comparison results for BFS is to demonstrate that GraphSSD is more versatile, as it supports a broader set of programming models. The first two graphs in Figure 2.13 show the significant performance degradation with GraphChi for the two graph datasets used in the evaluation. Clearly, accessing all the vertices in each iteration, as required by the vertex-centric programming model, is not well suited for BFS. As the fraction of the traversed graph increases, the performance of GraphChi worsens, since many iterations may be necessary for the vertex-centric programming model to converge, and during each iteration all the graph shards may be read by GraphChi.

On the other hand, applications such as PageRank are better suited for vertex-centric programming models, where many vertices may be traversed repeatedly. Figure 2.13c compares the PageRank application on GraphSSD over GraphChi. GraphSSD still outperforms GraphChi by 2.62x even for PageRank. Even though PageRank updates many vertices, not all vertices are active in each iteration. The sharding based data structure in GraphChi requires almost all shards to be fetched into memory for each iteration, even though the number of compute operations per shard is quite small in some shards. The variations seen across different graphs are simply a function of the graph structure and edge weights, which influence the PageRank convergence process.

2.6.4 GraphSSD Overheads

With the GraphSSD data layout, the GTT size for the com-friendster and YahooWebScope graphs is 8MB and 34MB, respectively. These GTT structures are quite small compared to the typical DRAM size in SSDs. The reason for the small GTT size is the compact layout we propose for GraphSSD, where multiple vertices that fit in a single page share a single GTT entry. Packing multiple vertices into one GTT entry is very common in both the graphs we studied; only rarely does a vertex require multiple GTT entries (0% in CF and 0.01% in YWS).
Since all these graphs are dominated by sparsely connected vertices, the compact layout of GraphSSD is another key element for boosting performance, and the resulting GTT is small. If one were to use one entry per vertex, the size of the GTT would be at least 1GB and 11.2GB, respectively, for the two graphs studied in this work.

[Figure 2.14: Graph update performance relative to a baseline where graphs are not updated — (a) performance with delta snapshots, (b) latency of the merge process with varying degrees of empty page slots, (c) performance during the merge process]

Space overhead: While packing the adjacent vertices into a NAND page, for simplicity of design, we use a new NAND page if the adjacent vertices do not fit in the existing free space (described in Section 2.3.2). The space overhead due to this design choice is 4.8% and 4.9% for the CF and YWS graphs, respectively. This space overhead can be avoided with a more complex design that tracks a vertex's neighbors split across pages.

2.6.5 Graph updates

Figure 2.14a shows the performance of accessing graphs that are continually updated as described in the graph update benchmark earlier. The X-axis shows the number of delta graphs that are chained; an X-axis value of 0 implies that no graph updates have been performed on the graph (which is the baseline for this comparison). The performance of GetAdjacentVertices degrades slowly with an increasing number of updates stored as delta graph chains. Recall that accessing the graph after multiple updates may require traversing the deltaPointer chains, which slows down performance.

Figure 2.14b shows the latency of the merge process with varying degrees of empty slots available in a page to prevent page spills. With 10% empty space in each page, the merge process takes 2.4 seconds in our graph benchmark; the time increases to 2.7 seconds when the empty space per page is limited to 5%, which leads to page splits when a new edge or vertex is added. The time increases to 3.9 seconds when the empty slots are just 1% of the total page size. Since the merge process itself is slow, any access during the merge may see variable latency depending on whether the accessed vertex is already merged, in which case it is a normal page access, or not yet merged, in which case the access may need to go through a chain of pointers. Figure 2.14c shows the variability in access latency, averaged over 1000 random GetAdjacentVertices queries, relative to a graph that has no updates (or has already been fully merged). The X-axis shows the fraction of the graph updates that are already merged, in 10 equal intervals of 10% each. The latency penalty decreases steadily as the merge process progresses. Further, the performance with different levels of empty space availability in a page is almost the same, even though they lead to different page splits, as we use random GetAdjacentVertices queries and there is little caching benefit.

[Figure 2.15: Performance with varying delta graph sizes]

2.6.6 Wear-levelling analysis

SSD pages support only a limited number of writes [112]. For a longer SSD lifetime, it is critical to minimize the number of writes to SSD pages. Grouping several updates together, instead of performing each update separately, helps improve the SSD lifetime. If an update involves adding one neighbor to a vertex, then by grouping several such vertex updates we can improve the lifetime by up to 2048 times.
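The 2048x figure is consistent with the 16KB NAND page size used in this work if one assumes that each logged update occupies 8 bytes; this accounting is our reading of the numbers rather than a derivation stated explicitly here: 16,384 B (one 16KB page) / 8 B per update = 2048 updates batched into a single page write instead of 2048 separate page writes.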
During the merge process in GraphSSD, one only needs to merge NAND pages that have updates. If any of the updated NAND pages overflow, then empty NAND pages can be used to store these updates (described in Section 2.3.3). For graphs whose updates touch fewer NAND pages, this enables faster merging of the updates when compared to the CSR baseline, in which the entire graph structure needs to be written to another place [64]. Reducing the number of NAND pages written helps reduce the critical wear-out of NAND pages in the SSD, as they can only support a limited number of writes during their lifetime.

2.7 Related work

2.7.1 Storage system

Exploiting the computation power of the SSD controller SoC can be an opportunity for offloading the computation burden and curtailing the data traffic in the SSD for data-intensive applications. Active disk research proposed the possibility of in-storage processing for data-intensive applications with a data processor on the storage disk platform [1, 88]. A plethora of complex in-storage processing applications have been proposed as SSD controllers embed more powerful application processors to handle massive data parallelism from multi-channel NAND flash memory [12, 18]. Summarizer [51], Active Flash [12, 101, 102], SmartSSD [21, 47], Active Disks Meets Flash [18], and Biscuit [32] try to utilize the embedded cores in a modern SSD to reduce redundant data movement and free up the host CPUs and main memory. Active Flash [101], for instance, presents an analytic model for evaluating the potential of in-SSD computation. Summarizer [51] presents a detailed description of an application development environment that enables offloading work to the SSD controller. SmartSSD [21] focuses on how to improve specific database operations, such as aggregation, using in-SSD computation. Biscuit [32] uses a flow-based programming model to enable code offloading to SSDs. The SSD's data processing ability has also been exploited to support a key-value interface. KAML proposed a key-value interface for SSDs instead of the traditional block-based I/O interface [40]; to efficiently support key-value storage it uses hash-based key-value mapping tables. To the best of our knowledge, GraphSSD is the first investigation to employ a graph semantics aware mapping structure in an SSD.

2.7.2 Graph processing

Researchers have focused on enhancing the performance of graph applications on both the hardware and software fronts. Hardware approaches exploit graph processing acceleration engines near memory, since a large graph structure requires huge data transfers from main memory. Ahn et al. proposed the Tesseract hardware accelerator that uses in-memory processing to improve memory bandwidth utilization [2]. Graphicionado is a graph-specific hardware accelerator which utilizes the vertex programming model observed in a wide range of graph applications [33]. These works optimize the memory system to guarantee data parallelism for vertex traversals. While prior graph hardware accelerators focus on graphs that fit in main memory, GraphSSD tackles the data I/O bottleneck of large graph data structures on storage devices. ExtraV performs hardware graph acceleration in front of storage to reduce the burden of graph management on the host processor and also supports processing-intensive compression to reduce storage accesses [55]. However, it uses the SSD as a block device and is orthogonal to our work.
Such hardware acceleration approaches in the front end of storage can also be employed with GraphSSD to further improve performance.

On the software front, on a single node system, GraphChi [54] and PartitionedVC [70] are out-of-memory systems which try to efficiently support vertex-centric graph programming. GraphChi proposes a sharding based graph format to reduce random I/O accesses from storage. PartitionedVC improves over GraphChi by accessing storage based on the number of active vertices/edges in a vertex-centric superstep. TurboGraph is an external-memory graph engine targeting graph algorithms expressed as sparse matrix-vector multiplication [35]. However, it is difficult to implement graph applications such as triangle counting on such a framework, and it uses large page sizes in multiples of MBs, making it inefficient to read adjacent vertices selectively. FlashGraph implements a graph engine on top of an SSD file system to maximize parallel execution [123]. In order to hide data transfer overhead, FlashGraph also overlaps graph computation and data I/O from SSDs, where only edge lists are stored. Mosaic [63] accelerates out-of-core graph processing on heterogeneous machines. All the existing out-of-core systems only treat flash as a block device. GraphSSD, on the other hand, makes the SSD aware of graph semantics and proposes a layout scheme that takes the SSD organization into consideration. By customizing a widely used compressed graph representation, we reduce the number of costly NAND page accesses needed to reach graph data in out-of-core graph processing, and we support efficient merging of graph updates, which hitherto was a problem when using the compressed graph representation. CSR format based graph processing works show that the format is efficient for out-of-core graph processing and performs better when compared to other out-of-core graph formats such as the GraphChi format [43, 44, 55, 64, 77]. In this work, we consider a CSR format based graph framework and GraphChi as baselines and show significant performance improvements.

GraFBoost is a vertex-centric programming model for out-of-memory graph analytics that reduces the overhead of random updates by using a log structure [43]. This work is orthogonal to our approach of reducing random accesses to the graph data via a NAND page organization aware graph layout. GraFBoost can potentially be incorporated within GraphSSD to further improve performance.

There have been several works to support dynamically changing graphs. Kineograph, Chronos, and Stinger are in-memory graph processing frameworks for handling dynamic graphs [17, 24, 34]. GraphSSD handles dynamic graphs that go beyond in-memory graphs. LLAMA proposes efficient support for whole-graph analysis on consistent views of data while supporting streaming graph updates [64]. It proposes to store the streaming graph updates in snapshots alongside the initial graph, and to merge a batch of snapshots to amortize the costly merge operation, limit the space utilization of snapshots, and reduce the performance penalty of accessing the snapshots containing the latest graph updates. GraphSSD's delta graphs are inspired by the LLAMA approach.

2.8 Chapter Summary

Graph applications access increasingly large graphs that are hobbled by storage access latency. In this work, we proposed GraphSSD, a graph-semantics-aware SSD framework that allows storage controllers to directly access graph data natively on the flash memory.
We presented the graph translation layer (GTL), which translates vertex IDs directly to physical page addresses on the flash memory media. In conjunction with GTL, we proposed an efficient indexing format that reduces the overhead of GTL with only a small increase in the per-page metadata overhead. We also presented multiple optimizations to handle graph updates using delta graphs, which are merged to reduce the update penalty while at the same time balancing write amplification concerns. We implemented the GraphSSD framework on an SSD development platform to show the performance improvements over two different baselines. Our evaluation results show that the GraphSSD framework improves performance by up to 1.85x for the basic graph data fetch functions and on average by 1.40x, 1.42x, 1.60x, 1.56x, and 1.29x for the widely used breadth-first search (BFS), connected components, random-walk, maximal independent set, and pagerank applications, respectively.

Chapter 3
MultiLogVC: Efficient Out-of-Core Graph Processing Framework on Flash Storage

3.1 Chapter Overview

GraphSSD demonstrated the benefits of exposing graph semantics to the storage controller. It also demonstrated that CSR-formatted graphs are well suited for graph applications running on SSDs. Prior to the SSD era, several out-of-core (also called external memory) graph processing systems focused on reducing random accesses to hard disk drives while operating on graphs. These systems primarily operate on graphs by splitting the graph into chunks and operating on each chunk that fits in main memory. For instance, GraphChi [54] and many follow-on research papers [5, 110, 121] format graphs into shards, where each shard has a unique data organization that minimizes the number of random disk accesses needed to process one chunk of the graph at a time [5, 43, 59, 123].

In terms of programming, many popular graph processing systems use the vertex-centric programming paradigm. This computational paradigm uses bulk synchronous parallel (BSP) processing where each vertex is processed at most once during a single superstep, which may generate new updates to its connected vertices. The graph is then iteratively processed in the following superstep. A vertex can modify its state, generate updates to another vertex, or even mutate the topology of the graph in each superstep.

The repeated supersteps of execution in vertex-centric programming, however, cause a major data access hurdle for shard-based graph frameworks. We show later in this work that as supersteps progress, the number of active vertices that receive messages, and hence must process those messages, continuously shrinks. Shard-based graph frameworks are unable to limit their accesses to this limited number of active vertices. In each superstep, they still have to read whole shards into memory to access the active vertices that are embedded in these shards.

It is this observation that the active vertex set shrinks with each superstep that our work exploits as a first step. Given that shard-based graph formats do not support accessing just the active vertex set, we rely on an efficient compressed sparse row (CSR) representation of the graph to achieve this goal. The GraphSSD results presented in the previous chapter already demonstrated how to exploit the CSR format to reduce page indirections to reach vertex data.
However, GraphSSD's primary focus is to enable fast access to primitives such as obtaining the list of neighbors of a given vertex, or the weight of an edge between two vertices. These primitives may be used within each superstep to speed up its execution. However, during a superstep each vertex may send messages to a subset of its neighbors, and in the next superstep each vertex gathers all the messages that were sent on its input edges and processes them. Hence, when sending messages each vertex needs to access its out-edges, and when processing messages each vertex needs to access its in-edges.

It is this dual requirement to access both the in-edges and out-edges of a vertex in each superstep that leads to hurdles with the CSR format. The CSR format can place either incoming or outgoing edges in contiguous locations, but not both (more details in the next section). As a result, CSR-formatted graphs lead to random accesses to obtain either the in-edge or out-edge lists. To resolve this challenge, in this chapter we present a novel multi-log graph processing paradigm. This paradigm uses CSR-formatted graphs with contiguous incoming edges, which would otherwise require random accesses to reach the out-edges of a vertex. To eliminate these random accesses, it uses logs for propagating the messages on the out-edges of a vertex. This approach logs all the outgoing messages into a collection of logs, where each log is associated with an interval of vertices. Logging the messages, as opposed to embedding the messages as metadata into each edge, decouples the message placement from the organization of edges in the CSR format, thereby resolving the random access hurdle.

The last challenge tackled by this work is that, when using log-based message processing, each superstep still has to read the edge lists from storage. As we show later in our analysis, many of the edge list pages accessed in the SSD contain edges that belong to inactive vertices. Hence, to improve edge list access efficiency, we optimize the multi-log design by logging the edge lists of the vertices that are likely to be active in the next superstep. We use a simple history-based prediction to determine whether a vertex will be active in the next superstep.

Compared to prior approaches that also rely on log-based graph processing [26, 43], our approach differs in two important ways. Most log-based graph processing frameworks use a single log and merge the messages bound to a single destination vertex to optimize log access latency. However, many important graph algorithms, such as community detection [87], graph coloring [30], and maximal independent set [66], must process all the messages individually. Hence, approaches that merge messages do not permit the execution of many important classes of graph algorithms. However, preserving all the messages without any merge process leads to an extremely large log, thereby creating significant log processing overhead at the beginning of each superstep [43]. In fact, the cost of sorting the logs was one reason that prior approaches merge messages bound to a destination vertex; without merging, the cost of sorting would be prohibitive. Inspired by these challenges, we design a multi-log graph processing paradigm that allows efficient access to smaller logs, based on the vertex interval that is currently being processed.
By preserving the messages without merging, we enable all graph algorithms to be executed within our framework, thereby supporting the generality of GraphChi-based frameworks [5, 54, 110, 121]. At the same time, we use CSR-formatted graphs augmented with multi-log structures to access the active vertex set, thereby drastically reducing the number of unwanted page accesses to storage. Our main contributions in this work are:

• We propose an efficient external memory graph analytics system, called MultiLogVC (multi-log with vertex-centric generality), which uses a combination of the CSR graph format and message logging to enable efficient processing of large graphs that do not fit in main memory. Unlike shard-based graph structures, the CSR format enables accessing only the pages containing active vertices in a graph. This capability is a key requirement for efficient graph processing since the set of active vertices shrinks significantly with each successive superstep. Hence, the CSR format reduces read amplification.

• To efficiently access all the updates in a superstep, MultiLogVC partitions the graph into multiple vertex intervals for processing. All the outgoing messages are placed into multiple logs indexed by the destination vertex interval. All the updates generated by one vertex interval for another vertex interval are stored in the corresponding log. When an interval of vertices is scheduled for processing, all the updates it needs are located in a single log. We use the SSD's capability of providing parallel writes to multiple channels to concurrently handle multiple vertex interval logs with only small buffers on the host side.

• Even with message logging, MultiLogVC must still access the outgoing edge lists of each active vertex to send messages for the next superstep. Our analysis of edge list accesses showed that many pages contain edge lists of inactive vertices interspersed with active vertices. To further reduce the read amplification associated with edge lists, MultiLogVC logs the edge list of any potentially active vertex in the current iteration so that this edge list can be read more efficiently from the log, rather than reading a whole page of unneeded edge lists.

The remainder of this chapter is organized as follows: Section 3.2 introduces the background for out-of-core graph processing and the motivation for this work. The detailed architecture is described in Section 3.5. The implementation methodology, evaluated applications, and the experimental results are presented in Sections 3.6, 3.7, and 3.8, respectively. Related work is provided in Section 3.9, and we conclude in Section 3.10.

3.2 Background and Motivation

3.2.1 Out-of-core graph processing

In the out-of-core graph processing context, graph sizes are considered large compared to the main memory size but fit within the storage capacity of current SSDs (in terabytes). GraphChi [54] is a representative out-of-core vertex-centric programming system and many further works built on top of this basic framework [5, 110, 121]. We will thus describe the GraphChi graph format and the challenges with shard-based graph processing.

GraphChi partitions the graph into several vertex intervals, and stores all the incoming edges to a vertex interval as a shard. The shard structure was briefly introduced in the introduction and we provide some additional details here. Figure 3.1b shows the shard structure for an illustrative graph shown in Figure 3.1a.
For instance, shard1 stores all the incoming edges of vertex interval V1, shard2 stores V2's incoming edges, and shard3 stores the incoming edges of all the vertices in the interval V3-V6. While incoming edges are closely packed in a shard, the outgoing edges of a vertex are dispersed across other shards. In this example, the outgoing edges of V6 are dispersed across shard1, shard2, and shard3. Another unique property of the shard organization is that each shard stores all its in-edges sorted by source vertex.

Figure 3.1: Graph storage formats. (a) CSR format representation for the example graph: val = [4, 8, 4, 3, 5, 3, 2, 1], colIdx = [2, 1, 2, 1, 2, 3, 4, 5], rowPtr = [1, 2, 2, 4, 4, 4, 9]. (b) GraphChi shard structure for the example graph, with each in-edge listed as (src, dst, val): shard1 (V1) holds (3, 1, 8) and (6, 1, 3); shard2 (V2) holds (1, 2, 4), (3, 2, 4), and (6, 2, 5); shard3 (V3-V6) holds (6, 3, 3), (6, 4, 2), and (6, 5, 1).

GraphChi relies on this shard organization to process vertices in intervals. It first loads into memory the shard corresponding to one vertex interval, as well as all the outgoing edges of those vertices, which may be stored across multiple shards. Updates generated during processing are directly passed to the target vertices through the outgoing edges in other shards in memory. Once the processing for a vertex interval in a superstep is finished, its corresponding shard and its outgoing edges in other shards are written back to the disk. GraphChi relies on sequential accesses to disk data and minimizes random accesses.

However, in the following superstep, only a subset of vertices may become active. Due to the shard organization above, even if a single vertex is active within a vertex interval, the entire shard must be loaded since the in-edges for that vertex may be dispersed throughout the shard. For instance, if any of V3, V4, V5, or V6 is active, the entire shard3 must be loaded. Loading a shard may be avoided only if none of the vertices in the associated vertex interval is active. However, in real-world graphs, the vertex intervals typically span tens of thousands of vertices, and during each superstep the probability of at least one vertex being active in a given interval is very high. As a result, GraphChi in practice ends up loading all the shards in every superstep, independent of the number of active vertices in that superstep.

3.2.2 Shrinking size of active vertices

To quantify the number of superfluous page loads that must be performed by shard-based graph processing frameworks, we measured the active vertex and active edge counts in each superstep while running the graph coloring application. Complete experimental details are described in Section 3.7. For this application, we ran a maximum of 15 supersteps. Figure 3.2 shows the active vertex and active edge counts as a fraction of the total vertices and edges in the graph, respectively. The x-axis indicates the superstep number, the major y-axis shows the ratio of active vertices divided by total vertices, and the minor y-axis shows the number of active edges (updates sent over an edge) divided by the total number of edges in the graph. The fraction of active vertices and active edges shrinks dramatically as supersteps progress. However, at the granularity of a shard, even the few active vertices lead to loading many shards, since the active vertices are spread across the shards.

Figure 3.2: Active vertices and edges over supersteps
3.3 CSR format in the era of SSDs

Given that GraphChi must load a significant fraction of the shards even when only a few vertices are active, we evaluated a compressed sparse row (CSR) format for graph processing. The CSR format has the desirable property that one may load just the active vertex information more efficiently. Large graphs tend to be sparsely connected. The CSR format takes the adjacency matrix representation of a graph and compresses it using three vectors. The value vector, val, stores all the non-zero values from each row sequentially. The column index vector, colIdx, stores the column index of each element in the val vector. The row pointer vector, rowPtr, stores the starting index of each row (adjacency matrix row) in the val vector. The CSR representation of the example graph is shown in Figure 3.1a: the edge weights of the graph are stored in the val vector, and the adjacent outgoing vertices are stored in the colIdx vector. To access the adjacent outgoing vertices of a vertex in the CSR graph storage format, we first access the rowPtr vector to get the starting index in the colIdx vector, where the adjacent vertices of that vertex are stored in a contiguous fashion. Because all the outgoing edges of a vertex are stored in contiguous locations in the CSR format, accessing the adjacency information of only the active vertices minimizes the number of pages accessed in an SSD and reduces the read amplification.
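To make this access pattern concrete, the following minimal C++ sketch shows how the out-neighbors and edge weights of one vertex are located in a CSR-formatted graph. The struct and function names are ours for illustration (they are not part of the MultiLogVC code), and the arrays are 0-indexed, unlike the 1-indexed example in Figure 3.1a.

#include <cstdint>
#include <cstdio>
#include <vector>

// CSR arrays as described above (0-indexed illustration).
struct CSRGraph {
    std::vector<uint64_t> rowPtr;  // size |V|+1: start of each vertex's edge run in colIdx/val
    std::vector<uint32_t> colIdx;  // size |E|: destination vertex of each edge
    std::vector<float>    val;     // size |E|: weight of each edge
};

// Visit all out-edges of vertex v: one rowPtr lookup, then a contiguous scan.
void forEachOutEdge(const CSRGraph& g, uint32_t v) {
    for (uint64_t e = g.rowPtr[v]; e < g.rowPtr[v + 1]; ++e) {
        uint32_t neighbor = g.colIdx[e];
        float weight = g.val[e];
        std::printf("edge %u -> %u (weight %.1f)\n", v, neighbor, weight);
    }
}

Because the edge run of a vertex occupies consecutive colIdx and val entries, fetching the adjacency of an active vertex touches only the SSD pages covering that contiguous byte range; reaching the in-edges of the same vertex, by contrast, would require scanning arbitrary positions of colIdx.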
3.3.1 Challenges for graph processing with a CSR format

While the CSR format looks appealing for accessing active vertices, it suffers from one fundamental challenge. One can maintain either the in-edge list or the out-edge list in the colIdx vector, but not both (due to coherency issues with having the same information in two different vectors). Consider the case where the adjacency list stores only in-edges: during a superstep, all the updates sent on the out-edges generate many random accesses to the adjacency lists to extract the out-edge information. Similarly, if the adjacency list stores only out-edges, then at the beginning of a superstep each vertex must parse many adjacency lists to extract all its incoming messages.

3.4 MultiLogVC: Three Key Insights

In this section we lay out the three key insights that lead to the design of MultiLogVC.

3.4.1 Avoid random write overhead with logging

To avoid the random access problem with the CSR format, we exploit the first observation: updates to the out-edges do not need to be propagated by accessing the edge list. Instead, these updates can simply be logged separately. The logged message has to store the destination vertex alongside the edge update so that the log can track which destination vertex the message is bound for. Hence, logging requires the addition of a destination vertex field to the message. Logging the messages has two benefits. First, like any log structure, all the message writes occur sequentially, enabling an SSD to efficiently write the messages to storage for processing later. Second, one can decouple the message logs from the CSR storage format inefficiencies, namely random accesses to either the in-edge or out-edge lists as described earlier. Thus, we propose to log the updates sent between the vertices, instead of directly updating the edge values in the CSR format. Note that logging has been proposed in prior graph processing works [26, 43]. But as we explain below, prior logging schemes did not exploit the second key observation we make in this work.

Prior works maintain a single log for all the updates in a superstep [43]. But at the start of the next superstep the entire log must be parsed to find all the messages bound to a given destination vertex. Almost all prior works employ sorting of the log (based on a vertex ID) to efficiently extract all the messages bound for that vertex. In the worst case, the number of updates sent between the vertices may be proportional to the number of edges. Hence, the log itself must be stored in the SSD. Even if only a small fraction of edges receives an update, the updates may still overwhelm the host memory, given the size of the graphs. Thus, one has to do external sorting of the message log. Prior works either built a custom accelerator [43] or proposed approaches to reduce the sorting overhead [26].

3.4.2 Eliminate sorting overhead with a multi-log

In this work, we exploit the second key observation to get around the sorting constraint. Namely, at any time only a subset of vertices is being processed from the entire graph. Even using a highly parallel processor, the number of vertices that can be handled concurrently is limited. Hence, only the incoming messages bound for the currently scheduled vertices must be extracted from the log. Thus the process of sorting can be narrowed to a few vertices within the log at a time and can be performed concurrently while a previous batch of vertices is being processed.

To enable the above capability, MultiLogVC creates a new multi-log structure. Multiple logs are maintained to hold the messages bound for different vertex intervals. We partition the graph into several vertex intervals and associate one log with each interval. As such, we create a coarse-grain log for an interval of vertices that stores all the updates bound to those vertices. As vertices place outbound messages to destination vertices, the destination vertex interval is used as an index to place each message in the corresponding log.

We choose the size of a vertex interval such that typically the entire update log corresponding to that interval can be loaded into the host memory and used for processing by the vertices in that interval. As the entire update log of a vertex interval can fit into host memory, MultiLogVC avoids the costly step of performing an external sort to group the updates bound to a vertex.

3.4.3 Reduce read overhead with an edge log

When a vertex interval is processed using a multi-log, one still has to read the outgoing edge information from the CSR so that the processed updates can be placed on the out-edges for processing in the next superstep. The third key observation is that the process of reading the out-edges for each vertex in a vertex interval leads to read amplification. Since the minimum read granularity of an SSD is a page (typically 8-16 KB), one has to read an entire page of the edge list vector that stores the outgoing edges of a particular vertex. Given that many real-world graphs exhibit a power-law distribution, the vast majority of SSD pages contain the out-edges of multiple vertices. Hence, to read the outgoing edges of a single active vertex, an entire page must be fetched. We measured the fraction of edge list page data that is actually necessary to process a vertex. The data is shown in Figure 3.3.

Figure 3.3: Accessed graph pages with less than 10% of utilization
The X-axis shows a set of graph applications running on two different graph datasets. The Y-axis shows the percentage of page data that was actually used to process a superstep. As can be seen from this figure, nearly 32% of the accessed pages have greater than 0% and less than 10% of their data needed for processing. Thus, more than 90% of the vertex data in such a page is not needed but is nonetheless fetched from flash due to the page-granular access. The inefficient use of page data leads to significant read amplification, in terms of wasted fetch bandwidth as well as power consumption to move the data.

To tackle this challenge, we design an innovative scheme that places all the outgoing edges of likely active vertices in a separate edge log. We initiate the out-edge logging process while processing an active vertex in the current superstep if that vertex is likely to be active again in the next superstep. Thus, the edge log contains the outgoing edges of all likely active vertices, drastically reducing the read amplification overhead. Note that whether a vertex is active in the next superstep is clearly known if there is already a message bound for that vertex in the current superstep. Hence, as vertex intervals are processed, the active vertex list for the next superstep becomes increasingly obvious. However, in the early part of a superstep we need to predict whether a vertex is likely to be active. We describe the prediction process in the next section.

Figure 3.4: Layout of the MultiLogVC framework

3.5 Multi-log Architecture

In this section, we describe the design and implementation details of the MultiLogVC architecture. There are two components in MultiLogVC. The first is a set of APIs that are used within each superstep to read, write, and sort logs in service of a user-defined graph application. Algorithm 2 shows the pseudo-code of a single superstep's actions in MultiLogVC and the various APIs that activate MultiLogVC functionality. The second component is the set of memory buffers that store the various logs accessed by the APIs. Figure 3.4 shows these modules as highlighted boxes and their interactions with existing OS and SSD functions. An example code snippet for the community detection graph application is shown in Algorithm 3, which uses the ProcessVertex() function to provide the vertex-centric computation related to the application.

Algorithm 2 Overview of a superstep in MultiLogVC
1: for all vertex intervals do
2:   U_log = LoadLog(V_in)               // Load vertex interval updates
3:   S_log = SortNGrpLog(U_log)          // Group updates on vertex #
4:   A = ExtractActiveVert(S_log, A)
5:   for each vertex V_act in A do
6:     V_inf = LoadVertexInfo(V_act)     // Get V_act edges
7:     V_actUpdates = ExtractUpdates(S_log, V_act)  // Extract updates bound for V_act from S_log
8:     ProcessVertex(V_act, V_inf, V_actUpdates)
9: function SENDUPDATE(v, m)
10:  vint_i = vId2IntervalMap(v)
11:  Get vint_i's top page in the log buffer
12:  Append update m to the top page
13:  Flush a page to the SSD on log overflow
Algorithm 3 Code snippet of the community detection program
1: function PROCESSVERTEX(VertexId V_act, VertexData V_inf, VertexUpdates V_actUpdates)
2:   for each update m in V_actUpdates do
3:     V_inf.edge(m.source_id).set_label(m.data)
4:   new_label = frequent_label(V_inf.edges_label)
5:   old_label = V_inf.get_value()
6:   if old_label != new_label then
7:     V_inf.set_value(new_label)
8:     for each edge in V_inf.edges() do
9:       m.source_id = V_act
10:      v_dest = edge.id(), m.data = new_label
11:      SendUpdate(v_dest, m)
12:  deactivate(V_act)

3.5.1 Multi-Log Update Unit

The multi-log update unit is responsible for managing the logs associated with each vertex interval. Its functionality is activated whenever ProcessVertex() uses SendUpdate() to send a message m to a destination v_dest. SendUpdate first determines the vertex interval vint_i of the vertex v_dest using the vId2IntervalMap function. It then appends the message m to the log log_i associated with that vertex interval. Each message appended to the log is of the format <v_dest, m>.

3.5.1.1 Generating vertex intervals

Typically, in vertex-centric programming updates are sent over the outgoing edges of a vertex, so the number of updates received by a vertex is at most the number of incoming edges of that vertex. Recall that all updates sent to a single vertex interval are sorted and grouped in the next superstep using an in-memory sorting process. Hence, the need to fit the updates of each vertex interval in main memory determines the size of the vertex interval. MultiLogVC conservatively assumes that there may be an update on each incoming edge of a vertex when determining the vertex interval size. It statically partitions the vertices into contiguous segments such that the sum of the number of incoming updates to the vertices in a segment is less than the memory allocated for the sorting and grouping process. This memory size could be limited by the administrator or the application programmer, or by the size of the virtual machines allocated for graph processing. In our implementation, the amount of memory allocated for bringing in all the updates bound to the vertex interval currently being processed is limited to a fraction (shown as X% in Figure 3.4, set to 75% by default) of the total available memory in our virtual machine (1 GB by default).

3.5.1.2 Fusing vertex intervals

Due to the conservative assumption that there may be a message on each incoming edge, as supersteps progress, the number of messages bound to a vertex interval decreases, and their total size may be less than the allocated memory. To improve memory usage efficiency, MultiLogVC's runtime may dynamically fuse contiguous vertex intervals into a single large interval and process them at once, as we describe later. To enable fusing, MultiLogVC maintains a count of the messages received at each vertex interval to estimate the size of each vertex interval log.
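As an illustration of the static partitioning described in Section 3.5.1.1 and of the size estimate used for fusing, the sketch below greedily accumulates vertices into an interval until their combined in-degree (the worst-case number of updates they can receive) would exceed the memory budget reserved for sorting, and shows how vId2IntervalMap can be a simple search over contiguous intervals. The helper names and types are our own assumptions, not the actual MultiLogVC code.

#include <cstdint>
#include <vector>

struct Interval { uint32_t begin, end; };   // one vertex interval: [begin, end)

// Partition vertices so that, assuming one update per incoming edge,
// each interval's update log fits within sortBudgetBytes.
std::vector<Interval> makeIntervals(const std::vector<uint64_t>& inDegree,
                                    size_t updateBytes, size_t sortBudgetBytes) {
    std::vector<Interval> intervals;
    uint32_t begin = 0;
    size_t bytes = 0;
    for (uint32_t v = 0; v < inDegree.size(); ++v) {
        size_t worstCase = inDegree[v] * updateBytes;        // conservative per-vertex estimate
        if (bytes + worstCase > sortBudgetBytes && v > begin) {
            intervals.push_back({begin, v});                 // close the current interval
            begin = v;
            bytes = 0;
        }
        bytes += worstCase;
    }
    intervals.push_back({begin, static_cast<uint32_t>(inDegree.size())});
    return intervals;
}

// vId2IntervalMap: with contiguous intervals, a binary search on interval ends suffices.
uint32_t vId2IntervalMap(const std::vector<Interval>& intervals, uint32_t v) {
    uint32_t lo = 0, hi = static_cast<uint32_t>(intervals.size()) - 1;
    while (lo < hi) {
        uint32_t mid = (lo + hi) / 2;
        if (v < intervals[mid].end) hi = mid; else lo = mid + 1;
    }
    return lo;   // index of the interval (and hence the log) that owns vertex v
}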
3.5.1.3 SSD-centric log optimizations

To efficiently implement logging, MultiLogVC first caches log writes in main memory buffers, called the multi-log memory buffers. Buffering helps to reduce fine-grained writes, which in turn reduces write amplification to the SSD storage. MultiLogVC maintains memory buffers in chunks of the SSD page size. Since any vertex interval may generate an update to a target vertex that may be present in any other vertex interval, at least one log buffer is allocated for each vertex interval in the entire graph. In our experiments, even with the largest graph, the number of vertex intervals was on the order of a few thousand (fewer than 5,000). Hence, at least several thousand pages may be allocated in the multi-log memory buffer at one time. This log buffer, which stores the updates for the next iteration, is shown occupying A% in Figure 3.4, which is about 5% of the total memory.

For each vertex interval log, a top page is maintained in the buffer. When a new update is sent to the multi-log unit, the top page of the vertex interval that the update is bound for is first identified. As updates are simply appended to the log, an update that can fit in the available space on the top page is written into it. If there is not enough space on the top page, then a new page is allocated by the multi-log unit, and that new page becomes the top page for that vertex interval log. MultiLogVC maintains a simple mapping table indexed by the vertex interval to identify the top page.

When the available free space in the multi-log buffer falls below a certain threshold, some log pages are evicted from main memory to the SSD. When a log page is evicted from memory, it is appended to the corresponding vertex interval log file. Due to the highly concurrent processing of vertices, multiple vertex logs may receive updates and multiple log page evictions may occur concurrently. To maximize log writeback bandwidth, MultiLogVC spans multiple logs across all available SSD channels. Further, each log is interspersed across multiple channels to maximize the read bandwidth when that log file is read back later. We exploit the programmability of the SSD driver in our experimental setup to maximize the write and read bandwidths. As we make page-granular evictions, most of the SSD bandwidth can be utilized. Even when simultaneously writing to multiple logs, the writes can be parallelized and pipelined due to the interspersing of logs across many channels. Furthermore, since we only need to buffer about 1,000 SSD pages (each corresponding to one vertex interval log), the host-side buffer that stores the updates for the next iteration is also small (tens to hundreds of MBs). The reduced main memory usage with multiple logs enables us to allocate a significant fraction of the total memory for fetching, sorting, and grouping the updates for processing the current superstep, rather than for storing the updates for the next superstep.

3.5.2 Sort and Group Unit

The sort and group unit is responsible for retrieving the updates from the logs and sending them for processing. Its functionality is automatically activated at the start of each superstep within the MultiLogVC runtime. The first API call is the LoadLog() function, which retrieves the log associated with a given vertex interval. The loading process maximizes the SSD read bandwidth since each log is dispersed across all the available flash channels in the SSD. After loading a log, if the size of the log is smaller than the memory allocated for sorting, the next vertex interval log is automatically fetched. Recall that each vertex interval log maintains a counter which provides a first-order approximation of the log size in that interval. Hence, the loading process uses this approximate size to determine whether there is enough available memory to fetch the next interval log. The process continues until the memory allocated for sorting is full. The fused logs in memory are then sorted based on the v_dest field of each log update.
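The load, fuse, and sort path just described can be summarized by the following sketch. LoadLog() and EstimatedLogBytes() stand in for the real runtime calls and the per-interval update counters; the update record layout is also simplified, so this is an illustration rather than the MultiLogVC source.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Update { uint32_t vDest; float data; };   // simplified <v_dest, m> log record

// Stand-ins for the runtime hooks (dummy bodies so the sketch is self-contained).
static std::vector<Update> LoadLog(uint32_t interval) {
    return { {interval * 2, 1.0f}, {interval * 2 + 1, 2.0f} };
}
static size_t EstimatedLogBytes(uint32_t interval) { return 2 * sizeof(Update); }

// Fuse consecutive interval logs while they fit in the sort budget, then sort by v_dest
// so that all updates bound to one vertex become contiguous in memory.
std::vector<Update> loadFuseAndSort(uint32_t firstInterval, uint32_t numIntervals,
                                    size_t sortBudgetBytes, uint32_t* nextInterval) {
    std::vector<Update> fused;
    size_t bytes = 0;
    uint32_t i = firstInterval;
    while (i < numIntervals && bytes + EstimatedLogBytes(i) <= sortBudgetBytes) {
        std::vector<Update> log = LoadLog(i);       // each interval log is striped across SSD channels
        fused.insert(fused.end(), log.begin(), log.end());
        bytes += log.size() * sizeof(Update);
        ++i;
    }
    *nextInterval = i;                              // resume point for the next batch of intervals
    std::stable_sort(fused.begin(), fused.end(),
                     [](const Update& a, const Update& b) { return a.vDest < b.vDest; });
    return fused;
}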
3.5.2.1 SSD pre-sorting capability

When the top page of the next iteration's log is being written back to the SSD, the SSD controller can optionally pre-sort each page as it writes the page to the SSD. Since the SSD controller is relatively wimpy, it may not be possible to pre-sort a page when many pages are being written back concurrently. In this case, the SSD controller can skip the pre-sort and directly write the page back to flash. To identify a page that is sorted, a single metadata bit in the page indicates whether it was pre-sorted during the writing process.

Note that page-level pre-sorting does not eliminate the need for sorting a vertex interval log. The updates bound to a given vertex v_dest are dispersed throughout the log associated with that vertex interval. Hence, even if each page in the interval is pre-sorted, it is necessary to sort across pages. However, pre-sorting can reduce the cost of sorting when reading back the log. The SortNGrpLog() function in the MultiLogVC runtime is responsible for this sorting process.

3.5.2.2 Active vertex extraction

Once the sorting process is complete, every vertex v_dest that has at least one incoming message becomes an active vertex. Hence, the list of destination vertices from each log is extracted to update the active vertex set A using the ExtractActiveVert() API. The next superstep then only needs to process the active vertices in A by calling the ProcessVertex() function. To process a vertex, any graph algorithm needs the adjacency list (typically the outgoing edges) of that vertex.

3.5.2.3 Graph Loader Unit

MultiLogVC uses the CSR format to store graphs since CSR is more efficient for loading a collection of active vertices. A graph loader unit is responsible for loading the graph data of the vertices present in the active vertex list. The graph loader unit maintains a row buffer for loading the row pointers and a buffer for each kind of vertex data (adjacency edge lists/weights). It loops over the row pointer array for the range of vertices in the active vertex list, each time fetching as many vertices as can fit in the row pointer buffer. For the vertices in the row pointer buffer that are active, the vertex data required by the application, such as out-edges or in-edges, is fetched from the colIdx or val vectors stored in the SSD, accessing only the SSD pages that contain active vertex data.
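As a rough illustration of this selective access (the constants and helper names are ours, not from the implementation), the following sketch converts the colIdx ranges of the active vertices into the set of 16 KB pages that actually need to be read; active vertices whose edge runs share a page naturally share a single read.

#include <cstdint>
#include <set>
#include <vector>

constexpr uint64_t kPageSize = 16 * 1024;            // SSD page granularity assumed in this sketch
constexpr uint64_t kEdgeBytes = sizeof(uint32_t);    // one colIdx entry per edge

// Given rowPtr and the active vertex list, collect the distinct colIdx pages to fetch.
std::set<uint64_t> pagesForActiveVertices(const std::vector<uint64_t>& rowPtr,
                                          const std::vector<uint32_t>& activeVertices) {
    std::set<uint64_t> pages;
    for (uint32_t v : activeVertices) {
        uint64_t firstByte = rowPtr[v] * kEdgeBytes;
        uint64_t endByte = rowPtr[v + 1] * kEdgeBytes;       // exclusive end of v's edge run
        if (firstByte == endByte) continue;                  // vertex with no out-edges
        for (uint64_t p = firstByte / kPageSize; p <= (endByte - 1) / kPageSize; ++p)
            pages.insert(p);                                 // only pages holding active-vertex edges
    }
    return pages;                                            // one page-granular read per entry
}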
3.5.3 Edge-log optimizer

The last component of MultiLogVC is the edge-log optimizer module. As we quantified earlier, only a small fraction of a graph page is typically needed when fetching the outgoing edges of an active vertex. To reduce the read amplification for such pages, MultiLogVC relies on the edge-log optimizer, which works as follows. Starting from the first superstep, the edge-log optimizer monitors each vertex v_i that is being processed. Recall that while processing a vertex v_i, all its outgoing edges out_i are fetched. First, the edge-log optimizer predicts whether v_i is likely to be active in the next superstep. An inactive vertex will be activated in the next superstep if it receives a message on one of its incoming edges. If a vertex has not received any incoming message yet, the edge-log optimizer has to predict the likelihood that v_i will be active, essentially predicting whether v_i will receive messages on its incoming edges.

To predict an active vertex, the edge-log optimizer uses the past history of active vertices, which is maintained using active vertex bit vectors. If the current vertex was active at least once in the past N supersteps, it is predicted to be active. More complex prediction schemes were considered, but this simple history-based prediction with N equal to one proved to be effective.

Once a vertex is predicted to be active, the second step is to determine whether its outgoing edges are located on a page with very few other active vertices. Since the active vertices are added dynamically, it is not feasible to accurately determine the page usage efficiency. Hence, the edge-log optimizer predicts the page usage efficiency for the next superstep based on the current superstep's usage efficiency. Pages whose utilization is below a threshold in the current superstep are predicted to be inefficiently used pages.

Finally, the edge-log optimizer logs all the outgoing edges out_i of a predicted active vertex with inefficient page usage into an edge log, indexed by v_i. The edge log appends the out-edges of active vertices into sequential page locations. Even if a few predictions are incorrect, most of the page data contains out-edges of active vertices, thereby improving read efficiency. In the next superstep, instead of accessing the out-edges from the graph, they can be fetched from the edge log. In Figure 3.4, this log buffer is set to occupy B% of the total memory, which is set to 5% as the default value.

Note that unlike message logs, edge logs essentially replicate some of the original graph data. However, by selecting the thresholds appropriately, graph data replication can be limited. In our implementation, we chose a threshold of 10% for determining whether a page is efficiently used. With such a low threshold, the critical observation is that when logging the outgoing edges of N active vertices into a single edge-log page, one can reduce the number of page reads from the original graph by N-1.
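A minimal sketch of the history-based prediction and the page-usage test described above is given below; the threshold, field names, and structure are illustrative rather than the exact implementation.

#include <cstdint>
#include <vector>

struct EdgeLogPredictor {
    std::vector<bool> activeLastSuperstep;    // active-vertex bit vector from the previous superstep
    std::vector<double> pageUtilization;      // fraction of each edge-list page used in the current superstep
    double inefficientThreshold = 0.10;       // pages below 10% utilization are treated as inefficient

    // History-based prediction with N = 1: active in the previous superstep implies likely active again.
    bool predictActive(uint32_t v) const { return activeLastSuperstep[v]; }

    // Log v's out-edges only if it is predicted active AND its edges sit on an
    // inefficiently used page, so the edge log captures the high-payoff cases.
    bool shouldLogEdges(uint32_t v, uint64_t pageOfVertex) const {
        double u = pageUtilization[pageOfVertex];
        return predictActive(v) && u > 0.0 && u < inefficientThreshold;
    }
};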
3.5.4 Supporting the generality of vertex-centric programming

As described in the background section, one of the salient features of MultiLogVC is its ability to support the full spectrum of vertex-centric programming applications. By default, all the messages in the multi-logs are preserved as is, and by using vertex-interval-based multi-logs, we can perform in-memory sorting of each of these logs.

However, some graph algorithms support the associative and commutative property on updates. Hence, their updates may be merged into a single update message. MultiLogVC provides an optimization path for such algorithms. For these algorithms, the application developer provides a combine operator, which may reduce the updates bound to a destination vertex, along with the vertex processing function. The combine operator is applied to all the updates to a target vertex in a superstep before the target vertex's processing function is called. For example, Algorithm 4 shows how the combine function is specified for the pagerank application. When a combine function is defined, the sort and group unit can optimize performance automatically by performing the reduction transparently to the user.

Algorithm 4 Code snippet of pagerank, an associative and commutative program
1: function COMBINE(VertexValue val, update m)
2:   val.change += m.data.change
3:   if is_set(m.data.activate) then
4:     activate(m.target_id)
5: function PROCESSVERTEX(VertexId V_act, VertexData V_inf, VertexUpdates V_actUpdates)
6:   for each edge in V_inf.edges() do
7:     v_dest = edge.id()
8:     m.data = V_inf.val.change / V_inf.num_edges()
9:     if V_inf.val.change > Threshold then
10:      m.data.activate = 1
11:    SendUpdate(v_dest, m)
12:  val.page_rank = (1 - d) + d * change     // d: damping factor
13:  val.change = 0
14:  V_inf.set_value(val)
15:  deactivate(V_act)

3.5.5 Graph structural updates

In vertex-centric programming, some graph algorithms may update the graph structure during the supersteps. Graph structure updates generated in a superstep can be applied at the end of that superstep. In the CSR format, merging the graph structural updates into the column index or value vectors is a costly operation, as one needs to re-shuffle the entire column vectors. To minimize this costly merging operation, we partition the CSR-format graph based on the vertex intervals, so that each vertex interval's graph data is stored separately in CSR format.

Instead of merging each update directly into the vertex interval's graph data, we batch several structural updates for a vertex interval and merge them into the graph data only after a certain threshold number of structural updates. As graph structural updates generated during vertex processing can be targeted to any vertex, we buffer each vertex interval's structural updates in memory. The graph loader unit always consults these buffered updates to fetch the most current graph data for accurate processing.

3.5.6 Programming model

We keep the programming model consistent with any vertex-centric programming framework. For each vertex, a vertex processing function is provided; the main logic for a vertex is written in this function. In the current MultiLogVC implementation the processing function receives the vertex id along with the vertex data (such as the vertex value), the list of incoming messages for that vertex, and the vertex adjacency information (all our applications need the outgoing edges). Each vertex processes the incoming messages, sends updates to other vertices, and, in some applications, mutates the graph. Communication between the vertices is implemented using the SendUpdate() function, which automatically invokes the multi-log update unit transparently to the application developer. A vertex can also indicate in the vertex processing function that it wants to be deactivated. If a vertex is deactivated, it will be re-activated automatically when it receives an update from any other vertex. By using multiple logs associated with each vertex interval, MultiLogVC preserves the messages while still enabling in-memory sorting of messages by the sort and group unit. Thus MultiLogVC supports the generality of the vertex-centric programming model, while reducing the sorting and very large log management overheads.

As described earlier, MultiLogVC provides an optional optimization path if the application developer wants to perform a reduction operation on the incoming messages. For the synchronous computation model, updates will be delivered to the target vertex by the start of its vertex processing in the next superstep. In the asynchronous computation model, the latest updates from the source vertices will be delivered to the target vertices, which can be either from the current superstep or the previous one. Vertices can modify the graph structure, but MultiLogVC requires that these graph modifications be finished by the start of the next superstep (which is also a restriction placed on most vertex-centric programming models).
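For concreteness, the vertex-centric interface described in this section could be expressed roughly as the following C++ sketch; the class, method, and field names are placeholders and may differ from the actual MultiLogVC implementation.

#include <cstdint>
#include <vector>

struct Update { uint32_t source_id; uint32_t target_id; float data; };
struct VertexData { float value; std::vector<uint32_t> out_edges; };   // simplified vertex record

// Runtime services available inside ProcessVertex() (dummy bodies for illustration).
inline void SendUpdate(uint32_t v_dest, const Update& m) { /* append <v_dest, m> to its interval log */ }
inline void deactivate(uint32_t v) { /* vertex sleeps until it receives an update */ }

// User-supplied vertex program; Combine() is the optional reduction hook that is legal
// only for associative and commutative algorithms such as pagerank.
class VertexProgram {
public:
    virtual void ProcessVertex(uint32_t v_act, VertexData& v_inf,
                               const std::vector<Update>& updates) = 0;
    virtual void Combine(VertexData& v_inf, const Update& m) {}
    virtual ~VertexProgram() = default;
};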
3.6 System design and Implementation

We implemented the MultiLogVC system as a graph analytics runtime on an Intel i7-4790 CPU running at 4 GHz with 16 GB of DDR3 DRAM. We use a 2 TB 860 EVO SSD [98]. We use the Ubuntu 14.04 operating system running Linux kernel version 3.19. To simultaneously load pages from several non-contiguous locations in the SSD using minimal host-side resources, we use asynchronous kernel IO. To match the SSD page size and load data efficiently, we perform all IO accesses at a page granularity of 16 KB, a typical SSD page size [28]. Note that the load granularity can easily be increased to keep up with future SSD configurations; the SSD page size may keep growing to accommodate higher capacities and IO speeds, as SSD vendors pack more bits into a cell to increase density, which leads to larger SSD page sizes [27].

We used OpenMP to parallelize the code for running on multiple cores. We use an 8-byte data type for the rowPtr vector and 4 bytes for the vertex id. Locks were used sparingly, as necessary, to synchronize between the threads. With our implementation, our system can achieve 80% of the peak bandwidth between the storage and the host system.
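As a rough illustration of the page-granular asynchronous I/O path mentioned above, the sketch below issues several 16 KB reads with Linux kernel asynchronous I/O (libaio). The file name and offsets are placeholders and error handling is omitted; this is not the MultiLogVC source.

#define _GNU_SOURCE 1        // for O_DIRECT
#include <fcntl.h>
#include <libaio.h>
#include <unistd.h>
#include <cstdlib>
#include <vector>

constexpr size_t kPageSize = 16 * 1024;              // matches the 16 KB access granularity

int main() {
    int fd = open("graph.colIdx", O_RDONLY | O_DIRECT);      // hypothetical CSR file
    io_context_t ctx = 0;
    io_setup(128, &ctx);                                      // allow up to 128 in-flight requests

    std::vector<long long> offsets = {0, 5 * (long long)kPageSize, 42 * (long long)kPageSize};
    std::vector<iocb> cbs(offsets.size());
    std::vector<iocb*> cbPtrs;
    std::vector<void*> bufs;

    for (size_t i = 0; i < offsets.size(); ++i) {
        void* buf = nullptr;
        posix_memalign(&buf, kPageSize, kPageSize);           // O_DIRECT needs aligned buffers
        io_prep_pread(&cbs[i], fd, buf, kPageSize, offsets[i]);
        cbPtrs.push_back(&cbs[i]);
        bufs.push_back(buf);
    }
    io_submit(ctx, cbPtrs.size(), cbPtrs.data());             // issue all page reads at once

    std::vector<io_event> events(offsets.size());
    io_getevents(ctx, events.size(), events.size(), events.data(), nullptr);  // wait for completions

    for (void* b : bufs) free(b);                             // consume data, then release buffers
    io_destroy(ctx);
    close(fd);
    return 0;
}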
Baseline: We compare our results with the popular out-of-core GraphChi framework. While several recent works have optimized GraphChi [5, 110, 121], most of them focus on optimizing algorithms that satisfy the commutative and associative property on their updates. Hence, these approaches trade off the generality of GraphChi for improved performance. Since MultiLogVC strives to preserve this generality, we believe GraphChi is the most appropriate comparison framework. On the log-based execution front, we compare our approach with GraFBoost [43]. GraFBoost uses a single log to maintain all updates and exploits the commutative and associative property of graph algorithms to merge the updates and shorten the log. The merging process enables it to perform a relatively efficient out-of-memory sort. Due to this limitation of GraFBoost, we can only compare against it when the algorithms satisfy GraFBoost's constraints.

While comparing with GraphChi, we use the same host-side memory cache size as the size of the multi-log buffer used in MultiLogVC. In both our implementation and GraphChi's implementation, we limit the memory usage to 1 GB, primarily because the real-world graph datasets that are available are at most 100 GB, and this approach has been used in many prior works to emulate a realistic memory-to-graph size ratio [5, 71, 121]. In our implementation, we define the memory usage by limiting the total size of the multi-log buffer. GraphChi provides an option to specify the memory budget that it can use. We maximized GraphChi performance by enabling the multiple auxiliary threads that GraphChi may launch. As such, GraphChi also achieves peak storage access bandwidth.

Graph dataset: To evaluate the performance of MultiLogVC, we selected two real-world datasets, one from the popular SNAP collection [57] and a popular web graph from the Yahoo Webscope dataset [117]. These graphs are undirected, and for each edge, each of its end vertices appears in the neighbor list of the other end vertex. Table 3.1 shows the number of vertices and edges for these graphs.

Table 3.1: Graph dataset
Dataset name         | Number of vertices | Number of edges
com-friendster (CF)  | 124,836,180        | 3,612,134,270
YahooWebScope (YWS)  | 1,413,511,394      | 12,869,122,070

3.7 Applications

To illustrate the benefits of our framework, we evaluate two classes of graph applications. The first set of algorithms allows updates to be merged with no impact on correctness; these are suitable workloads for GraFBoost. The second set requires all the updates to be handled individually, and can therefore only be evaluated on GraphChi and MultiLogVC.

Merging updates acceptable: BFS and Pagerank (PR) [82]. GraFBoost works well with these algorithms.

Merging updates not possible: Community detection (CD) [87], Graph coloring (GC) [30], Maximal independent set (MIS) [90], Random walk (RW) [53], and K-core [86].

All the applications are implemented using the details presented in the references. Additional application details: a vertex in pagerank gets activated if it receives a delta update greater than a certain threshold value (0.4). For random walk, we sampled every 1000th node as a source node and performed a random walk for 10 iterations with a maximum step size of 10. Due to the extremely high computational load, for all the applications we ran 15 supersteps, or fewer if the problem converged earlier. Many prior graph analytics systems also evaluate their approach by limiting the superstep count [43]. All the applications mentioned above can be implemented using GraphChi and MultiLogVC. However, GraFBoost can only work for associative and commutative applications (pagerank and BFS).

3.8 Experimental evaluation

Figure 3.5a shows the performance comparison of the BFS application on our MultiLogVC and GraphChi frameworks. The X-axis shows the selection of a target node that is reachable from a given source by traversing a fraction of the total graph; an X-axis value of 0.1 means that the selected source-target pair in BFS requires traversing 10% of the total graph before the target is reached. We ran BFS with different traversal demands. The Y-axis indicates speedup, which is the application execution time on GraphChi divided by the application execution time on the MultiLogVC framework.

On average, BFS performs 17.8x better on MultiLogVC than on GraphChi. The performance benefits come from the fact that MultiLogVC accesses only the required graph pages from storage. BFS has a unique access pattern: initially, the search starts from the source node and keeps widening. Consequently, the size of the graph accessed, and correspondingly the update log size, grows after each superstep. As such, the performance advantage of MultiLogVC is much higher in the initial supersteps and then shrinks in later ones.

Figure 3.5b shows the ratio of page accesses in GraphChi divided by the page accesses in MultiLogVC. GraphChi loads nearly 90x more data when using 0.1 (10%) traversals. However, as the traversal demand increases, GraphChi loads only 6x more pages. As such, the performance improvements seen in BFS are much higher with MultiLogVC when only a small fraction of the graph needs to be traversed. Figure 3.5c shows the distribution of the total execution time split between storage access time (the load time to fetch all the active vertices) and the compute time to process these vertices.
The data shows that when only a small fraction of the graph must be traversed, the storage access time is about 75%; however, as the traversal depth increases, the storage access time reaches nearly 90% even with MultiLogVC. Note that with GraphChi the storage access time stays nearly constant at over 95% of the total execution time.

Figure 3.5: BFS application performance. (a) BFS speedup relative to GraphChi; (b) page access count ratio relative to MultiLogVC; (c) execution time breakdown.

Figure 3.6: Application performance relative to GraphChi for (a) pagerank, (b) FLP, (c) GC, (d) MIS, and (e) RW.

Figure 3.6a shows the performance comparison for pagerank. On average, pagerank performs 1.2x better with MultiLogVC. Unlike BFS, pagerank has the opposite traversal pattern. In the early supersteps, many of the vertices are active and many updates are generated. But during later supersteps, the number of active vertices reduces and MultiLogVC performs better when compared to GraphChi. Figure 3.7a shows the performance of MultiLogVC compared to GraphChi over several supersteps. Here the X-axis shows the superstep number as a fraction of the total executed supersteps. During the first half of the supersteps MultiLogVC has similar performance to GraphChi, or in the case of the YWS dataset slightly worse, because the size of the generated log is large. But as the supersteps progress and the update size decreases, the performance of MultiLogVC improves.

Figure 3.7: Application performance comparisons over supersteps for (a) pagerank, (b) FLP, (c) GC, (d) MIS, and (e) RW.

Figure 3.6b shows the performance of community detection (FLP), which is similar in behavior to graph coloring (Figure 3.6c). On average, community detection performs 1.7x better on MultiLogVC than on GraphChi. Figure 3.7b shows the per-superstep performance. Similar to the pagerank application, initially many of the vertices are active, and in later supersteps fewer vertices are active. In the community detection (and graph coloring) application, target vertices receive updates from source vertices over in-edges. Active vertices also access in-edge weights and store the updates received from source vertices, so that a vertex sends its label only if it has changed. As a result, in the MultiLogVC framework the community detection application has to access both updates and edge weights from storage, whereas in GraphChi, since updates are passed to the target vertices via edge weights, only the edge weights need to be accessed from storage. In later supersteps, MultiLogVC sees few active vertices, and even with the need to access updates and edge weights separately, it outperforms GraphChi.

Figure 3.6d shows the performance of the maximal independent set algorithm. On average, maximal independent set performs 3.2x better on MultiLogVC when compared to GraphChi. In this algorithm as well, since vertices are selected with a probability, there are fewer active vertices in a superstep. Figure 3.7d shows the performance over supersteps.

Figure 3.6e shows the performance comparison of the random walk application on the MultiLogVC framework against the GraphChi framework. On average, random walk is 6x faster than GraphChi. This algorithm follows the same pattern as BFS: initially, its active vertex subset is small, and gradually it expands and eventually tapers off.
Hence, MultiLogVC shows significant improvements in the earlier supersteps, and gradually the performance delta also tapers off.

GraFBoost comparison: Since GraFBoost only works on applications where updates can be merged into a single value, we use pagerank, which satisfies GraFBoost's limitation. Figure 3.8a presents the performance. The Y-axis illustrates the MultiLogVC performance improvement relative to GraFBoost. To have a fair comparison, both systems use the same configuration with 1 GB of memory; for GraFBoost, memory usage is limited using cgroups. Also, since GraFBoost currently does not support loading only active graph data, the comparison is based only on the first iteration. On average, MultiLogVC is 2.8x faster than GraFBoost. As can be observed from this figure, the Yahoo Web dataset, which is the larger dataset, shows a significant performance improvement (4x faster than GraFBoost). The reason is that as the dataset grows, the log file size also grows, so the cost of sorting large logs that do not fit in memory dominates in GraFBoost.

Figure 3.8: Performance comparison. (a) GraFBoost performance; (b) performance of the K-core algorithm.

Adapting GraFBoost for applications with non-mergeable updates: We compare the performance of the graph coloring application by adapting GraFBoost's single log structure for passing the update messages. Since we cannot merge the updates generated for a target vertex into a single value, we need to keep all the updates and sort them. For the graph coloring application, when compared to this adapted GraFBoost, MultiLogVC performs 2.72x and 2.67x better for CF and YWS, respectively.

MultiLogVC uses the CSR format, which enables accessing only active vertices, but graph updates are costly. Using multiple intervals as separate CSR structures certainly helps reduce the update cost of the CSR format. Figure 3.8b shows the performance of K-core, which is the most demanding application for MultiLogVC. K-core uses delete operations that modify the graph. Furthermore, K-core has only one iteration, and all the vertices are active in that iteration. GraphChi can directly update the delete bit in the outgoing edge's shard, whereas in MultiLogVC we log the structural update and later merge it into the graph. Even in this pathological case, MultiLogVC achieves about 60% of the GraphChi performance. However, for other structural update operations, such as adding an edge or a vertex, GraphChi and MultiLogVC tackle the structural updates in a similar fashion; namely, they both buffer the updates and merge them after a threshold.

Edge-log optimizer prediction accuracy: Figure 3.9 shows the percentage of inefficiently used graph pages (i.e., pages with >0% and <10% of their content utilized) that we predicted correctly. On average, our scheme predicts 34% of the inefficiently used pages. In the FLP and GC applications, which converge faster, the number of inefficient pages is smaller (shown in Figure 3.3) and concomitantly our history-based prediction model predicts with less accuracy. However, for the other applications, which have more inefficient pages over several iterations, our prediction accuracy is higher.

Figure 3.9: Percentage of inefficient pages predicted

Memory scalability: To study scalability, we conducted experiments with main memory increased to 4 GB and 8 GB. Figure 3.10 shows the performance of MultiLogVC over GraphChi for the MIS application.
As the memory size increases, the relative improvement of MultiLogVC over GraphChi stays about the same, with roughly a 5-10% increase in MultiLogVC's performance when using the larger memory.

Figure 3.10: Memory scalability

3.9 Related work

Graph analytics systems are widely deployed in important domains such as drug processing and social networking; concomitantly, there has been a considerable amount of research on graph analytics systems [30, 61, 77].

For vertex-centric programs with associative and commutative functions, where one can use a combine function to accumulate the vertex updates into a vertex value, GraFBoost implements an external memory graph analytics system [43]. It logs updates to storage and passes vertex updates in a superstep. It uses a sort-reduce technique for efficiently combining the updates and applying them to vertex values. However, it may access storage multiple times for the updates, as it sort-reduces on a single giant log. In our system, we keep multiple logs for the updates in storage, and in a superstep we access both the graph pages and the update pages once, corresponding to the active vertex list. Also, our MultiLogVC system supports the complete vertex-centric programming model rather than just associative and commutative combine functions, which has better expressiveness and makes computation patterns intuitive for many problems [8].

A recent work [26] extends GraFBoost. It avoids sorting by partitioning the destination vertices such that they fit in main memory and by replicating the source vertices across the partitions, so that when processing a partition, all the source vertices' graph data can be streamed in and updates to destination vertices can be performed in main memory itself. However, one may need to replicate the source vertices and access edge data multiple times. Supporting the complete vertex-centric programming model with this scheme may be prohibitively expensive: as the number of partitions may be high, the replication cost will also be high.

X-Stream [89] and GridGraph [125] are edge-centric external memory systems which aim to sequentially access the graph data stored in secondary storage. Edge-centric systems provide good performance for programs which require streaming in all the edge data and performing vertex value updates based on it. However, they are inefficient for programs which require sparse accesses to graph data, such as BFS, or programs which require access to adjacency lists of specific vertices, such as random walk.

GraphChi [54] is the only external-memory vertex-centric programming system that supports more than associative and commutative combine programs. In this work, we compare with GraphChi as a baseline and show considerable performance improvements. There are several works which extend GraphChi by trying to use all of a loaded shard or by minimizing the data to load in a shard [4, 110]. In this work, however, we avoid loading data in bulky shards in the first place and access only the graph pages for the active vertices in the superstep.

Semi-external memory systems such as FlashGraph [123] and GraphMP [100] store the vertex data in main memory and achieve high performance. When processing on a low-cost system whose available main memory is smaller than the vertex value data, these systems suffer from performance degradation due to fine-grained accesses to the vertex-value vector.

Due to the popularity of graph processing, such frameworks have been developed in a wide variety of system settings.
In the distributed computing setting, there are popular vertex-centric programming based graph analytics systems including Pegasus [45], Pregel [66], GraphLab [61], Blogel [118], PowerGraph [102], Gemini [124], Kickstarter [108], and Coral [109]. In the single-node in-memory setting there are many frameworks such as TurboGraph [36], Venus [16], Graphene [59], Mosaic [63], and GraphM [122]. Ligra [96] provides a framework optimized for multi-core processing. Lumos is an out-of-core graph processing engine that provides a synchronous processing guarantee by using cross-iteration value propagation [107]. Polymer [120] is a NUMA-aware multicore engine that is designed to optimize remote accesses. Some engines such as Galois [78] support both single-machine and distributed systems.

3.10 Chapter Summary

Graph analytics is at the heart of a broad set of applications. In external-memory based graph processing systems, accessing storage becomes the bottleneck. However, existing graph processing systems try to optimize random-access reads to storage at the cost of loading many inactive vertices in a graph. In this work, we use the CSR format for graphs, which is more amenable to selectively loading only active vertices in each superstep of graph processing. However, the CSR format leads to random accesses to the graph during the update process. We solve this challenge by using a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Further, while accessing SSD pages with little active vertex data, we reduce the read amplification due to the page-granular accesses in SSD by logging the active vertex data in the current iteration and efficiently reading the log in the next iteration.

Over the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the MultiLogVC framework improves the performance by up to 17.84×, 1.19×, 1.65×, 1.38×, 3.15×, and 6.00× for the widely used breadth-first search, pagerank, community detection, graph coloring, maximal independent set, and random-walk applications, respectively.

Chapter 4

Summarizer: Trading Communication with Computing Near Storage

4.1 Chapter Overview

In the previous two chapters we presented how graph applications can be accelerated on current generation storage systems built on SSDs. Based on the insights gained from the prior two chapters, we present a more generalized approach to accelerating data-intensive applications beyond graph processing algorithms.

Processing large volumes of data is the backbone of many application domains beyond graphs, such as data analytics and data integration. Just as in graph analytics, the cost of transferring data from storage to compute nodes starts to dominate the overall application performance in many big data applications. Applications can spend more than half of the execution time in bringing data from storage to the CPU [105]. In magnetic disk storage systems the medium access time, such as seek and rotational latencies, dominates the data access time. However, with the rapid adoption of solid-state non-volatile storage technologies the performance bottleneck shifts from medium access time to operating system overheads and interconnection bandwidth. As such, the prevalent computational model, which assumes that storage medium access latency is an unavoidable cost, must be rethought in the context of solid-state storage.
In particular, the computing model must adapt to the realities of bandwidth and OS overheads that dominate storage access. As described in the introduction chapter, modern solid state drives (SSDs) for data centers integrate large DRAM buffers as well as multiple general-purpose embedded cores. It is observed that these cores are often under-utilized, and hence can be repur- posed to move computation closer to data. Some prior works already demonstrated the potential of using near-data processing models to offload data analytics, SQL queries, operating system functions, image processing, and MapReduce operations [12, 14, 22, 42, 46, 88, 93, 95, 103, 114] to data storage devices. Many of these prior works assume the presence of a commodity general purpose processor near storage. However, even high-end data center SSDs are equipped with wimpy embedded cores. Apart from the cost and power consumption constraints, one reason for the inclusion of only wimpy embedded cores is that they already provide sufficient computing capability to handle most of the current SSD operations, such as protocol management, I/O scheduling, flash translation and wear leveling. For instance, recent SSD controllers that support NVMe protocol and high bandwidth PCIe interconnects may partially utilize only three low-end ARM Cortex R5 cores to support firmware operations [68, 91]. These cores are utilized up to 30% even in the worst case. Hence while there is slack in utilizing the embedded cores, it is not practical to offload large computation kernels to these wimpy cores. 111 With wimpy embedded cores the choice of computing near storage must be care- fully managed. Sometimes it can be advantageous to move computation to the em- bedded cores to reduce host-storage communication latency by alleviating heavy traffic and bandwidth demands. On the other hand, with wimpy embedded cores in-storage computation requires much longer processing time compared to computing on host pro- cessors. Thus there is a trade-off between the computation overhead suffered by wimpy cores and the reduction in communication latency to transfer data to the host system. In this chapter we propose Summarizer, a near storage computing paradigm that uses the wimpy near-storage computing power opportunistically whenever it is beneficial to offload computation. Summarizer automatically balances the communication delay with the computing limitations of near-storage processors. This chapter makes the following contributions: 1. This work proposes Summarizer — an architecture and computing model that al- lows applications to make use of wimpy SSD processors for filtering and summariz- ing data stored in SSD before transferring the data to the host. Summarizer reduces the amount of data moved to the host processor and also allows the host proces- sor to compute on filtered/summarized result thereby improving the overall system performance. 2. A prototype Summarizer system is implemented on a custom-built flash storage sys- tem that resembles existing SSD architectures but also enables fine grain computa- tional offloading between the storage processor and host. We enhanced the standard NVMe interface commands to implement the functionality of Summarizer, without 112 changing the NVMe compatibility. Using this prototype, we demonstrate the ben- efits of collaborative computing between the host and embedded storage processor on the board. 3. We evaluated the trade-offs involved in communication versus computation near storage. 
Considering several ratios of internal SSD bandwidth and the host to SSD bandwidth, ratio of host computation power and SSD computation power, we per- form design space exploration to illustrate the trade-offs. 4. Summarizer dynamically monitors the amount of workload at the SSD processor and selects the appropriate work division strategy among the host processor and the SSD processor. Summarizer’s work division approach quantifies the potential of using both the host processor and SSD processor in tandem to get better perfor- mance. The rest of this chapter is organized as follows: Section 4.2 provides a brief back- ground and motivation for Summarizer. Section 4.3 introduces the architecture and the implementation of Summarizer. Section 4.5 describes the applications used in this chap- ter. Section 4.6 describes our methodology and implementation details. Section 4.7 presents the experimental setup and our results. Section 4.8 provides a summary of re- lated work to put this project in context, and Section 4.9 concludes the chapter. 113 4.2 Background and motivation Summarizer relies on the processing capabilities embedded in modern SSDs and the NVMe (Non V olatile Memory Express) commands used by the host to interface with SSDs. A brief overview about SSD architectures, and the NVMe protocol that the host uses to communicate with the SSD can be found in subsection 1.1.1. Here we will provide a brief overview about the processing potential near modern SSDs. 4.2.1 Potential of in-SSD computing While SSDs provision multiple embedded processors to improve the performance of SSD controller functions, such as FTL management and garbage collection, much of the compute power in the controller remains under utilized. These underutilized embedded cores provide opportunities for offloading computation from the application. In partic- ular, as we look into the future the processing power of even the embedded cores will continue to grow, even though there is likely to be a large gap in the computing capability of a host processor and the embedded core within each SSD. This section will analyze the potential of using these embedded processors to accelerate applications. Note that even though the embedded cores are underutilized the primary rationale for provisioning multiple cores is still performance. The firmware can partition the code to enable concurrent execution of different operations such as parsing commands, look- ing up addresses, locking addresses that are being accessed, and interfacing with flash memory chips. The firmware code also uses several cores to perform garbage collection 114 and wear-leveling in the background. In our lab evaluations concurrently performing 4096 operations in a data center class SSD, the average utilization of these SSD pro- cessors was always lower than 30% and there was always at least one processor in idle state. The under utilization remains the same even when performing garbage collection or wear-leveling. While there is plenty of concurrency in FTL operations, each of the concurrent operation itself is relatively simple. Hence, while multiple cores are useful to exploit concurrency, each core’s utilization during any given operation remains low. Such under utilization reveals the potential for using existing hardware in these data center-scale SSDs for computation. For example, we can add one more stage to the flash data access pipeline to perform low compute intensity operation without impacting the SSD controller performance. 
As long as the additional stage is shorter than the current critical operation, the SSD will not suffer significant degradation in throughput. Among the steps in the flash data access pipeline, it is well known that the most time consuming step is accessing the flash medium. Even for the fastest flash operation, read operation, a high-performance single-level cell flash memory chip still takes more than 20 micro-seconds to complete the operation [73]. This delay is even longer for multi-level flash cells. Assuming that the flash interface can access b flash channels simultaneously and the accessed data are evenly distributed among all channels, and if each read operation takest seconds to complete, the “slack” that allows the computation stage to finish without hurting the throughput is t b . This slack grows even more when using multi-core processors. If each flash page containsp bytes of data, each embedded processor can processn instructions each second, we can obtain the per-byte operation on the file data without affecting the throughput will be tn bp when we spare one core 115 for in-storage computing. However, if the SSD is provisioned with more cores or if the number of parallel operations do not reach the peak performance, we may expect much larger slack than this first order estimate. For example, if we can spare m cores for computation and perfectly parallelize the computation, the operation we can perform on each byte without affecting throughput can be close to tnm bp . For the MEX SSD controller that Samsung SSDs (such as the EVO range) typically use, each processor core can execute410 8 instructions per second. If we use this pro- cessor in an SSD with 32 banks and 8KB flash pages, we find that the per-byte operation must be restricted to only 1 instruction execution per-byte, or 4 operations per 4-byte word. 4.3 Summarizer As detailed in the section 4.2.1, there is sufficient compute capacity within the SSD controller to enable modest computation near storage. As SSDs integrate more power- ful computing resources then the complexity of near-storage computing can also scale. In this section we describe Summarizer which is our proposed near-storage computing paradigm that automatically scales the near-store computing capability without the need to rewrite the application software with each new generation SSD. We first describe the system architecture and then present our extended NVMe command support provided for the near-storage computation paradigm. 116 Host User applications NVMe host driver Host DRAM PCIe / NVMe host interface SQ CQ SSD Controller NAND Flash SSD Controller Interconnection DRAM (Data buffer) Flash controller DRAM controller I/O controller (NVMe command decoder) Summarizer Flash translation layer (FTL) TQ User functions Task controller Figure 4.1: The overall architecture of NVMe controller and Summarizer 4.4 SSD Controller architecture Figure 4.1 illustrates the overall architecture of NVMe SSD controller and also high- lights the additional components that are introduced for enabling Summarizer (which are described in detail in the next section). The host applications interact with NVMe device through the host-side driver and the SSD controller firmware running on an embedded processor in SSD. The host driver and the controller firmware on NVMe SSD device communicate via PCIe bus. 
NVMe commands issued by the NVMe host driver are reg- istered in the submission queues (SQs) within the host DRAM space, and the doorbell signal corresponding to the requested command is sent to the SSD controller to notify a new command request from the host. 117 The major functions of the SSD controller are I/O control and flash translation layer (FTL) processing. The SSD controller receives request commands from the host by reading the registered request from the head of SQ. The SSD controller can fetch host request commands as long as the registered requests exist in SQs. After fetching the NVMe command it is decoded into single or multiple page-level block I/O commands. Each page-level request has a logical block address (LBA), which is translated to a phys- ical page number (PPN) by the FTL processing. The flash memory controller accesses flash memory chips by the page-level commands. For a NVMe read command the requested page data is fetched from flash memory chips through a series of physical page reads and the fetched data is buffered in the DRAM on the SSD device. Then the page data is transferred to the host memory via the direct memory access (DMA) mechanism. After NVMe command handling completes, the SSD controller notifies the completion of the previously submitted command by registering the NVMe command and its return code in the completion queue (CQ) on the host memory. 4.4.1 Summarizer architecture and operations In this section we describe the necessary hardware and software modifications to the SSD controller architecture described above to enable Summarizer. Summarizer can be implemented with some minor modifications to the NVMe command interpreter and a software module added to the SSD controller. We envision majority of Summarizer 118 functionality to be implemented in the interface between the SSD controller and the flash memory controller. Summarizer has three core components: (1) a task queue structure, (2) a task con- troller module and (3) user function stacks as shown in Figure 4.1. The task queue (TQ) is a circular queue which stores a pointer to the appropriate user function that must be invoked when the host requests in-SSD processing on a given I/O request. The task con- troller decides whether in-SSD processing is performed for the fetched page data or not. If the controller decides to execute in-storage computation the target function is executed from the user function stacks. In some cases the task controller may not perform in-SSD processing even if the host requested for such a processing. We explored two different options for how the controller can make such a determination, and these two options are described later. Summarizer-enabled SSDs allow the host to associate a specific user-defined function to be executed on the SSD controller with every data access request. To enable such an association we define the new NVMe commands to specify initialization, computation and finalization operations. The extended NVMe commands triggering these three steps are listed in Table 4.1. In the listn after TSK represents an identifier of task TSK. INIT TSKn: When the host NVMe driver issues INIT TSKn command, the SSD controller calls the initialization function for TSKn. This command essentially informs the Summarizer’s task controller that the host intends to execute a user-defined task n near storage. During the initialization step the task’s local variables or any temporary data that may be used by that task are initialized. 
Figure 4.2: Detailed in-SSD task control used by Summarizer

Table 4.1: New NVMe commands to support Summarizer
  Command           Description
  INIT TSKn         Initialize variables or set queries
  READ PROC TSKn    Read page data and execute the computation kernel for task n with the data
  READ FILT TSKn    Read page data and filter the data by predefined queries
  FINAL TSKn        Transfer outputs of FN n to the host

READ PROC TSKn: The READ PROC TSKn command resembles the conventional NVMe READ command, except that this command carries information regarding the desired task that may be executed on the SSD controller once a page is read from the flash memory. Note that the existing NVMe READ command has multiple reserved bytes that are not used for any processing. We use these unused bytes to specify the task id in the READ PROC TSKn command. Like the conventional NVMe READ command, the SSD controller issues the read request to the flash memory controller to fetch page data. In addition, the SSD controller also recognizes that the host is requesting in-SSD processing for this data request, and the processing task is specified in the task identifier field (TSKn) of the NVMe command itself. This information is tagged with the request. For this purpose the request queue entry carries two additional fields: a 1-bit in-SSD compute flag and the task id field. These fields are set by the SSD controller when executing the READ PROC TSKn command. In addition, the SSD controller adds the request to the Summarizer's task queue.

The flash controller processes the read request by accessing the appropriate channel and chip ids. The fetched data is first buffered in the SSD DRAM and the completion signal is sent to the response queue as in any regular SSD. In Summarizer the flash controller also transfers the two additional in-SSD computing fields to the response queue. The response queue data is usually sent back to the host via DMA by the SSD controller. However, with Summarizer the SSD controller checks the in-SSD compute flag bit. If the bit is set, it is an indication that the host requested in-SSD computation for this page. In this case the task controller decides whether in-SSD processing is performed for the fetched page data or not. If the controller decides on in-SSD processing, the computation task pointed to by the user function pointer registered in the TQ entry is invoked. The buffered page data is used as an input of the computation kernel. The intermediate output data produced by the computation kernel updates the variables or the temporary data set that was initialized by the initialization step. Then a special status code is returned to the host to indicate that in-SSD computation was performed for the corresponding page data, instead of transferring the entire page data to the host's main memory.

Task controller modes: The task controller in Summarizer can execute in either static or dynamic mode. In the static mode, whenever the in-SSD computing flag is set, that computation is always completed on the fetched data irrespective of the processing delay of the embedded processor. In the static mode, when an in-SSD computation request is not possible because the TQ is full, the return process is simply stalled. We also explored a dynamic task controller approach. When Summarizer is running in the dynamic mode, if in-SSD computation is delayed in the SSD controller due to lack of computation resources, the buffered page data is transferred to the host even though READ PROC TSKn was issued by the host. This situation happens when the service rate (execution time) of the embedded processor is slower than the incoming rate of in-SSD computation requests. Such congestion happens frequently in the presence of very wimpy SSD cores if near-data processing is applied aggressively on fetched page data.
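To make the two policies concrete, the following C++ sketch shows the decision the task controller makes for one fetched page. All names and types here (Page, Request, task_queue, dma_to_host, and so on) are hypothetical stand-ins used only for illustration; they do not correspond to the actual firmware symbols, which operate on the request/response queue entries described above.

#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-ins for the controller state described in this section.
struct Page { uint8_t data[8192]; };
struct Request { bool in_ssd_compute_flag; int task_id; };
enum class Mode { Static, Dynamic };
enum Status { TRANSFERRED_TO_HOST, COMPUTED_IN_SSD };

std::deque<int> task_queue;                 // TQ: task ids waiting for the embedded core
constexpr std::size_t TQ_CAPACITY = 64;     // assumed queue depth

Status dma_to_host(const Page&, const Request&) { return TRANSFERRED_TO_HOST; }
Status run_user_function(const Page&, int /*task_id*/) { return COMPUTED_IN_SSD; }

Status on_page_fetched(const Page& page, const Request& req, Mode mode) {
    if (!req.in_ssd_compute_flag)            // plain NVMe READ: no in-SSD work requested
        return dma_to_host(page, req);
    if (task_queue.size() < TQ_CAPACITY) {   // slack available: run the user function in-SSD
        task_queue.push_back(req.task_id);
        return run_user_function(page, req.task_id);
    }
    if (mode == Mode::Static) {              // static policy: stall until a TQ slot frees up
        // ...wait for the embedded core to drain the queue (omitted)...
        return run_user_function(page, req.task_id);
    }
    return dma_to_host(page, req);           // dynamic policy: fall back to host processing
}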
READ FILT TSKn: The operation of READ FILT TSKn is similar to that of READ PROC TSKn except that filtering is performed and the filtered data is transferred to the host. A filtering request is also a computation task, but in this work we consider a request a filtering task if the host processor only offloads part of the computation task to the SSD processor and retains some of the computing for execution on the host. For instance, a filtering task may use simple compare operations on specific data fields within a page to remove some data that is not needed at the host. The filtering conditions are pre-defined during the initialization step by the INIT TSKn command. The filtered data size is recorded in the reserved 8-byte region in the NVMe command and registered in the CQ when filtering execution is complete.

FINAL TSKn: The host machine can gather the output result of the computation kernel for task n using the FINAL TSKn command. When that command is issued, the results stored in the DRAM on the SSD are transferred to the host memory. The size of the transferred data is also logged in the reserved 8-byte field of the NVMe response command.

4.4.2 Composing Summarizer applications

As stated in Section 4.4, Summarizer piggybacks on page-level flash read operations to execute user-defined functions before returning processed data to the host. As such there are some basic restrictions on data layout and computing that must be followed. For instance, the input data for Summarizer should be aligned at page granularity (4 KB – 16 KB). If data overlaps across page boundaries, a more complex Summarizer data management strategy is necessary. In this work we instead provide a data layout and computing API to the programmer to satisfy the page-granularity-based computing restrictions. In particular, we provide the following Summarizer methods, which serve as wrappers that allow conventional user programs to use the proposed Summarizer NVMe commands.

STORE: To exploit Summarizer it is necessary to align data sets in a page-size memory space. To support page-level alignment, the STORE primitive of the Summarizer API first assigns user data sets in 4 KB or 16 KB data space and then directly issues store block I/O commands to the host NVMe driver. If the valid data is less than one page, then that page's metadata stores the valid data size.

READ: The application programmer can use the READ API to specify the data set to compute on and the desired computation (i.e., SSD functions) to apply to the data set. As data sets are aligned at page granularity, the READ API will be translated into READ PROC TSKn or READ FILT TSKn NVMe commands at page granularity. If the SSD does not support Summarizer functionality, the READ command will map to the default NVMe read command, thereby preserving compatibility with all SSDs. Note that we assume that the READ command is mapped to READ PROC TSKn or READ FILT TSKn explicitly by the programmer or an external compilation system.
COMPUTE: Recall that the SSD controller may optionally execute the user function or may return the entire page data back to the host. Thus the dynamic task controller approach requires a bit more effort in the host-side code to determine whether a page needs processing on the host or not, based on the response received from the SSD controller. As such, the application programmer uses the COMPUTE function as a wrapper to handle the different return values from invoking the READ function. The COMPUTE wrapper simply encapsulates all host function invocations under a conditional statement that checks the return code from the SSD controller before initiating host-side execution on a page.

GATHER: Since computation is distributed across both the host CPUs and the SSD devices, it is necessary to gather the output of kernel computation performed on the SSD devices. The GATHER wrapper function in the application program issues the finalization NVMe command to collect processing output from the SSD devices. The collected output is then merged with the output from the CPU computations by the programmer.

The programmer can compose in-storage Summarizer programs using imperative programming languages like C and C++. Using the Summarizer API, it is easy to extend programs that execute only on the host system to also execute functions on the processor near the SSD. For the applications that we describe later in Section 4.5, it took us 3–10 person hours for each application on average (and note that the effort level reduced once the first application conversion was completed). As Summarizer inherits the imperative programming model, Summarizer leverages existing ARM programming toolchains to generate the machine code running on the SSD controller.
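As a rough illustration of how these wrappers fit together in host code, the sketch below shows a skeleton that offloads a task and merges the results. The wrapper names and signatures (summarizer_init, summarizer_read, and so on) are hypothetical stand-ins for the STORE/READ/COMPUTE/GATHER primitives described above, not the exact Summarizer interface; the stub bodies exist only so the sketch is self-contained, whereas real wrappers would issue the extended NVMe commands.

#include <cstdint>
#include <vector>

struct PageResult { bool processed_in_ssd; std::vector<uint8_t> data; };

// Hypothetical wrappers; in a real Summarizer program these issue the extended
// NVMe commands (INIT TSKn, READ PROC/FILT TSKn, FINAL TSKn) described earlier.
void summarizer_init(int /*task_id*/) {}
PageResult summarizer_read(int /*task_id*/, uint64_t /*page*/) { return {true, {}}; }
std::vector<uint8_t> summarizer_gather(int /*task_id*/) { return {}; }
void process_on_host(const std::vector<uint8_t>& /*page*/) {}   // host-side fallback kernel

void run_offloaded_task(int task_id, uint64_t num_pages) {
    summarizer_init(task_id);                    // set up filter predicates / task state in-SSD
    for (uint64_t p = 0; p < num_pages; ++p) {
        PageResult r = summarizer_read(task_id, p);
        if (!r.processed_in_ssd)                 // COMPUTE-style check: the host computes only
            process_on_host(r.data);             // on pages the SSD returned unprocessed
    }
    std::vector<uint8_t> ssd_partials = summarizer_gather(task_id);
    // GATHER: merge ssd_partials with the host-side partial results (application specific).
    (void)ssd_partials;
}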
4.5 Case studies

Summarizer can provide benefits to a wide range of applications. To evaluate the proposed model, we present several case studies suited for Summarizer execution, from databases to data integration, and demonstrate how Summarizer helps avoid redundant data transfer and improve application performance.

4.5.1 Data analytics

Decision Support Systems (DSS) are a class of data analytics where a user performs complex queries on a database to understand the state of their business. DSS queries are usually exploratory and prefer early feedback to help identify interesting regions. Many of the DSS queries perform a significant amount of database filtering and only use a subset of database records to perform complex computations. The amount of computation per byte of data transferred is quite low in these applications. Using Summarizer to enable data filtering, or even to execute the entire query near the SSD, helps reduce the data bandwidth demands on the host.

We run the TPC-H benchmark to test the performance of data analytics. TPC-H is a well-known data warehouse benchmark. It consists of a suite of business oriented ad-hoc queries. We select TPC-H queries 1, 6, and 14, which require several operations such as where conditions, join, group by, and order by. These operations are also performed in many other TPC-H queries. TPC-H queries 1, 6, and 14 are shown in Algorithms 5, 6, and 7, respectively. In our experiments, we evaluate these queries on the TPC-H databases with scale factor 0.1 (100 MB). Note that this scale factor is simply a limitation of our prototype board (described in the next section) due to its limited capacity, not a limitation of Summarizer.

Algorithm 5 TPC-H query 1
select l_returnflag, l_linestatus, sum(l_quantity) as sum_qty, sum(l_extendedprice) as sum_base_price, sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_chrg, avg(l_quantity) as avg_qty, avg(l_extendedprice) as avg_price, avg(l_discount) as avg_disc, count(*) as count_order
from lineitem
where l_shipdate <= date '1998-12-01' - interval '[DELTA]' day
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;

Algorithm 6 TPC-H query 6
select sum(l_extendedprice * l_discount) as revenue
from lineitem
where l_shipdate >= date '[DATE]' and l_shipdate < date '[DATE]' + interval '1' year and l_discount between [DISCNT] - 0.01 and [DISCNT] + 0.01 and l_quantity < [QUANTITY];

Algorithm 7 TPC-H query 14
select 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount) else 0 end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
from lineitem, part
where l_partkey = p_partkey and l_shipdate >= date '[DATE]' and l_shipdate < date '[DATE]' + interval '1' month;

4.5.2 Data integration

Data integration is the problem of combining data from different sources and/or in different formats. This problem is crucial for large enterprises that maintain different kinds of databases, for better cooperation among government agencies, each with their own data sources, and for search engines that manage all kinds of web pages on the Internet.

Similarity join is an important step in the data integration process. While SQL provides support (such as join) to combine complementary data from different sources, it fails if the attribute values of a potential match are not exactly equal due to misspelling or other structuring issues. Similarity join is an effective way to overcome this limitation by comparing the similarities of attribute values, as opposed to exactly matching corresponding values.

The similarity join problem can be defined as: given a collection of records, a similarity function sim(), a similarity threshold t, and a query record q, find all pairs of records <q, x> such that their similarity value is at least the given threshold t, i.e., sim(q, x) ≥ t. We adopt the Overlap similarity, which can be defined as O(q, x) = |q ∩ x| / min(|q|, |x|). We use the DBLP dataset, which is a snapshot of the bibliography records from the DBLP website. It consists of nearly 0.9M records. Each record consists of the list of authors and the title of the publication. The dataset is preprocessed to tokenize each record using white spaces and punctuation. The tokens in each record are sorted based on their frequency in the entire dataset. The records are then sorted based on their lengths (number of tokens).

The prefix filtering based similarity join algorithm that we implemented is shown in Algorithms 8 and 9. First we filter each record x which is similar to q from the dataset using the prefix filtering principle [115]. The prefix filtering principle is as follows: let the p-prefix of a record x be the first p tokens of x. If O(q, x) ≥ t, then the (|q| − ⌈t·|q|⌉ + 1)-prefix of q and the (|x| − ⌈t·|x|⌉ + 1)-prefix of x share at least one token. For example, with t = 0.8 and |q| = 10, only the first 10 − 8 + 1 = 3 tokens of q need to be examined. Only the records that pass the prefix filtering stage are verified to check if they meet the overlap similarity threshold.

Algorithm 8 Prefix filtering similarity join
Input: query record q, tokenized dataset D, threshold t
Output: set of records similar to q in D
  S ← ∅
  for each x ∈ D do
    a ← |q| − ⌈t·|q|⌉ + 1
    b ← |x| − ⌈t·|x|⌉ + 1
    for i = 1 to a do
      for j = 1 to b do
        if q[i] == x[j] then
          match ← true
    if match is true then
      similar ← Verify(q, x, t)
      if similar is true then
        S ← S ∪ {x}
  return S

Algorithm 9 Verify(q, x, t)
Input: query record q, matched record x, threshold t
  minLength ← min(|q|, |x|)
  overlap ← 0
  for i = 1 to minLength do
    for j = 1 to minLength do
      if q[i] == x[j] then
        overlap ← overlap + 1
  similarity ← overlap / minLength
  if similarity > t then
    return true

4.6 Methodology and implementation details

Summarizer essentially trades computation near storage for reduced bandwidth to the host. We evaluated four different strategies to study this trade-off. (1) The first strategy is our baseline, where the entire computation is done on the host processor, as is the case today. In this baseline the host processor receives all the data required for computation from the SSD. (2) At the other extreme, one may consider doing all the computation at the wimpy processor near the SSD, which involves only communicating the output values and the input values related to the query with the host processor. All the data required for computation is fetched and processed within the SSD. (3) As the cores near the SSD have relatively lower processing power, it may be better to use the host and the wimpy SSD processors collaboratively. To evaluate this strategy, we used two different approaches. One approach is custom hand-coding of the workload. For the hand-coding approach we analyze the applications and map computations to processors according to the strengths of the processors. Intuitively, computations which help filter much of the data communicated to the host processor shall be mapped to the processors near the SSD. Parts of the program with high computational intensity shall be mapped to the host processor. While the computational intensity of any function may be automatically quantified based on simple metrics such as total instruction count, in this work we hand-classified the DSS queries and data integration code into functions that are either data intensive or computation intensive. (4) The last approach we evaluated is an automatic approach that dynamically selects which pages to compute on the embedded core and which pages to process on the host. The automatic approach is agnostic to the entire workload when distributing the computation tasks. For this mode, the host CPU marks all pages for in-SSD computation when block requests are issued to the SSD device. Once the page data is fetched from NAND flash, the SSD controller checks for empty slots in the TQ. Recall that the TQ is a queue in the Summarizer architecture where a page is registered to be considered for processing near the SSD. If there are empty slots, the page is registered in the TQ and the page data is computed in the SSD. Otherwise, all page data is transferred to the host CPU without processing.

Clearly, executing the entire workload on the host or the SSD controller is trivial. The only challenge here is to compile the workload to run on two different ISAs: the host processor in our implementation is based on x86, while the SSD controller core is based on an ARM core. However, collaborative execution on two processors requires workload distribution. For the hand-coded version we used the following division of work for each workload, as shown in the list below. Note that there are several variants of this hand-coded version that can be implemented (and we in fact evaluated some of these variants), but as we show later in the results section, hand-coded optimization, while better than the baseline, is generally outperformed by the dynamic workload distribution approach. The dynamic approach is much easier to adopt in practice. The programmer does not have to worry about workload distribution, and the system automatically determines where to execute the code based on dynamic system status.

• TPC-H queries 1 and 6: For static workload distribution, we implement the where condition at the processor near the SSD and then transfer just the items of the record needed to do the group by and aggregation operations.

• TPC-H query 14: We implement a hash join algorithm to perform the equi-join operation between the lineitem and part tables. The first phase of this algorithm is the build phase; in this phase, using the part table, a hash table is built with the table key used for hashing and the items required for further processing as the values in the hash table. In the next phase, i.e., the probe phase, we traverse the lineitem table and check if the key item is present in the hash table. For the strategy that processes at both processors, we check the where condition at the processor near the SSD and transfer only the items of the record that are needed for hashing and then aggregation.

4.7 Evaluation

4.7.1 Evaluation platform

We evaluated the performance of Summarizer using an industrial-strength flash-based SSD reference platform. The architecture of the SSD development board is illustrated in Figure 4.3. The board is equipped with a multi-core ARM processor executing the SSD controller firmware programs (including FTL management, wear-leveling, garbage collection, NVMe command parsing, and communication with the host), and an FPGA where the flash error correction logic and NAND flash memory controller logic are implemented. The ARM processor communicates with the host processor via a PCIe Gen.3 ×4 bus. Also, the ARM processor and the NAND flash controller logic on the FPGA transfer NAND flash access commands and data through PCIe buses on the SSD development board.

Figure 4.3: The SSD development platform ((a) architecture, (b) SSD development board)

The NAND flash controller on the FPGA accesses two NAND flash DIMMs that are equipped with 4 NAND flash chips per DIMM. The prototype board design faithfully performs all the functions of commercial SSDs. Unlike commercial SSD devices, our prototype NAND flash interface on the development board has lower internal data bandwidth since the NAND flash DIMMs have fewer NAND flash stacks and fewer channels. In addition, the NAND flash controller on the FPGA runs with a lower clock frequency due to the error correction logic implementation limitations.
These limitations are purely due to cost considerations in designing our boards. Hence, the internal SSD bandwidth observed in our board is significantly lower than in commercial SSDs. To accommodate the lower internal bandwidth limitation, the host-SSD bandwidth is also set to be proportionally lower. In commercial Samsung SSDs the external bandwidth is typically 2–4× lower than the peak internal bandwidth. Hence, the host-SSD bandwidth is set 2× lower in our board compared to the internal FPGA-ARM core bandwidth.

Another implementation difference between our board and commercial SSDs is that our board is equipped with more powerful embedded cores than what is typically seen on commercial SSDs. Compared to the reported clock frequency (400 or 500 MHz) of commercial NVMe SSD controllers, the ARM processor on our development platform runs at a faster peak clock frequency (1.6 GHz). Hence, the host-embedded performance ratios favor more embedded core computing. To mimic commercial SSDs we throttled the ARM core frequency. In the next subsection we describe how we select the throttling frequency.

Figure 4.4: Execution time by the ratio of in-SSD computation for (a) TPC-H query 6, (b) TPC-H query 1, (c) TPC-H query 14, and (d) SSJoin

Table 4.2: Processing by I/O ratio on data center SSDs
  Data set           Processing by I/O ratio
  TPC-H query 6      0.42
  TPC-H query 1      1.08
  TPC-H query 14     0.39
  Similarity Join    0.93

4.7.2 Calibration based on workload measurements

We tested several database applications introduced in Section 4.5 on an Intel NVMe SSD platform for this study. First we measured the ratio of I/O time to compute time of the applications in order to throttle down the host CPU as well as the embedded core performance of the SSD to meet the measured I/O and compute time ratio on the SSD development platform. Table 4.2 shows the compute time by I/O time ratio measured on the real NVMe SSD platform, which is equipped with an Intel i5-6500 (4 cores running at 3.2 GHz) with 8 GB DRAM and an Intel NVMe SSD. The Intel NVMe SSD is a 750 Series SSD with PCI-Express 3.0 and 20 nm multi-level cell technology. We then ran the same workload on our platform to create an equivalent processing to I/O ratio. Based on this ratio we set the frequency of the ARM core to be 200 MHz. Note that some of the reduction in frequency is also due to the smaller internal bandwidth we have on our platform.

4.7.3 Summarizer Performance

As described in the methodology section, we consider workload division at page-level granularity between the host processor and the embedded processor near the SSD. Figure 4.4 shows the performance change by the degree of in-SSD processing. The X-axis indicates the ratio of the number of pages processed in-SSD with Summarizer to the number of pages processed at the host processor. The Y-axis is the execution time normalized to the baseline, namely all data processed in the host CPU. For this study we adjust the ratio of the pages marked for in-SSD computation in the host application and compare the performance change. Thus, zero on the X-axis means all data is processed by the host CPU, namely the baseline. If the ratio is one, it means all data is computed in the SSD and the host CPU receives only the final result. The results where the X-axis shows specific numbers between 0 and 1 correspond to the static mode operation of Summarizer as described in Section 4.3. The bar labeled DYN represents results obtained using the dynamic mode of Summarizer. In the dynamic mode, all page fetch requests from the host CPU are issued using READ PROC TSKn commands, and Summarizer dynamically decides the in-SSD computation for the requested pages. HD means hand-coded task offloading for in-SSD computation. Simple tasks (e.g., a database field filtering function) are performed on the SSD to reduce data traffic. The READ FILT TSKn NVMe command is exploited to perform filtering operations in the HD mode. We tested the performance of the hand-coded version under Summarizer's static mode; namely, the filtering tasks are performed in the SSD regardless of available resources in the embedded processors.

Each execution time bar is split into two components: time spent on the host side (labeled as host time in the bar), and time spent on the SSD side (labeled as SSD time). When using the baseline (X-axis label 0) the time spent on the SSD side is purely used to read the NAND flash pages and transfer them to the host. For the other bars the time spent on the SSD side includes the time to read and process a fraction of the pages on the SSD. Clearly, processing only at the SSD (X-axis label 1) leads to significant performance degradation since the data computation takes longer on the wimpy SSD controller core. The hand-coded version (labeled HD) provides better performance than the static page-level SSD computation for all or a large percentage of pages, but in general hand-coding is a static approach that does not adapt to changing system state. As the wimpy cores on the SSD get overloaded, even though filtration tasks do not require many computation resources, our results show the static in-SSD offloading approach is ineffective as the I/O request rates exceed the service rate of the wimpy cores.

The results in Figure 4.4 also demonstrate that placing all computation on the host processor or the SSD processor does not deliver the best performance. As such, each application has a sweet spot where collaborative computation between the SSD and the host gives the best performance. But this sweet spot varies from workload to workload and may even vary based on the input data. In the dynamic mode Summarizer dynamically decides where the user application functions are performed by observing the availability of the embedded processor in the SSD. This dynamic approach can reduce the burden on programmers in deciding the division of in-SSD computing to achieve better performance while exploiting the computation resources in the SSD.

The performance of TPC-H query processing with DYN is improved by 16.4%, 10.3% and 20.3% for queries 6, 1 and 14, respectively. On average our current Summarizer prototype can improve the TPC-H performance by 15.7%. For similarity join the performance is improved by 6.9%. The percentage improvements that we observed are directly related to the amount of computation that we need to perform on each page of data. For TPC-H queries 6 and 14, as most of the records are filtered by the where condition, the amount of work that we do at each page is less and the improvements are higher. For TPC-H query 1 most of the records pass the where condition and the amount of work done at each page is higher when compared to TPC-H queries 6 and 14. Further, we observe that the amount of work involved at each page in similarity join is even higher and concomitantly the improvements are lower.

Figure 4.5: Performance improvement by internal/external bandwidth ratio

It is important to note that almost all the bars in the figure use collaborative computation between the host and SSD processor (except for 0 and 1). However, the performance improvements achieved by DYN are not simply due to the availability of an additional wimpy CPU resource. As shown in the bars, much of the gains come from reduced I/O processing time, rather than from having an additional wimpy core on the SSD.

We must also emphasize that the performance improvements seen in this figure are somewhat constrained by the evaluation platform, which has severely limited internal bandwidth compared to commercial SSDs. As such we believe that these results are only a demonstration of the Summarizer potential rather than an absolute performance gain.

Figure 4.6: Performance improvement by SSD controller's computation power

4.7.4 Design space exploration: Internal/external bandwidth ratio

Figure 4.5 shows the performance change as a function of the ratio of the internal data bandwidth between the SSD controller and NAND flash chips and the external bandwidth of PCIe between the host processor and the SSD. As stated earlier, even though the internal bandwidth of the SSD is easier to increase, current SSD designers have no incentive to increase the internal bandwidth since the external bandwidth determines the system performance. Summarizer provides a compelling reason for decoupling the internal and external bandwidth growth. As shown in the results, in-SSD computation is more beneficial if the internal bandwidth is higher than the external bandwidth.

4.7.5 Design space exploration: In-SSD computing power

As Summarizer exploits the underutilized computation power of SSD controller processors, it is expected that the performance benefits from in-SSD computation will improve with more powerful embedded processors in SSDs. In order to explore the performance impacts of powerful embedded processors for in-SSD computation, we measured the performance changes of the overall system by changing the computation power of the embedded processor. As mentioned in the previous section, we throttled down the clock frequency of the embedded core to mimic the operation of commercial SSDs. We use the throttling capability to increase the frequency of the embedded core up to 1.6 GHz for 8× computation power, or increase the number of cores for in-SSD computation (2 cores running at 1.6 GHz for 16× computation power). While frequency is not a sole measure of performance, we use it as a first order metric in this study.

Figure 4.6 shows the performance change as a function of improved computation power of the SSD controller when Summarizer runs in the dynamic mode. Our experimental results show that the overall performance can be improved up to 120% for TPC-H query 1 and 94.5% on average with in-SSD computation when the performance of the embedded controller core is increased by 16×. As Summarizer uses a wimpy core to achieve the above result, Summarizer provides one compelling argument for including a more powerful embedded core in future SSD platforms.

4.7.6 Cost effectiveness

In addition to the model that Summarizer proposes, there are also system design options that can improve application performance. This section will compare the cost of Summarizer with other options.

Figure 4.7: Price by processor performance for (a) ARM platforms (CoreMark rating) and (b) x86 desktop CPUs (PassMark rating)

Embedded processor vs Host processor: The results of Summarizer may encourage system designers to equip SSDs with more powerful processors. Even though using a more powerful embedded processor may increase the overall cost of the SSD platform, it can still be more cost-effective considering the total cost of ownership (TCO) of the entire system. As shown in Figure 4.6, the overall performance is improved up to 94.5% on average with the more powerful embedded processor assigned to the in-SSD computation. This performance improvement is equivalent to doubling the host processor cores, assuming that the entire application performance is dependent on compute power and the I/O time is negligible. But in practice, to achieve a 95% performance improvement on data intensive applications with a significant I/O component, the host compute power and the rest of the system components, such as the amount of memory and the number of PCIe lanes, may need to be scaled up.

Figure 4.7 plots the performance improvement as a function of price for an x86 host CPU and ARM embedded cores. Unlike x86 CPUs, most ARM SoCs are directly delivered to product manufacturers, thus it is hard to get exact price information. Hence, we show the price changes of three versions of Raspberry Pi (RPi) that are available on the market today in Figure 4.7a. RPi is a popular single board computer that is equipped with an ARM-based SoC including video processing engines and various peripheral control IPs [113]. The X-axis of the figure is the CPU rating measured by CoreMark benchmarks [25] and the Y-axis is the price of the RPi boards. While the three generations of RPis have different system capabilities, CoreMark is mostly a CPU benchmark. As such, a 4× improvement in CoreMark rating is achieved with less than a $20 increase in price. While we acknowledge that the price is determined by many factors in the market, this is a first order approximation to demonstrate how cost effective it is to improve wimpy core performance.

Figure 4.7b shows the price as a function of x86 host CPU performance as measured by the PassMark CPU benchmark [83]. The prices and performance ratings are selected from Intel's 6th and 7th generation CPUs. The additional cost for doubling the performance of x86 desktop CPUs is around $150 as reported in Figure 4.7b. Again, we acknowledge that it should not come as a surprise to designers that doubling x86 desktop CPU performance requires much higher effort than doubling a wimpy core's performance. And the wimpy cores in the context of Summarizer are performing simpler operations, such as filtering, than a complex x86 CPU. However, the purpose of this section is to demonstrate the cost effectiveness of achieving higher system performance with cheaper in-SSD processors.

External bandwidth: Another way to improve the performance of the storage system is increasing the bandwidth between the host machine and the SSD, since higher external bandwidth may alleviate data transfer congestion in the SSD. One approach for higher external bandwidth is increasing the serial link speed of the PCIe interconnect using a higher clock frequency. However, this approach demands significant advances in serial data communication technology, and PCIe's data transfer rate has not changed since PCIe version 3.0, which was released in 2010. Another approach is to assign more PCIe lanes to the SSD. It requires more I/O pins on the SSD controller SoC (64 pins for PCIe ×4 and 98 pins for PCIe ×8 connections) and more complex wiring on the SSD board, which will cause a significant cost increase [39]. In addition, the SSD would occupy more PCIe lanes on the host machine, which are limited resources of the system. System cost also increases if the host CPU and the motherboard support more PCIe lanes. On the other hand, Summarizer can reduce the data congestion by consuming page data with in-SSD computation. Hence, Summarizer not only relieves the computation burden of the host CPU but reduces the data traffic from the SSD. Note that the data I/O time is also reduced when Summarizer is applied, as shown in Figure 4.4. Consequently, with Summarizer the external bandwidth is effectively improved since more pages are returned to the host within the same period. This improvement is achieved without the cost of increasing the PCIe bus bandwidth.

4.8 Related work

Decades ago, projects including ActiveDisks, IDISKS, SearchProcessor and RAP [1, 10, 48, 56, 58, 94] explored the idea of pushing computation to magnetic storage devices. However, due to long magnetic disk latency and relatively small input/output size, the cost-effectiveness was limited.
SmartSSD [22] focuses on how to improve specific database op- erations, such as aggregation, using in-SSD computation. Biscuit [31] states that the approach is based on flow-based programming model. Hence, the applications running on Biscuit are similar to task graphs with data pipes to enable inter-task communication. Summarizer presents a set of general purpose NVMe commands and a programming model that can be used across different application domains to show the full potential of in-SSD computing. Summarizer presents an automated approach to determine when to offload computations for applications written in imperative languages without any restrictions on the code structure. Summarizer employees general-purpose ARM-based cores that are popular in SSD controllers. Therefore, the system design can implement most architectural supports that Summarizer needs through updating the firmware. On the other hand, Active Disks 146 Meets Flash [18], Ibex [114] or BlueDBM [42] leverages re-configurable hardware or specialized processor architectures to achieve the same goal, limiting the flexibility of applications but increasing the cost of devices. To expose hidden processing power in SSDs to applications, Summarizer, Biscuit [31], Morpheus [106], SmartSSD [22], and KAML [41] all extended standard NVMe or SATA protocols for applications to describe the desired computation. Unlike KAML or SmartSSD which extended the protocols specifically for database related workloads, Summarizer’s NVMe command set provides more flexibility in using the SSD proces- sors. In terms of programming models, Summarizer leverages the matured development tools in ARM platforms to compose and generate code running on storage devices, with- out needing application designers to deal with very low-level hardware details or signif- icantly changing existing code. Biscuit’s data-flow inspired programming model [31] or the limited API support in Morpheus [106] are more appropriate for specific application scenarios. Summarizer utilizes existing processors inside flash-based SSDs that originally are used for FTL processing but are also idle most of the time, thus minimizing the additional hardware costs. Processors-in-memory [29], Computational RAMs [23, 50, 65, 85, 104], Moneta [14] and Willow [95] require additional processors in the corresponding data storage units, decreasing the cost-efficiency of proposed designs. 147 4.9 Chapter Summary Big data analytics are hobbled by the limited bandwidth and long latency of access- ing data on storage devices. With the advent of SSDs there are new opportunities to use the embedded processors in SSDs to enable processing near storage. However, these processors have limited compute capability and hence there is trade-off between the bandwidth saved from near storage processing and the computing latency. In this pa- per we present Summarizer a near-storage processing architecture that provides a set of APIs for the application programmer to offload data intensive computations to the SSD processor. The SSD processor interprets these API calls and dynamically determines whether a particular computation can be executed near storage. We implemented Sum- marizer on a fully functional SSD evaluation platform and evaluated the performance of several data analytics applications. Even with a severely restrictive SSD platform we show that when compared to the baseline that performs all the computations at the host processor, Summarizer improves the performance by up to 20% for TPC-H queries. 
When using more powerful cores within the SSD, we show that this performance can be boosted significantly, thereby providing a compelling argument for higher near-SSD compute capability.

Chapter 5
Conclusion

Graph processing plays a key role in a number of applications, such as social networks, drug discovery, and recommendation systems. Flash storage, with its higher bandwidth and better random-access latency compared to a hard disk, provides an effective alternative for large-scale graph processing, where large graphs may not fit in the available main memory and accessing storage typically becomes the bottleneck. One way to improve storage access performance is by making SSDs aware of graph semantics. For accessing pages, an SSD keeps a flash translation table that maps logical page numbers to physical page numbers. The SSD uses DRAM to cache this flash translation table and a reasonably capable processor to manage this translation and other flash management tasks. We can make use of this computational capability inside the SSD to make it aware of graph semantics.

In this work, we proposed GraphSSD, a graph-semantic-aware SSD framework that allows storage controllers to directly access graph data natively on the flash memory. We presented the graph translation layer (GTL), which translates vertex IDs directly to physical page addresses on the flash memory media. In conjunction with GTL, we proposed an efficient indexing format that reduces the overhead of GTL with only a small increase in the per-page metadata overhead. We also presented multiple optimizations to handle graph updates using delta graphs, which are merged to reduce the update penalty while at the same time balancing write amplification concerns. We implemented the GraphSSD framework on an SSD development platform to show the performance improvements over two different baselines. Our evaluation results show that the GraphSSD framework improves the performance by up to 1.85x for the basic graph data fetch functions and on average by 1.40x, 1.42x, 1.60x, 1.56x, and 1.29x for the widely used breadth-first search (BFS), connected components, random-walk, maximal independent set, and PageRank applications, respectively. We further demonstrated significant performance improvements over the state-of-the-art out-of-core graph processing engine. Grouping graph updates helped us reduce write amplification. Also, while merging these updates into the graph, our schemes access only the SSD pages that have updates, further reducing amplification. This write handling enables the read-efficient CSR format to be used for dynamic graphs as well, which hitherto had write amplification concerns.

GraphSSD demonstrated the benefits of exposing graph semantics to the storage controller. GraphSSD exposes the graph-centric capabilities of the SSD to the user through a set of basic graph access APIs, which are robust enough to support complex, large-scale graph processing.
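To make the flavor of such an interface concrete, the following is a minimal, self-contained sketch of a GraphSSD-style neighbor query. The names used here (graphssd_get_neighbors, row_ptr, col_idx) are hypothetical stand-ins rather than the actual GraphSSD API, and the in-memory CSR arrays emulate on the host the vertex-id lookup that GraphSSD performs inside the storage controller through its graph translation layer.

/* Minimal, self-contained sketch of a GraphSSD-style neighbor query.
 * The names below (graphssd_get_neighbors, row_ptr, col_idx) are hypothetical.
 * The in-memory CSR arrays emulate, on the host, the vertex-id-to-page lookup
 * that GraphSSD performs inside the controller via its graph translation layer. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* Tiny example graph in CSR form: 4 vertices, 5 directed edges. */
static const uint32_t row_ptr[] = {0, 2, 3, 5, 5};  /* per-vertex offsets into col_idx */
static const uint32_t col_idx[] = {1, 2, 2, 0, 3};  /* destination vertex ids */

/* Hypothetical API: return vertex vid's adjacency list and its length.
 * In GraphSSD the controller would resolve vid to the flash page(s) holding
 * these edges; here the "lookup" is plain CSR offset arithmetic. */
static int graphssd_get_neighbors(uint32_t vid,
                                  const uint32_t **neighbors, size_t *count) {
    size_t num_vertices = sizeof(row_ptr) / sizeof(row_ptr[0]) - 1;
    if (vid >= num_vertices) return -1;
    *neighbors = &col_idx[row_ptr[vid]];
    *count = row_ptr[vid + 1] - row_ptr[vid];
    return 0;
}

int main(void) {
    const uint32_t *nbrs;
    size_t n;
    if (graphssd_get_neighbors(2, &nbrs, &n) == 0) {
        for (size_t i = 0; i < n; i++)
            printf("neighbor of vertex 2: %u\n", (unsigned)nbrs[i]);
    }
    return 0;
}

On the real device, a single call of this form lets the controller fetch only the pages holding the requested adjacency list, rather than having the host first read index structures and then issue separate page reads.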
Large-scale graph processing can be further improved by designing graph framework algorithms suitable for flash storage. We consider the vertex-centric graph analytics frameworks, as they are at the heart of a broad set of applications. Prior to the SSD era, several out-of-core (also called external memory) graph analytics systems focused on reducing random accesses to hard disk drives while operating on graphs. These systems primarily operate on graphs by splitting the graph into chunks and operating on each chunk that fits in the main memory. This graph organization leads to better utilization of hard disk bandwidth, as the disk is accessed in large sequential chunks. However, we observed that accessing these large chunks leads to significant underutilization of the accessed data. SSDs, in contrast, can be efficiently accessed at page granularity by exploiting multiple channels. This SSD capability motivated us to revisit the out-of-core graph analytics framework.

In this work, we use the CSR format for graphs, which is more amenable to selectively loading only the active vertices in each superstep of graph processing. However, the CSR format leads to random accesses to the graph during the update process. We solve this challenge by using a multi-log update system that logs updates in several log files, where each log file is associated with a single vertex interval. Further, when accessing SSD pages with little active vertex data, we reduce the read amplification caused by page-granular SSD accesses by logging the active vertex data in the current iteration and efficiently reading that log in the next iteration. Over the current state-of-the-art out-of-core graph processing framework, our evaluation results show that the MultiLogVC framework improves the performance by up to 17.84x, 1.19x, 1.65x, 1.38x, 3.15x, and 6.00x for the widely used breadth-first search, PageRank, community detection, graph coloring, maximal independent set, and random-walk applications, respectively. These improvements come because our MultiLogVC framework is aware of the SSD's access capabilities.

Processing a large volume of data is the backbone of many application domains beyond graph processing. The cost of transferring data from storage to compute nodes starts to dominate the overall application performance in many domains, such as databases. Applications can spend more than half of their execution time bringing data from storage to the CPU [105]. With the advent of SSDs, there are new opportunities to use the embedded processors in SSDs to enable processing near storage. In the last part of the thesis, we generalized our GraphSSD approach to enable near-storage computing for a broad range of applications, called Summarizer. However, these embedded storage processors have limited compute capability, and hence there is a trade-off between the bandwidth saved from near-storage processing and the computing latency.

In this thesis, we presented Summarizer, a near-storage processing architecture that provides a set of APIs for the application programmer to offload data-intensive computations to the SSD processor. The SSD processor interprets these API calls and dynamically determines whether a particular computation can be executed near storage. We implemented Summarizer on a fully functional SSD evaluation platform and evaluated the performance of several data analytics applications. Using this prototype, we demonstrated the benefits of collaborative computing between the host and the embedded storage processor on the board. Even with a severely restrictive SSD platform, we show that when compared to the baseline that performs all the computations at the host processor, Summarizer improves the performance by up to 20% for TPC-H queries. When using more powerful cores within the SSD, we show that this performance can be boosted significantly, thereby providing a compelling argument for higher near-SSD compute capability.
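As an illustration of the kind of work split Summarizer enables, the sketch below emulates, entirely on the host, a filter-style query offload: a stand-in for the in-SSD task scans a page of records and hands the host only the qualifying ones, rather than the whole page. All names here (ssd_filter_task, record_t, the page layout) are hypothetical and chosen only for this example; the actual Summarizer commands, task format, and firmware interface are those described in Chapter 4.

/* Self-contained sketch of a Summarizer-style host/SSD work split.
 * ssd_filter_task stands in for a filter kernel running on the SSD's embedded
 * core: instead of shipping the whole page to the host, it returns only the
 * qualifying records. Names and layout are illustrative assumptions. */
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE         4096
#define RECORD_SIZE       16
#define RECORDS_PER_PAGE  (PAGE_SIZE / RECORD_SIZE)

/* 16-byte record, so one 4 KB flash page holds 256 of them. */
typedef struct { uint32_t key; uint32_t value; uint8_t pad[8]; } record_t;

/* Stand-in for one flash page full of records (on a real drive this lives in NAND). */
static record_t flash_page[RECORDS_PER_PAGE];

/* Stand-in for the in-SSD filter task: scan the page and copy out only the
 * records whose key exceeds the threshold carried in the task descriptor.
 * Only this filtered "summary" would cross the host interface. */
static size_t ssd_filter_task(const record_t *page, uint32_t threshold,
                              record_t *out, size_t out_cap) {
    size_t n = 0;
    for (size_t i = 0; i < RECORDS_PER_PAGE && n < out_cap; i++)
        if (page[i].key > threshold)
            out[n++] = page[i];
    return n;
}

int main(void) {
    /* Fill the fake page with keys 0..255. */
    for (uint32_t i = 0; i < RECORDS_PER_PAGE; i++) {
        flash_page[i].key = i;
        flash_page[i].value = i * 10;
    }

    /* Host side: instead of pulling the whole 4 KB page and filtering locally,
     * it asks for records with key > 250 and receives only the matches. */
    record_t matches[RECORDS_PER_PAGE];
    size_t n = ssd_filter_task(flash_page, 250, matches, RECORDS_PER_PAGE);

    printf("host received %zu records (%zu bytes) instead of %d bytes\n",
           n, n * sizeof(record_t), PAGE_SIZE);
    for (size_t i = 0; i < n; i++)
        printf("key=%u value=%u\n", matches[i].key, matches[i].value);
    return 0;
}

In this toy example a 4 KB page reduces to five 16-byte records crossing the host interface, which is the essence of the bandwidth-for-computation trade-off discussed above.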
152 Reference List [1] Anurag Acharya, Mustafa Uysal, and Joel Saltz. Active disks: Programming model, algorithms and evaluation. In Proceedings of the 8th International Con- ference on Architectural Support for Programming Languages and Operating Sys- tems, ASPLOS ’98, pages 81–91, New York, NY , USA, 1998. ACM. [2] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A scalable processing-in-memory accelerator for parallel graph processing. In Pro- ceedings of the 42Nd Annual International Symposium on Computer Architecture, ISCA ’15, pages 105–117, New York, NY , USA, 2015. ACM. [3] Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. A scalable processing-in-memory accelerator for parallel graph processing. ACM SIGARCH Computer Architecture News, 43(3):105–117, 2016. [4] Zhiyuan Ai, Mingxing Zhang, Yongwei Wu, Xuehai Qian, Kang Chen, and Weimin Zheng. Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk i/o. In 2017 USENIX Annual Technical Con- ference (USENIX ATC 17). USENIX Association, Santa Clara, CA, pages 125– 137, 2017. 153 [5] Zhiyuan Ai, Mingxing Zhang, Yongwei Wu, Xuehai Qian, Kang Chen, and Weimin Zheng. Clip: A disk i/o focused parallel out-of-core graph processing system. IEEE Transactions on Parallel and Distributed Systems, 30(1):45–62, 2018. [6] Tatsuya Akutsu, Satoru Miyano, and Satoru Kuhara. Identification of genetic net- works from a small number of gene expression patterns under the boolean network model. In Biocomputing’99, pages 17–28. World Scientific, 1999. [7] Amber Huffman. NVM Express Revision 1.1. http://nvmexpress.org/ wp-content/uploads/2013/05/NVM_Express_1_1.pdf, 2012. [8] Apache Flink, Iterative Graph Processing,. https://ci.apache. org/projects/flink/flink-docs-stable/dev/libs/gelly/ iterative_graph_processing.html, 2019. [9] Introduction to Apache Giraph,. https://giraph.apache.org/intro. html, 2019. [10] J. Banerjee, D. K. Hsiao, and K. Kannan. Dbc a database computer for very large databases. IEEE Trans. Comput., 28(6):414–429, June 1979. [11] Christopher L Barrett, Keith R Bisset, Stephen G Eubank, Xizhou Feng, and Mad- hav V Marathe. Episimdemics: an efficient algorithm for simulating the spread of infectious disease over large realistic social networks. In SC’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–12. IEEE, 2008. 154 [12] Simona Boboila, Youngjae Kim, Sudharshan S. Vazhkudai, Peter Desnoyers, and Galen M. Shipman. Active flash: Out-of-core data analytics on flash storage. In IEEE 28th Symposium on Mass Storage Systems and Technologies, MSST ’12, pages 1–12, April 2012. [13] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107–117, April 1998. [14] Adrian M. Caulfield, Arup De, Joel Coburn, Todor I. Mollow, Rajesh K. Gupta, and Steven Swanson. Moneta: A high-performance storage array architecture for next-generation, non-volatile memories. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, pages 385–395, Washington, DC, USA, 2010. IEEE Computer Society. [15] Feng Chen, Rubao Lee, and Xiaodong Zhang. Essential roles of exploiting in- ternal parallelism of flash memory based solid state drives in high-speed data processing. In Proceedings of the 2011 IEEE 17th International Symposium on High Performance Computer Architecture, HPCA ’11, pages 266–277, Washing- ton, DC, USA, 2011. IEEE Computer Society. 
[16] Jiefeng Cheng, Qin Liu, Zhenguo Li, Wei Fan, John CS Lui, and Cheng He. Venus: Vertex-centric streamlined graph computation on a single pc. In 2015 IEEE 31st International Conference on Data Engineering, pages 1131–1142. IEEE, 2015. 155 [17] Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu, Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. Kineograph: Taking the Pulse of a Fast-changing and Connected World. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys ’12, pages 85–98, New York, NY , USA, 2012. ACM. [18] Sangyeun Cho, Chanik Park, Hyunok Oh, Sungchan Kim, Youngmin Yi, and Gre- gory R. Ganger. Active disk meets flash: A case for intelligent ssds. In Proceed- ings of the 27th International ACM Conference on International Conference on Supercomputing, ICS ’13, pages 91–102, New York, NY , USA, 2013. ACM. [19] Brian F Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM symposium on Cloud computing, pages 143–154. ACM, 2010. [20] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. Fpgp: Graph process- ing framework on fpga a case study of breadth-first search. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Ar- rays, pages 105–110. ACM, 2016. [21] Jaeyoung Do, Yang-Suk Kee, Jignesh M. Patel, Chanik Park, Kwanghyun Park, and David J. DeWitt. Query processing on smart ssds: Opportunities and chal- lenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 1221–1230, New York, NY , USA, 2013. ACM. 156 [22] Jaeyoung Do, Yang-Suk Kee, Jignesh M. Patel, Chanik Park, Kwanghyun Park, and David J. DeWitt. Query processing on smart ssds: Opportunities and chal- lenges. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 1221–1230, New York, NY , USA, 2013. ACM. [23] Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff La- Coss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, and Gokhan Daglikoca. The architecture of the diva processing-in-memory chip. In Proceedings of the 16th International Conference on Supercomputing, ICS ’02, pages 14–25, 2002. [24] D. Ediger, R. McColl, J. Riedy, and D. A. Bader. STINGER: High performance data structure for streaming graphs. In 2012 IEEE Conference on High Perfor- mance Extreme Computing, pages 1–5, Sept 2012. [25] EEMBC. Coremark scores. http://www.eembc.org/coremark, 2017. [26] Nima Elyasi, Changho Choi, and Anand Sivasubramaniam. Large-scale graph processing on emerging storage devices. In 17thfUSENIXg Conference on File and Storage Technologies (fFASTg 19), pages 309–316, 2019. [27] NAND Flash Memory,. https://www.enterprisestorageforum. com/storage-hardware/nand-flash-memory.html, 2019. [28] Understanding Flash: Blocks, Pages and Program / Erases,. https://flashdba.com/2014/06/20/ 157 understanding-flash-blocks-pages-and-program-erases/, 2014. [29] Mingyu Gao, Grant Ayers, and Christos Kozyrakis. Practical near-data processing for in-memory analytics frameworks. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation, PACT ’15, pages 113–124, Washington, DC, USA, 2015. IEEE Computer Society. [30] Joseph E Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Powergraph: distributed graph-parallel computation on natural graphs. In OSDI, page 2, 2012. [31] B. Gu, A. S. 
Yoon, D. H. Bae, I. Jo, J. Lee, J. Yoon, J. U. Kang, M. Kwon, C. Yoon, S. Cho, J. Jeong, and D. Chang. Biscuit: A framework for near-data processing of big data workloads. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 153–165, June 2016. [32] Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, Jaeheon Jeong, and Duckhyun Chang. Biscuit: A framework for near-data processing of big data workloads. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pages 153–165, Piscataway, NJ, USA, 2016. IEEE Press. [33] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. Graphicionado: A high-performance and energy-efficient accelerator 158 for graph analytics. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’16, pages 1–13, Oct 2016. [34] Wentao Han, Youshan Miao, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vi- jayan Prabhakaran, Wenguang Chen, and Enhong Chen. Chronos: A graph engine for temporal graph analysis. In Proceedings of the Ninth European Conference on Computer Systems, EuroSys ’14, pages 1:1–1:14, New York, NY , USA, 2014. ACM. [35] Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, Jeong-Hoon Lee, Min-Soo Kim, Jinha Kim, and Hwanjo Yu. TurboGraph: A Fast Parallel Graph Engine Handling Billion-scale Graphs in a Single PC. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, pages 77–85, New York, NY , USA, 2013. ACM. [36] Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, Jeong-Hoon Lee, Min-Soo Kim, Jinha Kim, and Hwanjo Yu. Turbograph: a fast parallel graph engine han- dling billion-scale graphs in a single pc. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 77–85. ACM, 2013. [37] Safiollah Heidari, Yogesh Simmhan, Rodrigo N Calheiros, and Rajkumar Buyya. Scalable graph processing frameworks: A taxonomy and open challenges. ACM Computing Surveys (CSUR), 51(3):60, 2018. [38] Amber Huffman. NVM Express, 2013. 159 [39] ITRS. International Technology Roadmap for Semiconductors 2009 Edition: Assembly and Packaging. http://www.itrs2.net/itrs-reports. html, 2009. [40] Y . Jin, H. W. Tseng, Y . Papakonstantinou, and S. Swanson. Kaml: A flexible, high-performance key-value ssd. In 2017 IEEE International Symposium on High Performance Computer Architecture, HPCA ’17, pages 373–384, Feb 2017. [41] Yanqin Jin, Hung-Wei Tseng, Yannis Papakonstantinou, and Steven Swanson. Kaml: A flexible, high-performance key-value ssd. In 2017 IEEE International Symposium on High Performance Computer Architecture, HPCA ’17, pages 373– 384, Feb 2017. [42] Sang-Woo Jun, Ming Liu, Sungjin Lee, Jamey Hicks, John Ankcorn, Myron King, Shuotao Xu, and Arvind. Bluedbm: An appliance for big data analytics. In Pro- ceedings of the 42Nd Annual International Symposium on Computer Architecture, ISCA ’15, pages 1–13, New York, NY , USA, 2015. ACM. [43] Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu, and Arvind. GraF- Boost: Accelerated Flash Storage for External Graph Analytics. ISCA, 2018. [44] Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu, et al. BigSparse: High- performance external graph analytics. arXiv preprint arXiv:1710.07736, 2017. [45] U Kang, Charalampos E Tsourakakis, and Christos Faloutsos. Pegasus: A peta- scale graph mining system implementation and observations. 
In Proceedings of 160 the 2009 Ninth IEEE International Conference on Data Mining, pages 229–238. Washington, DC, USA, 2009. [46] Yangwook Kang, Yang-Suk Kee, Ethan L. Miller, and Chanik Park. Enabling cost-effective data processing with smart ssd. In Mass Storage Systems and Tech- nologies, MSST ’13, 2013. [47] Yangwook Kang, Yang suk Kee, Ethan L. Miller, and Chanik Park. Enabling cost-effective data processing with smart ssd. In IEEE 29th Symposium on Mass Storage Systems and Technologies, MSST ’14, pages 1–12, May 2013. [48] Kimberly Keeton, David A. Patterson, and Joseph M. Hellerstein. A case for intelligent disks (idisks). SIGMOD Rec., 27(3):42–52, September 1998. [49] Roger Kelley. Compute performance distance of data as a mea- sure of latency. https://www.formulusblack.com/blog/ compute-performance-distance-of-data-as-a-measure-of-latency/, February 2019. [50] Peter M. Kogge. Execube-a new architecture for scaleable mpps. In Proceedings of the 1994 International Conference on Parallel Processing - Volume 01, ICPP ’94, pages 77–84, Washington, DC, USA, 1994. IEEE Computer Society. [51] Gunjae Koo, Kiran Kumar Matam, Te I, H. V . Krishna Giri Narra, Jing Li, Hung- Wei Tseng, Steven Swanson, and Murali Annavaram. Summarizer: Trading 161 Communication with Computing Near Storage. In Proceedings of the 50th An- nual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, pages 219–231, New York, NY , USA, 2017. ACM. [52] Matam Kiran Kumar. Accelerating Sparse Matrix Kernels on Graphics Process- ing Units. PhD thesis, International Institute of Information Technology Hyder- abad, 2012. [53] Aapo Kyrola. Drunkardmob: billions of random walks on just a pc. In Proceed- ings of the 7th ACM conference on Recommender systems, pages 257–264. ACM, 2013. [54] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. Graphchi: Large-scale graph computation on just a pc. In Proceedings of the 10th USENIX Conference on Op- erating Systems Design and Implementation, OSDI ’12, pages 31–46, Berkeley, CA, USA, 2012. USENIX Association. [55] Jinho Lee, Heesu Kim, Sungjoo Yoo, Kiyoung Choi, H Peter Hofstee, Gi-Joon Nam, Mark R Nutter, and Damir Jamsek. ExtraV: boosting graph processing near storage with a coherent accelerator. Proceedings of the VLDB Endowment, 10(12):1706–1717, 2017. [56] Hans-Otto Leilich, G¨ unther Stiege, and Hans Christoph Zeidler. A search proces- sor for data base management systems. In Proceedings of the Fourth International Conference on Very Large Data Bases - Volume 4, VLDB ’78, pages 280–287. VLDB Endowment, 1978. 162 [57] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014. [58] Chyuan Shiun Lin, Diane C. P. Smith, and John Miles Smith. The design of a rotating associative memory for relational database applications. ACM Trans. Database Syst., 1(1):53–65, March 1976. [59] Hang Liu and H Howie Huang. Graphene: Fine-grainedfIOg management for graph computing. In 15thfUSENIXg Conference on File and Storage Technolo- gies (fFASTg 17), pages 285–300, 2017. [60] Yang Liu, Hung-Wei Tseng, Mark Gahagan, Jing Li, Yanqin Jin, and Steven Swanson. Hippogriff: Efficiently moving data in heterogeneous computing sys- tems. In 2016 IEEE 34th International Conference on Computer Design, ICCD ’16, pages 376–379, Oct 2016. [61] Yucheng Low, Joseph E Gonzalez, Aapo Kyrola, Danny Bickson, Carlos E Guestrin, and Joseph Hellerstein. Graphlab: A new framework for parallel ma- chine learning. 
arXiv preprint arXiv:1408.2041, 2014. [62] Michael Luby. A simple parallel algorithm for the maximal independent set prob- lem. SIAM journal on computing, 15(4):1036–1053, 1986. [63] Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Ku- mar, and Taesoo Kim. Mosaic: Processing a trillion-edge graph on a single ma- chine. In Proceedings of the Twelfth European Conference on Computer Systems, pages 527–543. ACM, 2017. 163 [64] Peter Macko, Virendra J Marathe, Daniel W Margo, and Margo I Seltzer. LLAMA: Efficient graph analytics using large multiversioned arrays. In Data En- gineering (ICDE), 2015 IEEE 31st International Conference on, pages 363–374. IEEE, 2015. [65] Ken Mai, Tim Paaske, Nuwan Jayasena, Ron Ho, William J. Dally, and Mark Horowitz. Smart memories: A modular reconfigurable architecture. In Proceed- ings of the 27th Annual International Symposium on Computer Architecture, ISCA ’00, pages 161–171, New York, NY , USA, 2000. ACM. [66] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Con- ference on Management of data, pages 135–146. ACM, 2010. [67] Nasim Mansurov. Nvme vs ssd vs hdd performance. https:// photographylife.com/nvme-vs-ssd-vs-hdd-performance, Ac- cessed in 2019. [68] Marvell. High performance pcie ssd controller (88ss1093). http://www. marvell.com/storage/ssd/88SS1093, 2017. [69] Kiran Kumar Matam, Siva Rama Krishna Bharadwaj, and Kishore Kothapalli. Sparse matrix matrix multiplication on hybrid cpu+ gpu platforms. In Proc. of 19th Annual International Conference on High Performance Computing (HiPC), Pune, India, pages 1–10, 2012. 164 [70] Kiran Kumar Matam, Hanieh Hashemi, and Murali Annavaram. PartitionedVC: Partitioned External Memory Graph Analytics Framework for SSDs. arXiv e- prints, page arXiv:1905.04264, May 2019. [71] Kiran Kumar Matam, Gunjae Koo, Haipeng Zha, Hung-Wei Tseng, and Murali Annavaram. Graphssd: graph semantics aware ssd. In Proceedings of the 46th International Symposium on Computer Architecture, pages 116–128. ACM, 2019. [72] Kiran Kumar Matam and Kishore Kothapalli. Accelerating sparse matrix vector multiplication in iterative methods using gpu. In 2011 International Conference on Parallel Processing, pages 612–621. IEEE, 2011. [73] Micron. Micron nand flash by technology. https://www.micron.com/ products/nand-flash, 2017. [74] Microsemi. Flashtec NVMe Controllers. http://www.microsemi. com/products/storage/flashtec-nvme-controllers/ flashtec-nvme-controllers, 2017. [75] Alan Mislove. Online Social Networks: Measurement, Analysis, and Applications to Distributed Information Systems. PhD thesis, Rice University, Department of Computer Science, May 2009. [76] Dipti Prasad Mukherjee, Nilanjan Ray, and Scott T Acton. Level set analysis for leukocyte detection and tracking. IEEE Transactions on Image processing, 13(4):562–572, 2004. 165 [77] Lifeng Nai, Yinglong Xia, Ilie G Tanase, Hyesoon Kim, and Ching-Yung Lin. GraphBIG: understanding graph computing in the context of industrial solutions. In High Performance Computing, Networking, Storage and Analysis, 2015 SC- International Conference for, pages 1–12. IEEE, 2015. [78] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastruc- ture for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 456–471. ACM, 2013. [79] OpenSSD. 
Open-source solid-state drive project for research and education. http://openssd.io. [80] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. Energy efficient architecture for graph analyt- ics accelerators. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 166–177. IEEE, 2016. [81] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. Energy efficient architecture for graph analytics accelerators. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pages 166–177, Piscataway, NJ, USA, 2016. IEEE Press. [82] Pagerank application,. https://github.com/GraphChi/ graphchi-cpp/blob/master/example_apps/streaming_ pagerank.cpp. [83] PassMark. Passmark cpu benchmark. http://www.cpubenchmark.net. 166 [84] PassMark. Hard drive benchmarks - solid state drive (ssd) chart. http://www. harddrivebenchmark.net/ssd.html, 2017. [85] David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberley Keeton, Christoforos Kozyrakis, Randi Thomas, and Kathy Yelick. Intelligent ram (iram): chips that remember and compute. In Solid-State Circuits Conference, 1997. Digest of Technical Papers. 43rd ISSCC., 1997 IEEE International, pages 224–225, Feb 1997. [86] Louise Quick, Paul Wilkinson, and David Hardcastle. Using pregel-like large scale graph processing frameworks for social network analysis. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), ASONAM ’12, pages 457–463, Washington, DC, USA, 2012. IEEE Computer Society. [87] Usha Nandini Raghavan, R´ eka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3):036106, 2007. [88] Erik Riedel, Christos Faloutsos, Garth A. Gibson, and David Nagle. Active disks for large-scale data processing. Computer, 34(6):68–74, June 2001. [89] Amitabha Roy, Ivo Mihailovic, and Willy Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 472–488. ACM, 2013. 167 [90] Semih Salihoglu and Jennifer Widom. Optimizing graph algorithms on pregel- like systems. Proceedings of the VLDB Endowment, 7(7):577–588, 2014. [91] Samsung. Samsung ssd 850 pro data sheet, rev.2.0 (january, 2015). http:// www.samsung.com/semiconductor/minisite/ssd/downloads/ document/Samsung_SSD_850_PRO_Data_Sheet_rev_2_0.pdf, 2015. [92] Samsung. Ssd 970 evo plus nvme m.2. https://www.samsung. com/us/computing/memory-storage/solid-state-drives/ ssd-970-evo-plus-nvme-m-2-1-tb-mz-v7s1t0b-am/, 2019. [93] Mohit Saxena, Michael M. Swift, and Yiying Zhang. Flashtier: A lightweight, consistent and durable storage cache. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys ’12, pages 267–280, New York, NY , USA, 2012. ACM. [94] Stewart A. Schuster, H. B. Nguyen, Esen .A. Ozkarahan, and Kenneth C. Smith. Rap.2 - an associative processor for databases and its applications. Computers, IEEE Transactions on, C-28(6):446–458, June 1979. [95] Sudharsan Seshadri, Mark Gahagan, Sundaram Bhaskaran, Trevor Bunker, Arup De, Yanqin Jin, Yang Liu, and Steven Swanson. Willow: A user-programmable ssd. In 11th USENIX Symposium on Operating Systems Design and Implementa- tion (OSDI 14), pages 67–80, Broomfield, CO, October 2014. USENIX Associa- tion. 
168 [96] Julian Shun and Guy E Blelloch. Ligra: a lightweight graph processing frame- work for shared memory. In ACM Sigplan Notices, pages 135–146. ACM, 2013. [97] Yong Ho Song. Cosmos+ openssd: A nvme-based open source ssd platform. In Flash Memory Summit 2016, Santa Clara, CA, USA, 2016. [98] Samsung SSD 860 EVO 2TB. https://www.amazon.com/ Samsung-Inch-Internal-MZ-76E2T0B-AM/dp/B0786QNSBD, 2019. [99] Power Loss Protection in SSDs,. https://image-us.samsung. com/SamsungUS/b2b/resource/2016/07/08/b2b_resource_ WHP-SSD-POWERLOSSPROTECTION-r1-JUL16J.pdf, 2019. [100] Peng Sun, Yonggang Wen, Ta Nguyen Binh Duong, and Xiaokui Xiao. Graphmp: an efficient semi-external-memory big graph processing system on a single ma- chine. In 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS), pages 276–283. IEEE, 2017. [101] Devesh Tiwari, Simona Boboila, Sudharshan S. Vazhkudai, Youngjae Kim, Xi- aosong Ma, Peter J. Desnoyers, and Yan Solihin. Active flash: Towards energy- efficient, in-situ data analytics on extreme-scale machines. In Proceedings of the 11th USENIX Conference on File and Storage Technologies, FAST’13, pages 119–132, Berkeley, CA, USA, 2013. USENIX Association. [102] Devesh Tiwari, Sudharshan S. Vazhkudai, Youngjae Kim, Xiaosong Ma, Simona Boboila, and Peter J. Desnoyers. Reducing data movement costs using energy 169 efficient, active computation on ssd. In Proceedings of the 2012 USENIX Con- ference on Power-Aware Computing and Systems, HotPower ’12, Berkeley, CA, USA, 2012. USENIX Association. [103] Devesh Tiwari, Sudharshan S. Vazhkudai, Youngjae Kim, Xiaosong Ma, Simona Boboila, and Peter J. Desnoyers. Reducing data movement costs using energy effi- cient, active computation on ssd. In Proceedings of the 2012 USENIX Conference on Power-Aware Computing and Systems, HotPower’12, pages 4–4, Berkeley, CA, USA, 2012. USENIX Association. [104] Josep Torrellas. Flexram: Toward an advanced intelligent memory system: A ret- rospective paper. In Computer Design, 2012 IEEE 30th International Conference on, ICCD ’12, pages 3–4, Sept 2012. [105] Hung-Wei Tseng, Yang Liu, Mark Gahagan, Jing Li, Yanqin Jin, and Steven Swanson. Gullfoss: Accelerating and simplifying data movement among het- erogeneous computing and storage resources. Technical Report CS2015-1015, Department of Computer Science and Engineering, University of California, San Diego technical report, 2015. [106] Hung-Wei Tseng, Qianchen Zhao, Yuxiao Zhou, Mark Gahagan, and Steven Swanson. Morpheus: Creating application objects efficiently for heterogeneous computing. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA ’16, pages 53–65, Piscataway, NJ, USA, 2016. IEEE Press. 170 [107] Keval V ora. fLUMOSg: Dependency-driven disk-based graph processing. In 2019fUSENIXg Annual Technical Conference (fUSENIXgfATCg 19), pages 429–442, 2019. [108] Keval V ora, Rajiv Gupta, and Guoqing Xu. Kickstarter: Fast and accurate com- putations on streaming graphs via trimmed approximations. ACM SIGOPS Oper- ating Systems Review, 51(2):237–251, 2017. [109] Keval V ora, Chen Tian, Rajiv Gupta, and Ziang Hu. Coral: Confined recovery in distributed asynchronous graph processing. ACM SIGOPS Operating Systems Review, 51(2):223–236, 2017. [110] Keval V ora, Guoqing (Harry) Xu, and Rajiv Gupta. Load the edges you need: A generic i/o optimization for disk-based graph processing. In USENIX Annual Technical Conference, pages 507–522, 2016. [111] Duncan J Watts. 
Six degrees: The science of a connected age. WW Norton & Company, 2004. [112] Overview of Wear Leveling With SSD Controllers,. https://www. ontrack.com/blog/2016/10/25/wear-leveling/, 2019. [113] Wikipedia. Raspberry pi. http://en.wikipedia.org/wiki/ Raspberry_Pi, 2017. [114] Louis Woods, Zsolt Istv´ an, and Gustavo Alonso. Ibex: An intelligent storage engine with support for advanced sql offloading. Proc. VLDB Endow., 7(11):963– 974, July 2014. 171 [115] Chuan Xiao, Wei Wang, Xuemin Lin, and Jeffrey Xu Yu. Efficient similarity joins for near duplicate detection. In Proceedings of the 17th International Conference on World Wide Web, pages 131–140, 2008. [116] Xilinx. Zynq-7000 all programmable soc data sheet. https: //www.xilinx.com/support/documentation/data_sheets/ ds190-Zynq-7000-Overview.pdf. [117] Yahoo WebScope. Yahoo! altavista web page hyperlink connectivity graph, circa 2002. http://webscope.sandbox.yahoo.com/, 2018. [118] Da Yan, James Cheng, Yi Lu, and Wilfred Ng. Blogel: A block-centric frame- work for distributed computation on real-world graphs. Proceedings of the VLDB Endowment, 7(14):1981–1992, 2014. [119] Yuan Yuan, Rubao Lee, and Xiaodong Zhang. The yin and yang of processing data warehousing queries on gpu devices. Proc. VLDB Endow., 6(10):817–828, August 2013. [120] Kaiyuan Zhang, Rong Chen, and Haibo Chen. Numa-aware graph-structured an- alytics. ACM SIGPLAN Notices, 50(8):183–193, 2015. [121] Mingxing Zhang, Yongwei Wu, Youwei Zhuo, Xuehai Qian, Chengying Huan, and Kang Chen. Wonderland: A novel abstraction-based out-of-core graph pro- cessing system. ACM SIGPLAN Notices, 53(2):608–621, 2018. [122] Jin Zhao, Yu Zhang, Xiaofei Liao, Ligang He, Bingsheng He, Hai Jin, Haikun Liu, and Yicheng Chen. Graphm: an efficient storage system for high throughput of 172 concurrent graph processing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 3. ACM, 2019. [123] Da Zheng, Disa Mhembere, Randal Burns, Joshua V ogelstein, Carey E. Priebe, and Alexander S. Szalay. Flashgraph: Processing billion-node graphs on an ar- ray of commodity ssds. In 13th USENIX Conference on File and Storage Tech- nologies (FAST 15), FAST ’15, pages 45–58, Santa Clara, CA, 2015. USENIX Association. [124] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. Gemini: A computation-centric distributed graph processing system. In 12thfUSENIXg Symposium on Operating Systems Design and Implementation (fOSDIg 16), pages 301–316, 2016. [125] Xiaowei Zhu, Wentao Han, and Wenguang Chen. Gridgraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning. In USENIX Annual Technical Conference, pages 375–386, 2015. 173
Abstract
Graph analytics play a key role in a number of applications such as social networks, drug discovery, and recommendation systems. As the size of graphs continues to grow, storage access latency is a critical performance hurdle for accelerating graph analytics. Solid state drives (SSDs), with lower latency and higher bandwidth when compared to hard drives, along with the computing infrastructure embedded in SSDs, have the potential to accelerate large-scale graph applications. However, to exploit the full potential of SSDs in graph applications there is a need to rethink how graphs are stored on SSDs, as well as to design new protocols for how SSDs access and manipulate graph data.

This thesis presents two approaches to enable such a rethinking for efficient graph analytics on SSDs. First, this thesis makes a case for making SSDs semantically aware of the structure of graph data to improve their access efficiency. This thesis presents the design and implementation of a graph semantics aware SSD, called GraphSSD. GraphSSD exploits the unique page access indirection mechanisms used in SSDs to store graph data in a compact form. This storage format reduces unnecessary page reads by enabling direct access to a vertex's metadata through the enhanced page access indirection mechanism. Further, GraphSSD supports efficient graph modifications by enabling incremental updates to the graph. These graph-centric capabilities of the SSD are exposed to the user through a set of basic graph access APIs, which are robust enough to support complex, large-scale graph processing. The second part of the thesis presents MultiLogVC, a novel approach for reducing SSD page accesses while executing out-of-core graph algorithms that use the vertex-centric programming model. This approach is based on the observation that nearly all graph algorithms have a dynamically varying number of active vertices that must be processed in each iteration. To efficiently access storage in proportion to the number of active vertices, MultiLogVC proposes a multi-log update mechanism that logs updates separately rather than directly updating the active edges in the graph. The proposed multi-log system maintains a separate log per vertex interval, thereby enabling each superstep in a vertex-centric programming model to efficiently load and process each sub-log of updates. The last part of the thesis presents a generalized near-storage computing model, for algorithms beyond graphs, that exploits the built-in computing capabilities of SSDs. The proposed computing model, called Summarizer, presents a set of APIs and data access protocols to offload the data-access-intensive parts of a computation to the embedded storage controller on SSDs. Through these three innovations this thesis makes a strong case for how to design and deploy next generation storage systems for graph analytics and beyond.