SCALING UP DEEP GRAPH LEARNING: EFFICIENT ALGORITHMS, EXPRESSIVE MODELS AND FAST ACCELERATION

by

Hanqing Zeng

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

December 2022

Copyright 2022 Hanqing Zeng

Dedication

To my wife and parents.

Acknowledgments

I would like to express my deep gratitude to my PhD advisor, Prof. Viktor Prasanna, who has carefully trained me in fundamental research skills, inspired me with vision and ambition, and been patient during my exploration of new research directions. I would like to thank Prof. Rajgopal Kannan for his guidance on my various projects and his words of wisdom on research, and on life in general. I am especially thankful to Ajitesh (now Prof. Srivastava!), who is always passionate about "unsolvable" problems and optimistic during the most challenging times. He never refused my requests for discussion, whether I was on campus, interning at Facebook, or working from home during the pandemic. For all the papers we jointly worked on, he never failed to bring big, nice surprises with his intelligence and diligence.

I would like to send my warmest hugs to my wife and my parents, without whose support I could never have finished my PhD. I can recall the many sleepless nights during the past six years when I struggled through paper rejections, unprovable ideas, implementation bugs, etc. Their love and understanding are the lights guiding me through the darkness. My wife, who is also a fellow PhD student, has shared so many ups and downs with me. Our entire PhD journeys overlap with each other! In a sense, we are more than family; we are like comrades-in-arms, ready for any challenges in the future. "Have heart, my dear."

I would like to thank my colleagues, mentors and manager at Facebook, where I did summer internships twice.
The collaboration with them has led to an important chapter of my dissertation. I deeply appreciate Yinglong Xia's trust in my ability and the opportunities he has constantly provided me, both within and outside Facebook. Muhan Zhang, who is so knowledgeable in graph learning, has given me invaluable feedback, both on GNNs and on paper writing. I admire his creativity and insights, and I feel fortunate to have a mentor like him. I owe many thanks to my manager, Andrey Malevich, who has helped me so much in understanding and solving production-scale problems. I would also like to thank Ren Chen and Long Jin for their support and advice during the two fruitful summers.

I feel lucky to have been in a diverse and active lab. I would like to thank Ren Chen, Shijie Zhou and Sanmukh Kuppannagari for their guidance during my initial PhD years. They are so kind and sincere in sharing their life experiences. I feel very lucky to have worked closely with Hongkuan Zhou and Bingyi Zhang on several research papers. Both are brilliant students with perseverance, and I wish them all the best in their own PhD journeys. Many thanks to Kartik Lakhotia, who has inspired me with his expertise in graph analytics. Thanks to Chi Zhang, Da Tong, Charith Wickramaarachchi, Yuan Meng, Sasindu Wijeratne, Jason Lin, Tian Ye, Pengmiao Zhang and many others for their inspiring discussions.

Lastly, I would like to thank my defense committee, Prof. C.-C. Jay Kuo and Prof. Robin Jia, who have made many valuable suggestions on my dissertation and have been very responsive to my various requests. I would like to thank my qualifying exam committee, Prof. Xiang Ren, Prof. Xuehai Qian, Prof. Antonio Ortega and Prof. Rajgopal Kannan, for their helpful advice on my thesis directions.

P.S. I must thank my cat Lagrange for being so calm and warm. He would stay up all night with me when I tried to catch a deadline, but apparently he is more interested in shredded paper than papers.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
  1.1 Deep Graph Learning
  1.2 Challenges and Motivation
  1.3 Metrics
  1.4 Contributions
  1.5 Organization
Chapter 2 Background and Related Works
  2.1 Graph Neural Networks from a Computation Perspective
    2.1.1 Node-centric Perspective
    2.1.2 Graph-centric Perspective
    2.1.3 Layerwise Node-centric Perspective
    2.1.4 Example GNN Models
  2.2 Expressive Power of GNNs
    2.2.1 Graph Convolution
    2.2.2 Graph Isomorphism
  2.3 Efficient GNN Variants
  2.4 GNNs Enhanced by Graph Analytics
  2.5 Computing Platforms
  2.6 Accelerators for Deep Learning and Graph Processing
Chapter 3 Overview and Categorization of Proposed Works
Chapter 4 Efficient Algorithms: A Novel Training Paradigm via Subgraph Sampling
  4.1 Sampling Methods for GNNs
    4.1.1 Node Sampling
    4.1.2 Layer Sampling
  4.2 GraphSAINT: Graph Sampling Based Inductive Learning Method
    4.2.1 Training via Subgraph Sampling
    4.2.2 Correcting the Sampling Bias
    4.2.3 Reducing the Sampling Variance
    4.2.4 General Framework
    4.2.5 Discussions and Extensions
  4.3 Subgraph Transformation for Computation Redundancy Reduction
    4.3.1 Redundancy in Neighbor Aggregation
    4.3.2 Reconstructing Subgraph from Aggregation Graph
    4.3.3 Iterative Redundancy Reduction
    4.3.4 Complexity Analysis
  4.4 Experiments
    4.4.1 Datasets and Learning Tasks
    4.4.2 Accuracy and Convergence Speed
    4.4.3 Cost of Sampling and Pre-Processing
    4.4.4 Effect of Graph Sampling Algorithms
    4.4.5 GNN Architecture Variants and Deeper Models
    4.4.6 Subgraph-level Redundancy Reduction
Chapter 5 Expressive Models: A New Class of GNNs via Depth-Scope Decoupling
  5.1 Limitations on GNN Expressivity
    5.1.1 Oversmoothing and Oversquashing
    5.1.2 Upper Bound from 1D Weisfeiler-Lehman Test
  5.2 SHADOW-GNN: Decoupling the Depth and Scope of GNNs
    5.2.1 Decoupled Models
    5.2.2 Expressivity from the Graph Signal Processing Perspective
    5.2.3 Expressivity from the Function Approximator Perspective
    5.2.4 Expressivity from the Topology Learning Perspective
    5.2.5 Neural Architecture Extensions
  5.3 Identifying Important Neighbors
    5.3.1 Heuristics Based on Graph Analytics
    5.3.2 Other Methods Based on Modeling and Learning
    5.3.3 Practical Implementation
  5.4 Experiments
    5.4.1 Datasets, Learning Tasks and Experimental Setup
    5.4.2 Characteristic of the GNN Scope
    5.4.3 Accuracy and Inference Speed for Node Classification
    5.4.4 Extension to Link Prediction
    5.4.5 Scaling to Graphs of 100M Nodes
    5.4.6 Accuracy-Speed Tradeoff During Inference
Chapter 6 Parallelization and Acceleration Techniques
  6.1 Intensive Computation, Irregular Memory Accesses and Load Balance
  6.2 Computation Kernels
  6.3 ParaGNN: Parallel GNN Training on Multi-Core CPUs
    6.3.1 Transpose of the Sparse Adjacency Matrix
    6.3.2 Data Structure for Parallel Subgraph Sampling
    6.3.3 Intra- and Inter-subgraph Parallelization
    6.3.4 Extension to Other Graph Sampling Algorithms
    6.3.5 Parallel Algorithm for Minibatch Training
    6.3.6 Complete Framework
  6.4 GraphACT: Accelerating GNN Training on CPU-FPGA Heterogeneous Platforms
    6.4.1 Hardware-friendly Minibatch Training Algorithm
    6.4.2 FPGA Architecture
    6.4.3 CPU-FPGA Co-processing
    6.4.4 Theoretical Performance Analysis
    6.4.5 Discussion on GPU-Based Acceleration
  6.5 Experiments
    6.5.1 Datasets, Platforms and Implementations
    6.5.2 Scalability of Parallel Graph Sampling
    6.5.3 Scalability of Parallel Training
    6.5.4 Evaluation on Synthetic Graphs
    6.5.5 Speedup of Hardware Accelerator
Chapter 7 Conclusion and Future Work
  7.1 Summary of Contributions
  7.2 Future Work
    7.2.1 Heterogeneous and Temporal Graphs
    7.2.2 Large Scale Deployment on Distributed Systems
    7.2.3 Federated Learning on Large Graphs
References

List of Tables

2.1 Notations related to graphs and GNNs
4.1 Specification of the graph datasets ("m" stands for multi-class classification, and "s" for single-class)
4.2 Comparison on test set F1-micro score between variants of GraphSAINT (4 sampling algorithms) and the state-of-the-art methods
4.3 Comparison with ClusterGCN on test set F1-micro score (the depth and hidden dimensions follow the configurations of the original Cluster-GCN paper)
4.4 Per-epoch training time breakdown for AS-GCN
4.5 Clustering time of ClusterGCN via executing the METIS algorithm
4.6 Comparison between GraphSAINT and ClusterGCN on total convergence time (pre-processing + sampling + training; unit: second)
5.1 READOUT(·) function for different pooling operations, where $H_{[v]}$ is the matrix consisting of all the last-layer representation vectors of nodes in $\mathcal{G}_{[v]}$
5.2 Average number of nodes touched by the approximate PPR computation on 3 large graphs
5.3 Key statistics of the graphs we evaluate; ogbl-collab is used for link prediction and all the other graphs are for node classification
5.4 Recommended minimum hardware resources for SHADOW-GNN
5.5 Comparison with state-of-the-art on test accuracy / F1-micro score and inference cost (both SHADOW-GNN and the baselines are tuned with DropEdge to improve the training quality on deep models)
5.6 Comparing SHADOW-GNN with the top-performing methods on the public OGB leaderboard for the link prediction dataset ogbl-collab
5.7 Comparing SHADOW-GNN with the top-performing methods on the public OGB leaderboard for the largest public graph ogbn-papers100M
5.8 Memory consumption of the ogbn-papers100M leaderboard methods
6.1 Summary of symbols related to the data structure and parallelization of the frontier sampling algorithm
6.2 Computation-communication ratio of various training algorithms, where $B=\sum_{\ell=0}^{L-1}\alpha^\ell b_0$ and $B'=L\cdot b_0$
6.3 Dataset statistics for the graphs evaluated by our CPU parallel algorithm and CPU-FPGA heterogeneous accelerator (synthetic graphs of various sizes are generated to better understand scalability)
6.4 Comparison with state-of-the-art with respect to hardware resources and training time (until convergence)

List of Figures

3.1 Evolution of various computation perspectives on GNNs and their tradeoffs between algorithmic complexity and hardware utilization (GC: Graph-Centric; NC: Node-Centric; LNC: Layerwise Node-Centric; GC subgraph: graph-centric subgraph construction; NC subgraph: node-centric subgraph construction)
4.1 Overview of the GraphSAINT training algorithm: in each iteration, we sample a new subgraph $\mathcal{G}_s$ of $\mathcal{G}$ and then construct a full GNN on $\mathcal{G}_s$ without considering nodes in $\mathcal{V}\setminus\mathcal{V}_s$
4.2 Example of reducing redundancy in feature propagation using a graph-theoretic approach. (a): the original subgraph; (b): the (weighted) aggregation graph and the maximum weight matching; (c): transformed subgraph with virtual nodes representing aggregated common neighbors
4.3 Degree distribution of the evaluated graphs
4.4 Convergence curves of 2-layer models on GraphSAINT and baselines (averaged over 3 runs)
4.5 Comparison of training efficiency when increasing the GNN depth (GraphSAINT: linear complexity; VRGCN: exponential complexity due to neighborhood explosion)
4.6 Cost of various subgraph sampling algorithms measured by the fraction of total training time; the expensive MRW algorithm can be accelerated by our parallelization techniques in Section 6.3
4.7 Sensitivity of test accuracy to sampling parameters (number of walkers for the random walk sampler)
4.8 GraphSAINT variants based on the JK-net and GAT model architectures (evaluated on the Reddit graph as an example)
4.9 Reduced redundancy vs. storage overhead
5.1 Example $\mathcal{G}'_{[v]}$ and $\mathcal{G}''_{[v]}$ showing that the normal GraphSAGE model cannot approximate the linear $\tau$ function: the graph on the left shows that GraphSAGE has only 2 rounds of neighbor propagation available to exclude the influence of $v'$; the graph on the right shows that a 2-layer GraphSAGE cannot generate different representations for $v$ in $\mathcal{G}'_{[v]}$ and in $\mathcal{G}''_{[v]}$
5.2 An example 3-regular graph and the 1-hop subgraph of the target nodes
5.3 An example 2-regular graph with two connected components (CC)
5.4 Difference in neighborhood composition between the normal GNN and SHADOW-GNN: the PPR-based EXTRACT filters out most distant neighbors in the original graph, and thus the resulting subgraph is dominated by nearby neighbors
5.5 Oversmoothing of SGC and local smoothing of SHADOW-SGC (the power on $\widetilde{A}$ or $\widetilde{A}_{[v]}$ is equivalent to the depth of a GCN without non-linear activation)
5.6 Measured execution time for PPR EXTRACT and the GNN model computation
5.7 Inference performance tradeoff between accuracy and time; for each SHADOW-GNN architecture and graph, we pre-train a single 5-layer model and apply it to perform inference on subgraphs of various sizes without re-training
6.1 Overview of the processing pipeline on FPGA ($L=2$)
6.2 Per-minibatch scheduling between CPU and FPGA (top), and between the two computation modules on FPGA (bottom); the sampling on CPU and the layer computation on FPGA are pipelined, and the execution of the two computation modules on FPGA is also pipelined
6.3 Execution time speedup for the parallel subgraph sampling algorithm under both inter- and intra-subgraph parallelism
6.4 Evaluation of scalability with respect to the number of CPU cores (hidden feature dimensions: 512 and 1024)
6.5 Execution time per training iteration for synthetic graphs of various sizes and average degrees. Left: the total graph size ($|\mathcal{V}|$) grows from 1 million to 33 million with a fixed average degree of 16; Right: the average degree grows from 8 to 512 with a fixed total graph size ($|\mathcal{V}|$) of $2^{20}$

Abstract

Graphs are powerful models for capturing complex relationships in real-world data. Representation learning on graphs embeds the nodes with irregular edge connections into a structured, low-dimensional vector space, such that downstream analytics (e.g., node classification, link prediction, graph classification) can be performed accurately. Recently, deep learning models such as Graph Neural Networks (GNNs) have been shown to generate node representations with high quality, by simultaneously preserving information from edge connections and node features. However, current deep graph learning approaches do not scale well with respect to model depth and graph size due to the following challenges: 1. Computation redundancy: propagating information to multiple target nodes can incur redundant computation due to common neighbors. Such redundancy is amplified by the recursive neighborhood expansion of multiple GNN layers, and thus results in computation complexity that grows exponentially with the model depth. 2. Information flow: on large graphs, the size of the multi-hop neighborhood of a target node can be massive. In the worst case, it can be comparable to the full graph size.
Information flows from such a neighborhood into a single representation vector of the target node, thus creating a bottleneck for GNNs to extract local features from the global context.

In this dissertation, we address the above GNN scalability challenges via algorithm-model-architecture co-optimization. To address computation redundancy, we develop efficient algorithms based on a novel subgraph-centric training paradigm. Our minibatch training framework (GraphSAINT), optimized with bias correction and variance reduction techniques, achieves significant accuracy improvement in orders of magnitude less convergence time compared with the state-of-the-art. We further propose a graph-theoretic approach that exploits common neighbors for subgraph-level redundancy reduction. To overcome the information flow bottleneck with expressive models, we define a new class of GNNs by decoupling the depth and scope of the model. We theoretically prove the improvement in expressivity from three different perspectives: graph signal processing, function approximation and topological learning. We further propose a practical implementation (SHADOW-GNN) which achieves state-of-the-art accuracy with 1000× smaller neighborhood size on the largest publicly available graph. Finally, to achieve fast GNN execution, we explore parallelization and hardware acceleration techniques. We propose parallel training algorithms for multi-core CPUs and design a pipelined architecture (GraphACT) for cloud FPGAs. We theoretically prove the performance bound as well as empirically demonstrate significant execution speedup compared with state-of-the-art GPU implementations.
Chapter 1
Introduction

1.1 Deep Graph Learning

Graphs are a ubiquitous representation of data: molecular structures can be captured by graphs where nodes are the atoms of the compound and edges represent the chemical bonds; roads in a city form a graph where nodes are road sections and edges capture the road connectivity; in a social network, a node represents each individual and an edge captures social relationships such as friendship; academic papers form a citation graph with edges capturing the citation relationship between papers. Learning good representations from graphs facilitates applications such as drug discovery, protein analysis, traffic forecasting, friend / product recommendation, advertisement and knowledge understanding [85, 17, 92].

Realistic graphs contain rich information from various sources. For example, in a social network, the structure of the graph reflects human communities, and analysis of such information can be performed via discrete algorithms or combinatorial optimization (e.g., Louvain community detection [15] and betweenness centrality computation [12]). On the other hand, each node itself may be associated with personalized features represented by a fixed-length vector (e.g., the node feature can be generated by a language model based on all the posts by the user [37]). The node features reside in a continuous vector space, but many traditional graph analytics are not designed to directly operate on high-dimensional vectors. The objective of deep graph learning is to learn a meaningful representation vector for each node by integrating the discrete topological information and the continuous feature information. Formally, deep graph learning aims at obtaining a function $g$, parameterized by $\Theta$:

$$y_v = g_\Theta\big(v,\, \mathcal{G}(\mathcal{V}, \mathcal{E}, X)\big) \quad (1.1)$$

where $y_v$ is the representation vector of $v$, and the graph $\mathcal{G}$ is defined by the node set $\mathcal{V}$, the edge set $\mathcal{E}$ and the matrix $X \in \mathbb{R}^{|\mathcal{V}| \times f}$ composed of the length-$f$ feature vectors of each node in $\mathcal{V}$.
In state-of-the-art models, the function $g_\Theta$ is often approximated by Graph Neural Networks (GNNs), where $\Theta$ consists of the weight matrices of the layers (in general, $g$ is a function approximated by a neural network, and $\Theta$ is the set of weight parameters of that network). Upon obtaining the representation vectors $y_v$, we can easily perform many downstream tasks with classic machine learning algorithms. For example, by applying an MLP classifier on $y_v$, we can categorize nodes in the graph (i.e., node classification); by searching for the $y_u$ most similar to $y_v$ via nearest neighbor search algorithms, we can form new edges $(u,v)$ in the graph without altering the graph characteristics (i.e., link prediction); by performing pooling on $\{y_u \mid u \in \mathcal{V}\}$ and then applying an MLP classifier, we can infer a label for the full graph (i.e., graph classification).

1.2 Challenges and Motivation

In realistic applications, the graphs are often of massive size and have complicated structure. Learning good node representations on such graphs requires expressive GNN models implemented with many layers. However, various scalability challenges arise when applying a deep GNN to a large graph.

Challenge in computation redundancy. Edges of a graph define the neighbors of each node, and thus the common neighbors of multiple nodes. Depending on the computation scheme, such common neighbors may repeatedly process the same information to propagate to different target nodes. This leads to redundancy in computation. The amount of redundancy accumulates and is amplified as the GNN depth increases: an $L$-layer GNN recursively expands the neighborhood $L$ times, and the total number of neighbors propagating to a single target node can be $O(d^L)$, where $d$ is the average degree of the graph.
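To make the downstream usage concrete, the following sketch (not from the dissertation; the embeddings, labels, and the nearest-centroid classifier standing in for an MLP are all illustrative placeholders) shows how pre-computed representation vectors $y_v$ support node classification and link prediction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, f_out = 100, 16                       # 100 nodes, 16-dim embeddings
Y = rng.normal(size=(n, f_out))          # stand-in for GNN outputs {y_v}
labels = rng.integers(0, 3, size=n)      # stand-in node labels (3 classes)

# Node classification: nearest-centroid classifier as a stand-in for an MLP.
centroids = np.stack([Y[labels == c].mean(axis=0) for c in range(3)])
pred = np.argmin(np.linalg.norm(Y[:, None, :] - centroids[None], axis=2), axis=1)

# Link prediction: propose the k nodes u whose y_u is most similar to y_v.
def top_k_neighbors(v, k=5):
    sims = (Y @ Y[v]) / (np.linalg.norm(Y, axis=1) * np.linalg.norm(Y[v]))
    sims[v] = -np.inf                    # exclude the trivial self match
    return np.argsort(sims)[::-1][:k]    # candidate endpoints for edges (u, v)

print(pred[:10], top_k_neighbors(0))
```

In a real pipeline, `Y` would come from a trained GNN and the classifier would be learned, but the interface is the same: all downstream tasks consume only the embedding matrix.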
When such redundancy is not addressed, each target node independently computes its own representation from the $O(d^L)$ neighbors, resulting in exponential computation and memory complexity to generate a single node representation. This challenge is also referred to as "neighborhood explosion" in the literature [37, 19, 97]. Note that mathematically, there is a simple way to achieve low redundancy by reducing the complexity from exponential to linear with respect to the model depth. The idea is to treat all nodes in the graph as target nodes. Thus, the joint neighborhood of all target nodes is just the set of target nodes itself (denoted as $\mathcal{V}$). Each GNN layer can simply propagate from $\mathcal{V}$ to $\mathcal{V}$ along all graph edges $\mathcal{E}$, completely avoiding the explosion of the neighborhood size for the entire GNN model. Unfortunately, such a computation scheme cannot be practically implemented on realistic large graphs for two reasons. First, in many application scenarios, the set of target nodes corresponds to only a small portion of $\mathcal{V}$, and it would be wasteful to treat all of $\mathcal{V}$ as target nodes. Second, when the full graph does not fit in the fast memory of the hardware (e.g., the global memory of a GPU, whose size is only tens of GBs), the execution of the "$\mathcal{V}$-to-$\mathcal{V}$" propagation can result in low hardware utilization due to memory access bottlenecks to external slow memory (e.g., disk).

Challenge in the quality of information flow. When generating the representation of a target node, the GNN needs to "compress" the rich information contained in a massive neighborhood into a single embedding vector. Thus, the expressive power of the model plays a critical role in improving learning quality. While a deeper GNN may be more expressive, the inter-layer information flow also becomes more complicated due to the larger neighborhood size. Thus, simply increasing the model depth does not necessarily lead to better learning. In some cases, a deeper model may even provably degrade the performance.
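A small experiment illustrates why the recursive expansion is called an explosion. The sketch below is illustrative only (the random-graph parameters are arbitrary choices, not from the dissertation): it performs $L$ rounds of neighbor expansion on a random graph and counts the nodes reached per round, which grows roughly like $O(d^L)$ until it saturates at the full node set.

```python
import random

random.seed(7)
n, d = 2000, 8                                   # nodes, target average degree
adj = {v: set() for v in range(n)}
for _ in range(n * d // 2):                      # random undirected edges
    u, v = random.randrange(n), random.randrange(n)
    if u != v:
        adj[u].add(v); adj[v].add(u)

def receptive_field(target, L):
    """All nodes an L-layer GNN touches to embed `target` (<= L hops away)."""
    reached = {target}
    for _ in range(L):
        reached = reached | {u for w in reached for u in adj[w]}
    return reached

sizes = [len(receptive_field(0, L)) for L in range(5)]
print(sizes)   # grows multiplicatively per hop until it saturates near n
```

The same counting argument explains why minibatch training on individual target nodes repeats work: the receptive fields of nearby targets overlap heavily, and each target recomputes the shared portion.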
Consider the example of Graph Convolutional Networks (GCN) [50], one of the most fundamental and popular GNN architectures. It has been shown that a deep GCN acts like a low-pass filter on the graph signals [84] (i.e., the original features of the nodes). When more GCN layers are added, the model effectively suppresses the high-frequency components and reduces the variation of features across nodes. As a result, the output representation vectors of all nodes converge to a small subspace of the full vector space, making the nodes indistinguishable from each other. Formally, this issue for the GCN architecture is referred to as "oversmoothing" [60]. The more general challenge of low-quality information flow within a GNN is referred to as "oversquashing" [8].

The above two challenges, exemplified by the well-known phenomena of "neighborhood explosion", "oversmoothing" and "oversquashing", significantly hinder the practical deployment of existing GNN models as well as the development of new GNN models. The existing literature provides partial solutions: sampling-based methods alleviate neighborhood explosion at the cost of additional memory consumption [19] or accuracy degradation [37, 18]; simplified GNN models [84, 30, 51] that treat neighbor propagation as preprocessing achieve fast training and inference, yet significantly sacrifice expressive power; expressive GNN models have been proposed to overcome oversmoothing [20, 73] and to improve the GNN information flow via diffusion [52]. However, none of these expressive models addresses the challenge of high computation and memory complexity. In summary, there is still an urgent need to develop scalable GNN models, training algorithms and accelerator designs.

1.3 Metrics

Corresponding to the two challenges above, the objective of this dissertation is to scale up GNN computation by improving both efficiency and expressivity (i.e., expressive power).
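The low-pass-filter view of oversmoothing can be reproduced numerically. The sketch below is illustrative only (the toy graph and the 50 propagation steps are arbitrary choices, and weights and nonlinearities are omitted, in the spirit of SGC-style linearized analysis): it repeatedly applies the symmetrically normalized adjacency matrix $\hat{A} = \widetilde{D}^{-1/2}(A+I)\widetilde{D}^{-1/2}$ to random node features and tracks how the feature spread across nodes collapses.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
A = (rng.random((n, n)) < 0.15).astype(float)    # random edges
A = np.maximum(A, A.T)                           # symmetrize
np.fill_diagonal(A, 0.0)

A_tilde = A + np.eye(n)                          # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^-1/2 (A+I) D^-1/2

X = rng.normal(size=(n, 4))                      # random node features
variances = []
for _ in range(50):                              # ~50 linear GCN propagations
    variances.append(X.var(axis=0).mean())       # spread of features across nodes
    X = A_hat @ X                                # one propagation step

print(variances[0], variances[-1])               # the spread collapses
```

Because all non-dominant eigenvalues of $\hat{A}$ have magnitude below 1, repeated propagation projects the features onto the dominant eigenvector, so nodes within a connected component become nearly proportional to each other, the numerical face of oversmoothing.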
Efficiency. The challenge of high redundancy potentially results in low efficiency. We define efficiency both theoretically and practically by measuring the following:

• Algorithmic complexity: how many arithmetic operations are required and how much data needs to be stored. This metric is independent of the hardware configuration and software implementation.

• Hardware utilization: how well the hardware computation resources are utilized. Low utilization means the implementation is not parallelized well, or the overall execution is bottlenecked by memory accesses (and thus the computation units idle).

Note that it is possible for a low-complexity model to suffer low hardware utilization (see "graph-centric" execution in Chapter 2.1), or for a high-complexity model to enjoy high hardware utilization (see "node-centric" execution in Chapter 2.1). Significant execution acceleration is only possible when both low algorithmic complexity and high hardware utilization are achieved.

Expressivity. The challenge of poor information flow harms the expressive power of the GNN. Expressive power can be rigorously analyzed or empirically evaluated via accuracy metrics:

• Theoretical expressive power: we analyze the class of functions a GNN can approximate, and the types of graph structures a GNN can distinguish.

• Empirical accuracy: we measure the accuracy of the GNN model on node classification and link prediction tasks.

1.4 Contributions

In this dissertation, we propose algorithm-model-architecture co-optimization to address the above challenges.

Algorithms. We propose two efficient algorithms to reduce the redundancy in GNN training at different levels. Sampling is a common way to train GNNs on large graphs. Traditionally, the GNN directly operates on the full input graph, and sampling is performed on the full $\mathcal{V}$ and $\mathcal{E}$ as part of the operations of each GNN layer.
Our novelty lies in a new subgraph-centric computation paradigm: we take sampling out of the GNN and instead treat it as a data-processing step. Specifically, our sampler keeps generating small subgraphs G_s to capture the key characteristics of the full graph G. The GNN views G_s as the new input graph, treats all of V_s as the target nodes and propagates along all of E_s. As discussed in Section 1.2, redundancy is automatically addressed when propagating information from V_s to the same set V_s. Moreover, by choosing small subgraph sizes, the subgraph-centric computation can be efficiently executed on commercial hardware. Our results include:

• We propose GraphSAINT, a general minibatch training framework based on subgraph sampling. It resolves neighborhood explosion for a wide range of GNN architectures.

• Through rigorous theoretical analysis, we address the bias and variance of the GraphSAINT minibatch estimators to improve convergence quality.

– We propose normalization techniques to eliminate the bias introduced by any (non-I.I.D.) subgraph sampling algorithm.

– We propose various lightweight subgraph sampling algorithms based on variance-reduction analysis.

• GraphSAINT has been widely used as a standard benchmark (e.g., Open Graph Benchmark [41]) and is included in the most popular open-source GNN libraries (PyTorch Geometric [29] and Deep Graph Library [81]).

• To further reduce the redundancy within each subgraph, we propose a graph-theoretic approach that transforms the input graph into an aggregation graph using a low-complexity greedy algorithm. Propagating neighbor information within the aggregation graph requires significantly fewer arithmetic operations and memory accesses than within the full input graph.

Models Instead of improving the expressive power by exploring new GNN layer operations [88, 78, 22], we define a new class of GNNs by expanding the design space associated with existing layer operations.
The key idea is to decouple the depth and scope (i.e., receptive field) of a GNN, such that an L-layer GNN does not necessarily propagate information from the full L-hop neighborhood of a target node. Similar to GraphSAINT, our decoupled GNN performs subgraph-centric computation. Yet the key difference is that one subgraph of the decoupled GNN captures the characteristics of only one target node, rather than of the full input graph. Specifically, to generate the representation of a node v, the decoupled model first samples a small subgraph G[v] = (V[v], E[v]) around v by selecting into V[v] the most relevant and important neighbors of v. Then we apply a GNN on G[v], where each layer performs V[v]-to-V[v] propagation along all of E[v]. Finally, we perform pooling on the representation vectors of all u ∈ V[v] (i.e., the last-layer outputs) to obtain the representation of v. "Decoupling" means that the scope (defined by the subgraph sampling algorithm) and the GNN depth can be independently configured by hyperparameter tuning.

Decoupled GNNs address the computation redundancy in a way similar to GraphSAINT. In addition, under practical settings, the size of G[v] remains small even when the size of the full graph G grows significantly. Thus, the computation of a deep decoupled GNN can be performed efficiently by resource-constrained hardware. Finally, the flexibility to increase the model depth without significantly increasing the computation complexity leads to increased GNN expressive power. Our results include:

• We propose a "depth-scope" decoupling design principle for existing GNN models, enabling efficient and accurate computation of deep GNNs on large graphs.

• We theoretically analyze the benefits in expressive power due to decoupling:

– From the graph signal processing perspective, we prove that decoupled-GCN performs useful local smoothing rather than harmful oversmoothing, as long as the scopes of different target nodes are different.
– From the function approximation perspective, we construct a linear target function on neighbor features and show that decoupling the GraphSAGE model reduces the function approximation error.

– From the topological learning perspective, we prove that decoupled-GIN can differentiate more neighborhood structures than the original GIN, and thus is more powerful than the classic 1-dimensional Weisfeiler-Lehman test.

• We develop various neighbor selection algorithms to construct G[v]. We show that traditional graph analytics such as Personalized PageRank (PPR) [10] can provide useful inductive bias to various GNN models.

• We perform extensive evaluation of decoupled GNNs on various downstream tasks (node classification and link prediction), including on the largest publicly available dataset of 111M nodes.

Architectures Both the above algorithm and model innovations lead to hardware-friendly GNN execution. The neighbor propagation within the subgraphs can be expressed as computation kernels such as sparse matrix multiplication (SpMM) and dense matrix multiplication (GEMM), where, under suitable samplers, the matrices fit in the fast memory of the hardware accelerators. Motivated by this, we develop mapping techniques to parallelize the GNN computation on multi-core CPUs by optimizing the memory accesses across the cache hierarchy. We further propose a pipelined FPGA architecture to accelerate GraphSAINT training with good load-balance. Our results include:

• We propose parallelization techniques for subgraph-centric execution on multi-core CPUs:

– To support parallel subgraph sampling, we propose a novel data structure that supports fast, incremental and parallel updates to a probability distribution. Our parallel sampler based on this data structure achieves, both theoretically and empirically, near-linear scalability with respect to the number of processing units.
– We parallelize all the key operations of a GNN layer to scale the overall minibatch training to a large number of processing cores. Specifically, for subgraph neighbor propagation, we perform intelligent partitioning along the feature dimension to achieve near-optimal DRAM and cache performance.

• We develop an FPGA architecture, GraphACT, to accelerate the GNN execution by pipelining:

– We design dedicated hardware modules for the various sparse and dense operations of a GNN layer.

– We develop an accurate performance model for GraphACT, and accordingly prove that the on-chip pipeline achieves good load-balance despite the fact that the subgraph structure can be irregular. The performance model identifies the architectural parameters of the FPGA design and the algorithmic parameters of the minibatch sampler.

1.5 Organization

The rest of the dissertation is organized as follows. In Chapter 2, we review the fundamentals of Graph Neural Networks, including the various layer operations and the popular model variants. We also discuss the connections between the emerging GNNs and traditional graph analytics. In Chapter 3, we conceptually unify all our completed works of the following chapters. In Chapter 4, we discuss our two efficient algorithms to address the computation redundancy in neighbor propagation – one on subgraph-based GNN training (GraphSAINT) and the other on common-neighbor-based graph transformation. In Chapter 5, we discuss the new class of GNNs (SHADOW-GNN) which achieves superior accuracy and efficiency by "depth-scope" decoupling. In Chapter 6, we present the parallelization (ParaGNN) and acceleration (GraphACT) techniques for the subgraph-centric GNN computation discussed in Chapters 4 and 5. In Chapter 7, we conclude the dissertation by summarizing the impact of the completed works and possible future directions.
Chapter 2

Background and Related Works

2.1 Graph Neural Networks from a Computation Perspective

Graph Neural Networks (GNNs), including Graph Convolutional Network (GCN) [50], GraphSAGE [37], Graph Attention Network (GAT) [78] and Graph Isomorphism Network (GIN) [88], are the state-of-the-art deep learning models for graph embedding. They have been widely shown to learn highly accurate and robust representations of graph nodes. Like Convolutional Neural Networks (CNNs) on images, GNNs are a type of multi-layer neural network, which generates node representation vectors as follows. The input to a GNN is a graph in which each node is associated with a feature vector (i.e., node attribute). The GNN propagates the features of each node layer by layer, where each layer performs tensor operations based on the model weights and the input graph topology. The last GNN layer outputs the representation vector of each node in the graph. Essentially, both the input node features and the topological information of the graph are "embedded" into the output vectors.

We next formally define the GNN forward propagation. In this dissertation, we consider undirected, unweighted graphs for simplicity of description. Most of our techniques can be extended to the case of directed and weighted graphs. Let the input graph be G(V, E, X), where X ∈ R^{|V|×f} stores the initial node attributes and f is the initial feature length. A full GNN is built by stacking multiple layers, where the inputs to the next layer are the outputs of the previous one. We use superscript "(ℓ)" to denote GNN layer-ℓ parameters. Layer ℓ contains |V| layer nodes corresponding to the |V| graph nodes. Each input and output node of the layer is associated with a feature vector of length f^{(ℓ-1)} and f^{(ℓ)}, respectively. Denote X^{(ℓ-1)} ∈ R^{|V|×f^{(ℓ-1)}} and X^{(ℓ)} ∈ R^{|V|×f^{(ℓ)}} as the input and output feature matrices of the layer, where X^{(0)} = X and f^{(0)} = f.
Table 2.1: Notations related to graphs and GNNs

Symbol | Meaning
G | an undirected, unweighted graph
V | the set of nodes of G
E | the set of edges of G
A | the |V|×|V| binary adjacency matrix of G
deg(v) | the degree of v
D | the diagonal degree matrix where D_{v,v} = deg(v)
Ã | symmetrically normalized adjacency matrix D^{-1/2} A D^{-1/2}
Â | row-wise normalized adjacency matrix D^{-1} A
f | the dimension of node features
x_v | the length-f feature vector of node v
X | the node feature matrix of shape R^{|V|×f}, where X_{v,:} = x_v^T
N_v | the set of neighbors of v, where u ∈ N_v iff (u,v) ∈ E
dist(u,v) | the shortest-path distance between u and v
N_v^ℓ | the set of all nodes u such that dist(u,v) ≤ ℓ

An input node v^{(ℓ-1)} of a layer is connected to an output node u^{(ℓ)} if and only if (v,u) ∈ E. If we view the input and output nodes as a bipartite graph, then the bi-adjacency matrix A^{(ℓ)} equals the adjacency matrix A of G. Further, denote N_v = {u | (u,v) ∈ E} as the set of direct (i.e., 1-hop) neighbors of v. Denote N_v^k as the set of all nodes no more than k hops away from v (i.e., ∀u ∈ N_v^k, the shortest-path distance between u and v satisfies dist(u,v) ≤ k).

The operations of a GNN layer can be described in two mathematically equivalent ways: one from the node-centric and the other from the graph-centric perspective. The two perspectives imply different tradeoffs in GNN computation. We also illustrate their connections via a "layerwise node-centric" perspective.

2.1.1 Node-centric Perspective

The operation of a layer is specified by separately considering the operation of each node:

x_v^{(ℓ)} = UPDATE( x_v^{(ℓ-1)}, AGGR({x_u^{(ℓ-1)} | u ∈ N_v}) )    (2.1)

where AGGR(·) is a neighbor aggregation function taking the set of neighbor features as input and returning a single aggregation vector as output; UPDATE(·) is an update function taking two vectors (the previous-layer output of v and the aggregation vector) as inputs. AGGR(·) can be implemented as a simple permutation-invariant function such as the sum or mean of all x_u^{(ℓ-1)}.
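As a minimal sketch (not the dissertation's implementation), the node-centric layer of Equation 2.1 can be written with mean aggregation for AGGR(·) and a two-weight GraphSAGE-style update for UPDATE(·); the toy path graph and identity weight matrices below are illustrative assumptions:

```python
import numpy as np

def node_centric_layer(X, neighbors, W_self, W_neigh):
    """Eq. 2.1 sketch: for each node v, AGGR = mean of neighbor features,
    UPDATE = ReLU(x_v W_self + aggr W_neigh)."""
    out = np.zeros((X.shape[0], W_self.shape[1]))
    for v in range(X.shape[0]):
        aggr = np.mean(X[list(neighbors[v])], axis=0)                 # AGGR over N_v
        out[v] = np.maximum(X[v] @ W_self + aggr @ W_neigh, 0.0)      # UPDATE + ReLU
    return out

# Toy path graph 0-1-2 with 2-dimensional features
neighbors = {0: [1], 1: [0, 2], 2: [1]}
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
H = node_centric_layer(X, neighbors, np.eye(2), np.eye(2))
```

With identity weights, node 0's output is x_0 + mean({x_1}) = [1, 1]; the per-node loop makes the redundancy of recomputing shared neighborhoods across target nodes easy to see.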
UPDATE(·) can be implemented by matrix multiplication with the layer-ℓ weight matrices.

Computing under the node-centric perspective enables us to flexibly form minibatches, and is suitable when the number of target nodes is much smaller than the graph size |V|. However, as discussed in Section 1.2, it incurs significant redundancy in computation, reflected by "neighborhood explosion". Sampling methods such as [37, 19, 44] follow the node-centric perspective. They sample from the neighborhood of each node v to reduce the amount of GNN computation. See Section 4.1.1 for details.

2.1.2 Graph-centric Perspective

The operation of a layer is defined by jointly considering the behavior of all nodes in the layer:

X^{(ℓ)} = UPDATE( X^{(ℓ-1)}, AGGR(V^{(ℓ-1)}, E^{(ℓ-1)}, X^{(ℓ-1)}) )    (2.2)

where AGGR(·) can often be expressed as matrix operations between the adjacency matrix A and the feature matrix X^{(ℓ-1)}.

Computing under the graph-centric perspective addresses redundancy since every layer deals with the same set of nodes and edges (V^{(ℓ)} = V and E^{(ℓ)} = E). However, the computation is only efficient if most nodes in V are target nodes and the full graph information V^{(ℓ)}, E^{(ℓ)} and X^{(ℓ)} fits in the fast memory of the hardware.

Many GNN models [50, 78, 84, 113, 54, 6] are initially defined under the graph-centric perspective. It also helps with analyzing the model behavior from the scope of the full graph (e.g., oversmoothing).

2.1.3 Layerwise Node-centric Perspective

Following Equation 2.1, when computing x_v^{(L)}, we recursively expand the neighbors until the GNN has visited all of N_v^L. The naïve recursive expansion can be further optimized to reduce redundancy, by performing node-centric computation in a "layerwise" fashion. We first use an example to illustrate such redundancy. Consider a target node v and two 1-hop neighbors u and w. Suppose node p is 2 hops away from v and connects to both u and w. Suppose the GNN has 3 layers (L = 3).
Then the computation of x_u^{(2)} requires x_p^{(1)}, and x_p^{(1)} is in turn computed from {x_q^{(0)} | q ∈ N_p}. Similarly, x_w^{(2)} requires the computation from {x_q^{(0)} | q ∈ N_p} as well. Under the vanilla node-centric perspective, neighbor propagation from the same {x_q^{(0)} | q ∈ N_p} is computed twice, once for x_u^{(2)} and once for x_w^{(2)}, causing redundancy.

To address this issue, we observe that the layer-ℓ outputs of an L-layer GNN correspond to N_v^{L-ℓ} of the target node v. Thus, instead of letting each x_u^{(ℓ)} independently perform the recursion (Equation 2.1), the ℓ-th layer of the model generates all {x_u^{(ℓ)} | u ∈ N_v^{L-ℓ}} at once, by propagating from all {x_u^{(ℓ-1)} | u ∈ N_v^{L-ℓ+1}} in a single pass. Essentially, we form a bipartite graph with the two parts being N_v^{L-ℓ+1} and N_v^{L-ℓ}, and the layer propagates along all the bipartite edges. Note that we still categorize such a style of operation as "node-centric" since all N_v^ℓ are relative to v.

In practice, however, the layerwise node-centric computation does not significantly reduce the complexity of the vanilla node-centric computation. This is simply because the size of N_v^ℓ often still grows at an exponential rate with respect to ℓ when the full graph is massive. Thus, the "neighborhood explosion" challenge persists.

Layerwise node-centric computation can be extended to a minibatch setting where a minibatch contains a set T of multiple target nodes. Then {x_v^{(L)} | v ∈ T} is generated at once by the last layer following the bipartite graph connections. If the nodes in T are randomly selected from V, then N_v^ℓ and N_u^ℓ barely overlap for any u, v ∈ T, so constructing minibatches does not help reduce complexity. However, if T is selected by a biased sampler such that all minibatch nodes are close to each other, then the neighborhoods of different target nodes significantly overlap with each other.
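The frontiers N_v^ℓ that drive the layerwise schedule can be computed by a truncated breadth-first search; this is a generic sketch (not the exact implementation used in our works), with a toy path graph as an assumed input:

```python
from collections import deque

def k_hop(neighbors, v, k):
    """Return N_v^k: all nodes u with dist(u, v) <= k, via truncated BFS."""
    dist = {v: 0}
    frontier = deque([v])
    while frontier:
        u = frontier.popleft()
        if dist[u] == k:          # stop expanding once k hops are reached
            continue
        for w in neighbors[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                frontier.append(w)
    return set(dist)

# Path graph 0-1-2-3: the 2-hop neighborhood of node 0 is {0, 1, 2}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
```

On high-degree graphs |N_v^ℓ| returned by such a BFS grows roughly exponentially in ℓ, which is exactly the "neighborhood explosion" discussed above.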
In the extreme case, N_u^ℓ = N_v^ℓ for any u, v ∈ T, and the layerwise node-centric computation on G is equivalent to a graph-centric computation on the subgraph formed by T (see "subgraph-centric" in Chapter 3). This becomes the intuition behind our GraphSAINT design, and also bridges the gap between the node-centric and graph-centric perspectives.

Layerwise node-centric computation has been used in many practical implementations of GNNs [37, 92] to improve on the vanilla node-centric computation. In addition, the works [18, 114] adopt the layerwise perspective to compute sampled minibatches during training. See Section 4.1.2 for detailed discussion.

2.1.4 Example GNN Models

We consider four widely used GNNs as representative models: Graph Convolutional Network (GCN) [50], GraphSAGE [37], MixHop [6] and Graph Attention Network (GAT) [78]. We first introduce the GraphSAGE model architecture in detail, and then summarize the layer operations of the other three. We mainly focus on the graph-centric perspective since it is closely related to our subgraph-centric computation paradigm; for GraphSAGE we also give the node-centric form.

Each GraphSAGE layer contains two learnable weight matrices: the self-weight W_◦ and the neighbor-weight W_⋆. The forward propagation of a layer is defined by:

X^{(ℓ)} = ReLU( X^{(ℓ-1)} · W_◦^{(ℓ)} ∥ Â · X^{(ℓ-1)} · W_⋆^{(ℓ)} )    (2.3)

Or equivalently:

x_v^{(ℓ)} = ReLU( (W_◦^{(ℓ)})^T · x_v^{(ℓ-1)} ∥ (1/deg(v)) · (W_⋆^{(ℓ)})^T · Σ_{u∈N_v} x_u^{(ℓ-1)} )    (2.4)

where "∥" is the column-wise matrix concatenation operation, and Â is the normalized adjacency matrix. The normalization can be calculated as Â = D^{-1} · A, where A is the binary adjacency matrix of G and D is the diagonal degree matrix of A (i.e., D_{ii} = deg(i)). From Equation 2.3, each layer performs two key operations:

1. Feature aggregation / neighbor propagation (via AGGR(·)): each layer-ℓ node collects the features of its layer-(ℓ-1) neighbors and calculates their weighted sum, as shown by Â · X^{(ℓ-1)}.

2.
Weight transformation (via UPDATE(·)): the aggregated neighbor features are transformed (via matrix multiplication) by W_⋆^{(ℓ)}. The features of a layer-(ℓ-1) node itself are transformed by W_◦^{(ℓ)}.

After obtaining the node representation vectors from the outputs of the last GNN layer, we can perform various downstream tasks by analyzing the embedding vectors. For example, we can use a simple Multi-Layer Perceptron (MLP) to classify the graph nodes into C classes. Let L be the total number of GNN layers, so that X^{(L)} is the final node embedding. Following the design of [37, 18, 44], the classifier MLP generates the node predictions by:

X_MLP = ReLU( X^{(L)} · W_MLP )    (2.5)

Ȳ = σ(X_MLP)    (2.6)

where W_MLP ∈ R^{f^{(L)}×C}. The function σ(·) is the element-wise sigmoid or row-wise softmax generating the probability of a node belonging to each class.

Under the supervised learning setting, each node of V is also provided with ground-truth class label(s). Let Y ∈ R^{|V|×C} be the binary matrix encoding the ground-truth labels. Comparing the prediction with the ground truth, we obtain a scalar loss value L by the cross-entropy (CE) loss function:

L = CE( Ȳ, Y )    (2.7)

For the other three types of GNNs under consideration, we need to update Equation 2.3 with different forward propagation rules. Specifically, for GCN [50], the main difference from GraphSAGE is that there is no explicit term X^{(ℓ-1)} · W_◦^{(ℓ)} to capture the influence of a node on itself. Instead, the self-influence is propagated by adding a self-connection in the graph. Therefore, the adjacency matrix becomes I + A and the normalization is performed differently. The forward propagation of each layer is as follows:

X^{(ℓ)} = ReLU( Ã · X^{(ℓ-1)} · W^{(ℓ)} )    (2.8)

where Ã is a symmetrically normalized adjacency matrix calculated by Ã = (I + D)^{-1/2} · (I + A) · (I + D)^{-1/2}, and I is the identity matrix.
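The two layer types above (Equations 2.3 and 2.8) can be sketched compactly with dense toy matrices; the triangle graph, feature values and identity weights are illustrative assumptions, not part of any benchmark:

```python
import numpy as np

def graphsage_layer(A, X, W_self, W_neigh):
    """Eq. 2.3: concatenate self term X W_self with neighbor term (D^-1 A) X W_neigh."""
    A_hat = A / A.sum(axis=1, keepdims=True)        # row-normalized adjacency D^-1 A
    return np.maximum(np.hstack([X @ W_self, A_hat @ X @ W_neigh]), 0.0)

def gcn_layer(A, X, W):
    """Eq. 2.8: Ã X W with Ã = (I+D)^(-1/2) (I+A) (I+D)^(-1/2)."""
    A_self = A + np.eye(A.shape[0])                 # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_self.sum(axis=1))  # (1 + deg(v))^(-1/2)
    A_tilde = d_inv_sqrt[:, None] * A_self * d_inv_sqrt[None, :]
    return np.maximum(A_tilde @ X @ W, 0.0)

# Triangle graph: every node has degree 2
A = np.ones((3, 3)) - np.eye(3)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
H_sage = graphsage_layer(A, X, np.eye(2), np.eye(2))
H_gcn = gcn_layer(A, X, np.eye(2))
```

With identity weights on the triangle graph, Ã becomes the all-1/3 matrix, so every GCN output row equals the column mean of X, a small numerical hint of the smoothing behavior analyzed in Section 2.2.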
For MixHop [6], each layer propagates influence from nodes up to K hops away (i.e., u is said to be K hops away from v if the shortest path from u to v has length K). The forward propagation of each layer is defined as:

X^{(ℓ)} = ReLU( ∥_{k=0}^{K} Ã^k · X^{(ℓ-1)} · W_k^{(ℓ-1)} )    (2.9)

where "∥" is again the matrix concatenation operation, Ã^k is the symmetrically normalized adjacency matrix raised to the power of k, and the "order" K is a hyperparameter of the model.

For GAT [78], instead of aggregating the previous-layer features (i.e., X^{(ℓ-1)}) using a fixed adjacency matrix (i.e., Ã in GCN or Â in GraphSAGE), each GAT layer learns the weights of the adjacency matrix as "attention". The forward propagation of a GAT layer is specified as:

X^{(ℓ)} = ReLU( A_att^{(ℓ-1)} · X^{(ℓ-1)} · W^{(ℓ)} )    (2.10)

where each element of the attention adjacency matrix A_att^{(ℓ-1)} is calculated as:

[A_att^{(ℓ-1)}]_{u,v} = softmax( LeakyReLU( a^T · ( W^{(ℓ)} · x_u^{(ℓ-1)} ∥ W^{(ℓ)} · x_v^{(ℓ-1)} ) ) )    (2.11)

where a is a learnable vector and x_u denotes the feature vector of node u (i.e., the u-th row of the feature matrix X^{(ℓ-1)}). Here softmax(·) normalizes each row of A_att^{(ℓ-1)} so that it sums to 1. As an extension, Equation 2.10 can be modified to support "multi-head" attention. Since the computation pattern of multi-head GAT is the same as that of the single-head version captured by Equation 2.10, and our parallelization strategy easily extends to the multi-head case, we restrict the discussion on GAT to Equation 2.10.

In summary, considering all four models, the full forward propagation during training takes X as the input and generates the loss L as the output by traversing the GNN layers, the classifier layers and the loss layer. After obtaining L, we perform backward propagation from the loss layer all the way to the first GNN layer and update the weights by the gradients, which are computed by the chain rule.
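The supervised forward pass just summarized (GNN layers, MLP classifier with row-wise softmax, cross-entropy loss; Equations 2.5–2.7) can be sketched end to end as follows; the two-layer GCN-style model, the random weights and the toy one-hot labels are assumptions for illustration only:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, shifted for numerical stability."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def forward_loss(A_tilde, X, W1, W2, W_mlp, Y):
    """X -> GNN layer 1 -> GNN layer 2 -> MLP classifier -> CE loss."""
    H1 = np.maximum(A_tilde @ X @ W1, 0.0)
    H2 = np.maximum(A_tilde @ H1 @ W2, 0.0)
    P = softmax(np.maximum(H2 @ W_mlp, 0.0))          # class probabilities per node
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

rng = np.random.default_rng(0)
A_tilde = np.full((3, 3), 1.0 / 3.0)                  # a tiny normalized adjacency
X = rng.normal(size=(3, 4))
Y = np.eye(3)[[0, 1, 0]]                              # one-hot labels, C = 3
loss = forward_loss(A_tilde, X, rng.normal(size=(4, 4)),
                    rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), Y)
```

Backward propagation would then differentiate this scalar loss with respect to W1, W2 and W_mlp via the chain rule; in practice that step is delegated to an autodiff framework.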
In Section 6.2, we analyze the computation in backward propagation and propose parallelization techniques for each of the key operations.

2.2 Expressive Power of GNNs

The expressive power of GNNs, like that of other neural networks, can be measured by their ability to approximate some class of target functions. Specifically for GNNs, such target functions can be related to how the two sources of graph information (node features and graph topology) are meaningfully manipulated.

2.2.1 Graph Convolution

One of the fundamental GNN models, the Graph Convolutional Network (GCN) [50], is inspired by classic graph signal processing techniques. Recall that convolution on a 2D image can be equivalently defined in the spatial domain as well as in the frequency domain, where the spatial-domain operation slides the filters along the horizontal and vertical directions. For graphs, however, due to the irregular edge connections, there is no good notion of horizontal or vertical directions, and thus convolution can only be well defined in the frequency domain.

The spectral graph convolution is defined via the graph Laplacian, a normalized form of the binary adjacency matrix A:

L = I - D^{-1/2} A D^{-1/2} = U Λ U^T    (2.12)

where the second equality is obtained by eigen-decomposition of the symmetric matrix L. Each column of the square matrix U is an eigenvector of L, and the diagonal matrix Λ consists of all the eigenvalues.

We consider the case where each node has a 1-D feature, so that the graph signal to be convolved is x ∈ R^{|V|}. The Fourier transform on the graph is defined as:

x̃ = U^T x    (2.13)

And the inverse Fourier transform is defined as:

U x̃ = U U^T x = x    (2.14)

where U U^T = I because U is an orthogonal matrix.
Thus, graph convolution with a spatial filter y is equivalent to spectral convolution with parameters θ:

y ∗ x = U · ( (U^T y) ◦ (U^T x) ) = U · ( θ ◦ (U^T x) ) = U · diag(θ) · U^T x    (2.15)

We can further parameterize θ by the eigenvalues Λ as θ = h_{θ'}(Λ), and approximate h_{θ'} with an order-K polynomial of Λ, h_{θ'} ≈ Σ_{i=0}^{K} θ'_i Λ^i. Then,

U · diag(θ) · U^T x ≈ U · ( Σ_{i=0}^{K} θ'_i Λ^i ) · U^T x = Σ_{i=0}^{K} θ'_i L^i x    (2.16)

where we note that L^i = (U Λ U^T) · (U Λ U^T) ⋯ = U Λ^i U^T since U^T U = I. Equation 2.16 shows that the spectral convolution can be efficiently computed by K rounds of neighbor propagation.

Specifically, each layer of the GCN model [50] implements K = 1 with the additional constraint θ'_0 = -2θ'_1. Ignoring the non-linear activations, a K-layer GCN can implement Equation 2.16 of order K. Each layer of the MixHop [6] model implements K > 1 and thus enables multi-hop propagation within a single layer.

From the graph signal processing perspective, [84] shows that when K is large, the graph convolution acts as a low-pass filter on the graph signal x. In the extreme case K → ∞, only the lowest-frequency component is preserved in the output of the convolution. Thus oversmoothing happens in a deep GCN. Adding residual connections [20] or modifying the graph adjacency matrix [84] can both change the filtering response of the graph convolution, and accordingly lead to different expressive power of the model. In our SHADOW-GNN, we theoretically analyze the effect of graph convolution on the subgraph of each target node, and prove that our SHADOW-GCN outperforms GCN in terms of expressive power.

2.2.2 Graph Isomorphism

Two graphs G_1 and G_2 are isomorphic if there exists a permutation π(·) on the node set such that V_1 = π(V_2) and E_1 = π(E_2). Equivalently, if G_1 and G_2 are isomorphic, there exists a permutation matrix P such that the two adjacency matrices satisfy A_1 = P A_2 P^T. The graph isomorphism test is a useful tool to characterize GNN expressive power.
When we recursively expand the multi-hop neighbors of a target node v, we essentially build an L-level subtree rooted at v. If the two subtrees rooted at u and v are isomorphic, then the two representation vectors should be the same – this is easily satisfied by many GNNs following Equation 2.1. On the other hand, if the subtrees rooted at u and v are non-isomorphic, then the two node representations should be different so that we can use the representation vectors to distinguish the two nodes – many popular GNNs (e.g., [37, 50]) are not expressive enough to satisfy this property.

While no polynomial-time algorithm is known for testing isomorphism in general [33, 32, 11], there exists an efficient algorithm, the 1-dimensional Weisfeiler-Lehman test (1-WL), that checks isomorphism for a wide class of graphs. 1-WL is an iterative algorithm: in each iteration, neighbor propagation is performed to update the "colors" (i.e., features) of all the graph nodes. The algorithm terminates when the colors of all nodes no longer change, and two graphs are reported as isomorphic if they have the same multiset of colors in the final iteration. Unlike a GNN, the 1-WL neighbor propagation does not involve any learnable parameters. 1-WL fails on certain graphs: for example, it cannot distinguish two non-isomorphic d-regular graphs (in a d-regular graph, all nodes have degree d). It has been shown that the expressivity of GNNs following Equation 2.1 (on the full graph) is upper-bounded by 1-WL, and the Graph Isomorphism Network (GIN) [88] is a GNN model achieving this upper bound. In our SHADOW-GNN, we show that by applying the existing layer operation of GIN on customized subgraphs around the target nodes, the expressive power of SHADOW-GIN exceeds that of GIN.

2.3 Efficient GNN Variants

To overcome the challenge of high computation complexity, many works have studied simplified GNN models which effectively turn the feature aggregation step into preprocessing.
Take the Simplified Graph Convolution (SGC) [84] model as an example. The simplification of SGC is to remove the non-linear activation of each GCN layer, so that the layer operations collapse into:

X^{(L)} = Ã X^{(L-1)} W^{(L)} = Ã ( Ã X^{(L-2)} W^{(L-1)} ) W^{(L)} = Ã^L X^{(0)} W'    (2.17)

where Ã^L denotes Ã raised to the L-th power, and W' = ∏_{i=1}^{L} W^{(i)}. The main benefit of the SGC model is that the propagation over L-hop neighbors does not depend on the learnable weight parameters. Thus, during training, even though the weights are updated every iteration, the term Ã^L X^{(0)} stays fixed. We can therefore compute Ã^L X^{(0)} once as preprocessing and then run training for many iterations until the weights converge. The training process does not involve Ã at all, and the computation of SGC is similar to that of a normal MLP.

Following the idea of SGC, subsequent works such as SIGN [30] have been proposed. Instead of using only the result of L rounds of propagation, [30] concatenates all Ã^ℓ X^{(0)} for 0 ≤ ℓ ≤ L as the preprocessing outputs. Thus, the model preserves information from both nearby and far-away neighbors. While efficient, such simplified models sacrifice expressive power since the neighbor propagation produces fixed results independent of the trained weights. The development of efficient and expressive models is still highly desirable.

Another line of research reduces the GNN multi-layer computation to variations of the label propagation algorithm [87, 86, 42]. For example, C&S [42] proposes a post-processing step after normal GNN training, in which the labels predicted by the trained GNN model are smoothed via label propagation to improve accuracy.
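A generic sketch of the iterative propagation at the heart of such label-smoothing methods is the following fixed-point iteration (the choice of α, the iteration count and the row-normalized adjacency are illustrative assumptions, not the exact C&S or GraphHop formulation):

```python
import numpy as np

def smooth_labels(A_hat, Z0, alpha=0.8, iters=50):
    """Iterate Z <- alpha * A_hat Z + (1 - alpha) * Z0: each node's prediction
    is pulled toward its neighbors' while staying anchored to the initial Z0."""
    Z = Z0.copy()
    for _ in range(iters):
        Z = alpha * A_hat @ Z + (1 - alpha) * Z0
    return Z

# Row-normalized adjacency of a triangle graph; Z0 holds soft predictions
A_hat = (np.ones((3, 3)) - np.eye(3)) / 2.0
Z0 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
Z = smooth_labels(A_hat, Z0)
```

Since A_hat is row-stochastic, each row of Z remains a valid probability vector, and the iteration contracts toward the unique fixed point Z = α Â Z + (1-α) Z0.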
GraphHop [87] and GraphHop++ [86] formulate a constrained optimization problem jointly on the node features and labels, where the objective is to predict the (small fraction of) known labels accurately while maintaining smoothness of the predictions. The solutions to such optimization problems are obtained via an iterative process performing weighted propagation of known labels and pseudo-labels. Such algorithms have low computation complexity and can be efficiently executed on commercial hardware. They achieve good accuracy on graphs with high homophily.

2.4 GNNs Enhanced by Graph Analytics

Traditional graph analytics rely heavily on heuristics and hand-tuned metrics, while the emerging GNNs leave the discovery of salient graph features to an automated training process. Despite these differences, classic graph analytics algorithms can be combined with GNNs to serve as useful inductive bias. For example, many GNN variants benefit from the Personalized PageRank (PPR) algorithm [10]. GDC [52] views neighbor propagation as an information diffusion process. It proposes to rewire the edges of the input graph to improve the diffusion quality: it computes PPR from all nodes and creates a new graph whose edges connect the pairs of nodes with the highest PPR scores. The new graph then serves as the input to existing GNN models for their layer operations. APPNP [51] modifies the neighbor propagation process to simulate the iterative update procedure of the PPR algorithm – one layer is analogous to one iteration. PPRGo [16] improves the efficiency of APPNP: it runs the PPR algorithm directly until convergence, and then propagates the features of the neighbors with high PPR scores to the target node via direct shortcut connections.

Inspired by the above works, our completed works also combine graph analytics algorithms with GNN models. GraphSAINT (Chapter 4) uses various random walk algorithms to construct training subgraphs.
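A minimal sketch of such a random-walk-based subgraph construction (the uniform root selection, walk length and induced-edge step here are simplifying assumptions, not the exact GraphSAINT samplers of Chapter 4):

```python
import random

def random_walk_subgraph(neighbors, num_roots, walk_len, seed=0):
    """Sample V_s as the union of short random walks from random roots,
    then keep the induced edge set E_s."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    V_s = set()
    for _ in range(num_roots):
        v = rng.choice(nodes)         # pick a walk root
        V_s.add(v)
        for _ in range(walk_len):     # take a short walk, collecting nodes
            v = rng.choice(neighbors[v])
            V_s.add(v)
    # induced edges: both endpoints must be inside V_s (u < w avoids duplicates)
    E_s = [(u, w) for u in V_s for w in neighbors[u] if w in V_s and u < w]
    return V_s, E_s

# 6-cycle toy graph
cycle = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
V_s, E_s = random_walk_subgraph(cycle, num_roots=2, walk_len=2)
```

The GNN then treats (V_s, E_s) as its entire input graph, with every node of V_s as a target, which is exactly the subgraph-centric paradigm of Chapter 3.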
SHADOW-GNN (Chapter 5) defines the relevance of a neighbor to the target node by its PPR score and accordingly specifies the scope of the GNN model. Our works further improve the scalability of the above existing works.

2.5 Computing Platforms

We consider GNN execution on a single machine, which consists of DRAM memory, a host CPU and/or an accelerator card attached via PCIe. The CPU has a number of processing cores which can execute different computation tasks in parallel (task-level parallelism). Each processing core can also execute vector instructions (instruction-level parallelism). A cache hierarchy enables the CPU to quickly access frequently used data without going to the slow DRAM. The accelerator card (a GPU or an FPGA) has access to its own memory as well as to the DRAM (via PCIe), where an access to DRAM can be orders of magnitude slower than an access to the accelerator memory. We assume that the full graph fits in DRAM but may not fit in the CPU cache or the accelerator memory. If the full graph is too large, sampling can return small subgraphs which fit in the fast memory, thus addressing the performance bottleneck due to data communication.

An FPGA accelerator contains various types of hardware resources, including Digital Signal Processing units (DSPs), on-chip Block RAMs (BRAMs), Configurable Logic Blocks (CLBs) and programmable interconnections. The DSPs can implement various hardware IP cores to efficiently perform arithmetic operations (e.g., floating-point multiplication). Each BRAM block supports one random read and one random write per cycle. For GNNs, where propagating neighbor features can incur random accesses, the on-chip data layout needs to be customized to avoid extra BRAM latency. The programmable logic and interconnections enable the development of fine-grained, application-specific processing pipelines. For example, we can instantiate separate hardware modules to execute AGGR(·) and UPDATE(·).
The resources allocated to the two modules can be configured based on the overall load balance.

2.6 Accelerators for Deep Learning and Graph Processing

Accelerators for CNNs. Various FPGA-based accelerators [109, 36, 34, 79, 68] have been proposed to train CNNs. The work in [109] proposes a modular design based on the types of layer operations, and improves performance via reconfiguration. The work in [34] proposes a scalable framework to train CNNs on multi-FPGA clusters. Its partitioning and mapping strategy ensures load balance. The works in [36, 68] accelerate training by model compression. The reduced model size alleviates the burden on BRAM and thus improves resource utilization. Although many GNN models can be seen as extensions of CNNs to graphs, the challenges in accelerating GNNs are significantly different. GNNs require both sparse and dense matrix operations. Apart from intensive computation, GNN accelerators need to address issues such as irregular memory access and load balance.

Accelerators for traditional graph analytics. Many accelerator designs [23, 24, 112, 101, 48] have been proposed for traditional graph analytics problems, such as PageRank (PR), single source shortest path (SSSP) and breadth first search (BFS). The works in [23, 112] process input graphs in a two-phase gather-scatter manner, and utilize graph partitioning to improve access locality. The work in [24] extends the above memory optimizations to multi-FPGA platforms, and the works in [101, 48] propose optimizations specific to HBM and HMC to further boost FPGA performance. The memory optimizations proposed in the above works do not directly apply to GNN accelerators. First, traditional graph analytics often propagates scalars along the graph edges, while GNNs propagate long feature vectors. Thus, although memory accesses are irregular in both cases, the access patterns are very different.
Second, traditional graph analytics often propagates information within the full graph, while GNNs propagate within minibatches. Thus, techniques such as graph partitioning may not be effective, since the minibatch size is much smaller than the full graph size.

Accelerators for GNNs. To accelerate GNN training, our works in [96, 98] propose parallelization techniques for the multi-core CPU platform. We use feature partitioning to improve the cache hit rate of each processing core, and propose dedicated data structures for parallel node sampling according to a probability distribution. For GPU execution, many works [50, 37, 44, 19] perform parallelization by directly utilizing deep learning frameworks such as TensorFlow [5] and PyTorch [69]. Software frameworks/libraries for deep graph learning, such as PyTorch Geometric (PyG) [29] and Deep Graph Library (DGL) [81, 110], have also been developed; they extend native PyTorch with dedicated optimizations for GNN operations. PyG optimizes the scatter-gather operations on GPUs so that the sparse operations of neighbor propagation can be accelerated by utilizing the massive number of GPU cores. DGL is capable of scheduling and load balancing among multiple GPU devices, thus enabling distributed GNN computation on large graphs. While frameworks such as PyG and DGL are very powerful and general, they mainly focus on system-level optimizations. In this dissertation, we follow an algorithm-model-architecture co-optimization approach to achieve better performance and hardware utilization.

Chapter 3
Overview and Categorization of Proposed Works

In Section 2.1, we categorize GNN computation into the node-centric, graph-centric and layerwise node-centric perspectives. Each of these existing perspectives has its limitations in terms of scalability.
Our completed works are based on a new subgraph-centric perspective: given the full input graph G, we first construct subgraphs G_s. The GNN does not see G and only views each G_s as the new full graph. Since each G_s is small, the GNN can practically operate following the graph-centric perspective on top of a subgraph. Both GraphSAINT and SHADOW-GNN follow the subgraph-centric computation; their difference lies in how each subgraph is constructed. We further categorize the subgraph construction process into graph-centric and node-centric.

• Graph-centric subgraph construction: The objective is to construct each G_s such that it preserves the key characteristics of G (e.g., degree distribution). Learning on G can then be approximated by learning on G_s. The subgraph sampling of GraphSAINT follows this way.

• Node-centric subgraph construction: The objective is to construct each G_s such that it preserves the key characteristics of a specific node v (and thus we denote the subgraph as G_[v]). Learning on G_[v] can then facilitate the generation of v's representation vector. SHADOW-GNN follows this way to construct the GNN scope.

From the acceleration point of view, the exact perspective used to construct subgraphs does not matter, as long as the subgraphs are small enough to fit in the fast memory of the accelerators. Thus, our parallelization and acceleration works, ParaGNN and GraphACT, can be applied to the general subgraph-centric GNN computation.
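The basic operation shared by both construction styles is node-induced subgraph extraction; a minimal sketch, where the toy graph and node set are illustrative assumptions:

```python
# Sketch: node-induced subgraph extraction, the basic operation behind the
# subgraph-centric perspective (the GNN then treats G_s as its full graph).

def induced_subgraph(adj, Vs):
    """Return the adjacency of the subgraph induced by node set Vs:
    keep only nodes in Vs and only edges with both endpoints in Vs."""
    return {v: [u for u in adj[v] if u in Vs] for v in Vs}

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1, 4], 4: [3]}
Gs = induced_subgraph(adj, Vs={0, 1, 2})
```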
[Figure 3.1 is a diagram relating the computation perspectives along the axes of algorithmic complexity and hardware utilization.]

Figure 3.1: Evolution of various computation perspectives on GNNs and their tradeoffs between algorithmic complexity and hardware utilization (GC: Graph-Centric; NC: Node-Centric; LNC: Layerwise Node-Centric; GC subgraph: graph-centric subgraph construction, as in GraphSAINT; NC subgraph: node-centric subgraph construction, as in SHADOW-GNN; subgraph-centric acceleration: ParaGNN and GraphACT)

Figure 3.1 conceptually compares the evolution of the various computation perspectives under the efficiency metrics discussed in Section 1.3. According to Section 2.1, graph-centric computation is converted to subgraph-centric via subgraph sampling on the full graph. Node-centric computation is converted to layerwise node-centric by switching the order of computation from node-by-node to layer-by-layer. LNC reduces to NC when the neighbors of the nodes in N_v^ℓ rarely overlap, and reduces to SC when the neighbors fully overlap (potentially after sampling).

Chapter 4
Efficient Algorithms: A Novel Training Paradigm via Subgraph Sampling

In this chapter, we discuss two efficient algorithms based on the subgraph-centric computation developed by us. The first algorithm, GraphSAINT, proposes a minibatch training algorithm that samples small subgraphs from the large input graph. While many sampling methods have been proposed in the literature, they perform sampling on the neighborhood of each target node or in a layer-by-layer fashion. In contrast, GraphSAINT performs sampling at the subgraph level, which leads to significantly improved computation complexity and convergence quality. The second algorithm further reduces the redundancy in subgraph-level information propagation among common neighbors. It uses a graph theoretic approach to transform the raw subgraph into an aggregation graph, where propagating within the aggregation graph requires fewer vector operations as well as fewer memory accesses.
4.1 Sampling Methods for GNNs

To improve the speed of training GNNs on large graphs, various sampling methods have been proposed in the literature. The common goal of these methods is to reduce the neighborhood size so that the minibatch computation is less expensive. The key challenge is that the edges impose unstructured dependencies among nodes. Thus, different from sampling on other types of Euclidean data (e.g., images), sampling for GNNs cannot be performed in an i.i.d. fashion. The sampling process can introduce bias and variance into the minibatch estimators, affecting the convergence quality of training. We theoretically address the bias and variance issues in our GraphSAINT framework. We first give an overview of the two popular styles of GNN sampling in the literature.

4.1.1 Node Sampling

Works such as GraphSAGE [37], VRGCN [19] and PinSAGE [92] perform sampling on each of the ℓ-hop neighborhoods ($1 \le \ell \le L$) of the target node $v$. Sampling reduces the computation complexity by decreasing the size of $\mathcal{N}^\ell_v$ for each target node. Specifically, GraphSAGE [37] samples the neighbors of each ℓ-hop node uniformly at random according to a pre-defined sampling budget $b^{(\ell)}$ for each layer ℓ. Its computation complexity is $O\big(\prod_{\ell=1}^{L} b^{(\ell)}\big)$, which can be controlled by $b^{(\ell)}$. VRGCN [19] further restricts the neighborhood size by requiring only two support nodes in the previous layer. The idea is to use stored historical activations from previous training epochs to approximate the activations of the current iteration. Its computation complexity is further reduced to $O(2^L)$, at the cost of $O(|\mathcal{V}|\cdot fL)$ memory overhead. PinSAGE [92] applies the GraphSAGE algorithm to real-world web-scale recommendation systems at Pinterest, with a similar sampling mechanism. While node sampling improves the speed of minibatch training on large graphs, the computation complexity is still exponential with respect to the GNN depth.
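A minimal sketch of budgeted per-hop neighbor sampling in the style described above; the function name, toy graph and budgets are assumptions for illustration. Note how the frontier grows by roughly a factor of $b^{(\ell)}$ per hop, which is the source of the exponential complexity:

```python
import random

def sample_neighborhood(adj, target, budgets, seed=0):
    """Sample up to budgets[l] neighbors per node at each hop away from the
    target. Returns the list of sampled frontiers (target at layer 0)."""
    random.seed(seed)
    layers = [[target]]
    for b in budgets:
        frontier = []
        for v in layers[-1]:
            nbrs = adj[v]
            k = min(b, len(nbrs))
            frontier.extend(random.sample(nbrs, k))  # uniform, without replacement
        layers.append(frontier)
    return layers

adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
layers = sample_neighborhood(adj, target=0, budgets=[2, 2])
```

With budgets [2, 2], the frontier sizes are 1, 2, 4: exactly the $b^L$ growth discussed above.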
Scalable sampling methods are yet to be developed by taking an alternative view of the sampling process.

4.1.2 Layer Sampling

Works such as FastGCN [18], LADIES [114] and AS-GCN [44] perform sampling in a layerwise fashion. The sampler considers all the $\mathcal{V}^{(\ell)}$ nodes of each layer ℓ and samples $b^{(\ell)}$ nodes according to a calculated probability distribution. Specifically, FastGCN [18] performs sampling on each set $\mathcal{V}^{(\ell)}$ independently, and the node sampling distribution for each layer is the same. The main drawback of such a sampler is the sparse connection among sampled nodes in adjacent layers. Due to the independent sampling, a layer-ℓ sampled node may not connect to any layer-(ℓ+1) sampled node, rendering the layer-ℓ sample useless. LADIES [114] improves FastGCN by addressing the sparse connection issue. LADIES performs "layer-dependent" sampling: the candidate nodes in layer ℓ−1 are defined as the neighbors of the layer-ℓ samples, instead of $\mathcal{V}^{(\ell-1)}$. Thus, all layer-(ℓ−1) samples are guaranteed to be connected to the layer-ℓ samples. The sampling probability of LADIES encourages the minibatch to contain more edge connections given fixed node budgets for each layer, leading to reduced variance and improved convergence quality. AS-GCN [44] improves FastGCN with an additional sampling neural network that lets the sampling probability evolve with the training process. Thus, the sampler can potentially construct higher-quality minibatches to further improve the learning quality. Layer sampling algorithms can potentially address "neighborhood explosion", e.g., by setting $b^{(\ell)}$ to a constant for all ℓ. However, FastGCN may suffer accuracy loss, and the sampling cost of LADIES and AS-GCN is significant. There is still room to further improve the scalability and accuracy of minibatch training.
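To illustrate the layer-dependent idea (as opposed to independent per-layer sampling), the sketch below restricts the layer-(ℓ−1) candidates to the neighbors of the layer-ℓ samples, so every drawn node is guaranteed to connect to the layer above. Uniform probabilities are used purely for simplicity; the actual methods use importance-weighted distributions, and the graph and function name are illustrative assumptions:

```python
import random

def layer_dependent_sample(adj, out_nodes, b, seed=0):
    """Layer-dependent candidate restriction (sketch): candidates for the
    lower layer are the union of neighbors of the already-sampled upper
    layer, guaranteeing connectivity between adjacent layers."""
    rng = random.Random(seed)
    candidates = sorted({u for v in out_nodes for u in adj[v]})
    return rng.sample(candidates, min(b, len(candidates)))

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
prev = layer_dependent_sample(adj, out_nodes=[0, 3], b=3)
# Every sampled node connects to at least one upper-layer node by construction.
connected = all(any(u in adj[v] for v in [0, 3]) for u in prev)
```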
4.2 GraphSAINT: Graph Sampling Based Inductive Learning Method

Subgraph sampling based minibatch training is motivated by the challenges in scalability (in terms of model depth and graph size). We analyze the bias (Section 4.2.2) and variance (Section 4.2.3) introduced by subgraph sampling, and accordingly propose feasible sampling algorithms (Section 4.2.4). We show the applicability of GraphSAINT to other architectures, both conceptually (Section 4.2.5) and experimentally (Section 4.4.5). Recall the notations defined in Table 2.1. $A$ is the adjacency matrix of an unweighted, undirected graph and $\hat{A}$ is the random-walk normalized form of $A$ (i.e., $\hat{A} = D^{-1}A$, where $D$ is the diagonal degree matrix). For brevity, our analysis only considers $A$ under the random-walk normalization (which is used in GraphSAGE [37] and PinSAGE [92]). Our techniques can be easily adapted to other kinds of normalization. $f^{(\ell)}$ is the dimension of the layer-ℓ node vector $x^{(\ell)}_v$. We consider the following definition of the layer operations:

$$x^{(\ell+1)}_v = \sigma\left(\sum_{u\in\mathcal{V}} \hat{A}_{v,u} \left(W^{(\ell)}\right)^{\!\top} x^{(\ell)}_u\right) \qquad (4.1)$$

where $\hat{A}_{v,u}$ is an element of $\hat{A}$, and $\sigma(\cdot)$ is the activation function (e.g., ReLU). We use the subscript "s" to denote parameters of the sampled graph (e.g., $\mathcal{G}_s$, $\mathcal{V}_s$, $\mathcal{E}_s$). Regarding the learning settings, note that GNNs can be applied under both the inductive and the transductive setting. While GraphSAINT is applicable to both, in the following analysis we focus only on inductive learning. It has been shown that inductive learning is especially challenging [37]: during training, neither the attributes nor the connections of the test nodes are present. Thus, an inductive model has to generalize to completely unseen graphs. An example of GraphSAINT under the transductive setting is included as a standard baseline in the Open Graph Benchmark [41].

4.2.1 Training via Subgraph Sampling

[Figure 4.1 depicts the full graph G, the sampled subgraph G_s = SAMPLE(G), and a full GNN constructed on G_s.]

Figure 4.1: Overview of the GraphSAINT training algorithm.
In each iteration, we sample a new subgraph G_s of G and then construct a full GNN on G_s, without considering the nodes in V \ V_s.

Algorithm 1 GraphSAINT training algorithm
Input: Training graph G(V, E, X); labels Y; sampler SAMPLE
Output: GNN model with trained weights
 1: Pre-processing: set SAMPLE parameters; compute normalization coefficients α, λ
 2: for each minibatch do
 3:     G_s(V_s, E_s) ← sampled subgraph of G according to SAMPLE
 4:     GNN construction on G_s
 5:     {ȳ_v | v ∈ V_s} ← forward propagation of {x_v | v ∈ V_s}, normalized by α
 6:     Backward propagation from the λ-normalized loss L(ȳ_v, y_v); update weights
 7: end for

GraphSAINT follows the design philosophy of directly sampling the training graph G, rather than sampling from the nodes V^(ℓ) of each GNN layer or from the ℓ-hop neighborhood N_v^ℓ. Our goals are to 1. extract appropriately connected subgraphs, so that little information is lost when propagating features within each subgraph, and 2. combine the information of many subgraphs together, so that the overall training process learns good representations corresponding to the full-graph neighborhood. As discussed in Chapter 3, we categorize the subgraph sampling of GraphSAINT as following the "graph-centric" perspective, meaning that all the subgraphs sampled during training should jointly reveal the characteristics of the full graph. This gives another way to understand goal 2 above (combining information across subgraphs). Figure 4.1 and Algorithm 1 illustrate the training algorithm. Before training starts, we perform light-weight pre-processing on G with a given sampler SAMPLE. The pre-processing estimates the probability of a node v ∈ V and an edge e ∈ E being sampled by SAMPLE. This probability is later used to normalize the subgraph neighbor aggregation and the minibatch loss (Section 4.2.2). Afterwards, training proceeds by iterative weight updates via SGD. Each iteration starts with an independently sampled G_s (where |V_s| ≪ |V|).
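The loop of Algorithm 1 can be sketched as the following skeleton. The sampler, forward pass and loss/update step are stand-in stubs (the α/λ normalizations are only indicated in comments), not the actual GraphSAINT implementation:

```python
import random

# Skeleton of the Algorithm 1 training loop; all callables are stubs.
def train(graph, sample, forward, loss_grad_step, num_iters):
    history = []
    for _ in range(num_iters):
        Gs = sample(graph)                          # line 3: sample a subgraph
        preds = forward(Gs)                         # lines 4-5: full GNN on Gs
                                                    #   (alpha-normalized aggregation)
        history.append(loss_grad_step(Gs, preds))   # line 6: lambda-normalized loss,
                                                    #   backward pass, weight update
    return history

# Toy instantiation: "graph" is just a node list; each iteration samples half.
random.seed(0)
graph = list(range(10))
sample = lambda g: random.sample(g, 5)
forward = lambda gs: {v: float(v) for v in gs}
loss_grad_step = lambda gs, preds: sum(preds.values()) / len(gs)
losses = train(graph, sample, forward, loss_grad_step, num_iters=3)
```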
We then build a full GNN on G_s to generate node representations and calculate the loss for every v ∈ V_s. In Algorithm 1, we show the task of node classification under the supervised setting, where each training node v comes with a ground truth label y_v. Intuitively, there are two requirements for SAMPLE: 1. Nodes having high influence on each other should be sampled in the same subgraph, so that the intra-subgraph neighbor propagation gives reasonable results. 2. Each edge in E should have a non-negligible probability of being sampled, so that its information can be picked up by some weight update iterations during training. For requirement (1), an ideal SAMPLE would consider the joint information from node connections as well as raw features. However, the resulting algorithm may have high complexity, as it would need to infer the relationships between features. For simplicity, we define "influence" from the graph connectivity perspective and design topology-based samplers. Requirement (2) leads to better generalization, since it enables the neural net to explore the full feature and label space.

4.2.2 Correcting the Sampling Bias

A sampler that preserves the connectivity characteristics of G will almost inevitably introduce bias into the minibatch estimation. In the following, we present normalization techniques to eliminate this bias. Analysis of the complete multi-layer GNN is difficult due to the non-linear activations. Thus, we analyze the embedding of each layer independently, similar to the layerwise treatment in prior work [18, 44]. Consider a layer-(ℓ+1) node $v$ and a layer-ℓ node $u$. If $v$ is sampled (i.e., $v\in\mathcal{V}_s$), we can compute the aggregated feature of $v$ as:

$$\zeta^{(\ell+1)}_v = \sum_{u\in\mathcal{V}} \frac{\hat{A}_{v,u}}{\alpha_{u,v}} \left(W^{(\ell)}\right)^{\!\top} x^{(\ell)}_u\, \mathbb{1}_{u|v} = \sum_{u\in\mathcal{V}} \frac{\hat{A}_{v,u}}{\alpha_{u,v}}\, \tilde{x}^{(\ell)}_u\, \mathbb{1}_{u|v}, \qquad (4.2)$$

where $\tilde{x}^{(\ell)}_u = \big(W^{(\ell)}\big)^{\!\top} x^{(\ell)}_u$, and $\mathbb{1}_{u|v} \in \{0,1\}$ is the indicator function given that $v$ is in the subgraph (i.e., $\mathbb{1}_{u|v}=0$ if $v\in\mathcal{V}_s \wedge (u,v)\notin\mathcal{E}_s$; $\mathbb{1}_{u|v}=1$ if $(u,v)\in\mathcal{E}_s$; $\mathbb{1}_{u|v}$ is not defined if $v\notin\mathcal{V}_s$).
We refer to the constant $\alpha_{u,v}$ as the aggregator normalization. Define $p_{u,v} = p_{v,u}$ as the probability of an edge $(u,v)\in\mathcal{E}$ being sampled in a subgraph, and $p_v$ as the probability of a node $v\in\mathcal{V}$ being sampled.

Proposition 1. $\zeta^{(\ell+1)}_v$ is an unbiased estimator of the aggregation of $v$ in the full $(\ell+1)$-th GNN layer if $\alpha_{u,v} = \frac{p_{u,v}}{p_v}$, i.e., $\mathbb{E}\big(\zeta^{(\ell+1)}_v\big) = \sum_{u\in\mathcal{V}} \hat{A}_{v,u}\, \tilde{x}^{(\ell)}_u$.

Proof. Under the condition that $v$ is sampled in a subgraph:

$$\begin{aligned}
\mathbb{E}\big(\zeta^{(\ell+1)}_v\big) &= \mathbb{E}\left(\sum_{u\in\mathcal{V}} \frac{\hat{A}_{v,u}}{\alpha_{u,v}}\,\tilde{x}^{(\ell)}_u\, \mathbb{1}_{u|v}\right) = \sum_{u\in\mathcal{V}} \frac{\hat{A}_{v,u}}{\alpha_{u,v}}\,\tilde{x}^{(\ell)}_u\, \mathbb{E}\big(\mathbb{1}_{u|v}\big) \\
&= \sum_{u\in\mathcal{V}} \frac{\hat{A}_{v,u}}{\alpha_{u,v}}\,\tilde{x}^{(\ell)}_u\, P\big((u,v)\ \text{sampled} \mid v\ \text{sampled}\big) \\
&= \sum_{u\in\mathcal{V}} \frac{\hat{A}_{v,u}}{\alpha_{u,v}}\,\tilde{x}^{(\ell)}_u\, \frac{P\big((u,v)\ \text{sampled}\big)}{P\big(v\ \text{sampled}\big)} = \sum_{u\in\mathcal{V}} \frac{\hat{A}_{v,u}}{\alpha_{u,v}}\,\tilde{x}^{(\ell)}_u\, \frac{p_{u,v}}{p_v}
\end{aligned} \qquad (4.3)$$

where the second equality is due to linearity of expectation, and the third equality (conditional edge probability) is due to the initial condition that $v$ is sampled in a subgraph. It directly follows that, when $\alpha_{u,v} = \frac{p_{u,v}}{p_v}$,

$$\mathbb{E}\big(\zeta^{(\ell+1)}_v\big) = \sum_{u\in\mathcal{V}} \hat{A}_{v,u}\, \tilde{x}^{(\ell)}_u. \qquad \square$$

Assuming that each layer independently learns an embedding, we use Proposition 1 to normalize the feature propagation of each layer of the GNN built by GraphSAINT. Further, let $L_v$ be the loss on $v$ in the output layer. The minibatch loss is calculated as $L_{\text{batch}} = \sum_{v\in\mathcal{G}_s} \frac{L_v}{\lambda_v}$, where $\lambda_v$ is a constant that we term the loss normalization. We set $\lambda_v = |\mathcal{V}|\cdot p_v$ so that:

$$\mathbb{E}(L_{\text{batch}}) = \frac{1}{|\mathbb{G}|} \sum_{\mathcal{G}_s\in\mathbb{G}} \sum_{v\in\mathcal{V}_s} \frac{L_v}{\lambda_v} = \frac{1}{|\mathcal{V}|} \sum_{v\in\mathcal{V}} L_v. \qquad (4.4)$$

Feature propagation within subgraphs thus requires the normalization factors $\alpha$ and $\lambda$, which are computed from the edge and node probabilities $p_{u,v}$ and $p_v$. In the case of the random node or random edge samplers, $p_{u,v}$ and $p_v$ can be derived analytically. For other samplers in general, closed-form expressions are hard to obtain. Thus, we perform pre-processing for estimation. Before training starts, we run the sampler repeatedly to obtain a set of $N$ subgraphs $\mathbb{G}$. We set up counters $C_v$ and $C_{u,v}$ for each $v\in\mathcal{V}$ and $(u,v)\in\mathcal{E}$, to count the number of times the node or edge appears in the subgraphs of $\mathbb{G}$.
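This counter-based pre-processing can be sketched as follows, with the normalization values derived from the counters as $\alpha_{u,v} = C_{u,v}/C_v$ and $\lambda_v = C_v/N$; the toy triangle-graph sampler is an assumption for illustration:

```python
import random
from collections import Counter

def estimate_norms(sample_fn, N):
    """Run the sampler N times; count node and edge occurrences; then set
    alpha[(u, v)] = C_{u,v} / C_v and lam[v] = C_v / N."""
    C_node, C_edge = Counter(), Counter()
    for _ in range(N):
        Vs, Es = sample_fn()          # Es: set of frozenset({u, v}) edges
        C_node.update(Vs)
        C_edge.update(Es)
    alpha = {}
    for e, c in C_edge.items():
        u, v = tuple(e)
        alpha[(u, v)] = c / C_node[v]
        alpha[(v, u)] = c / C_node[u]
    lam = {v: C_node[v] / N for v in C_node}
    return alpha, lam

# Toy sampler on a triangle graph: pick 2 of the 3 nodes, keep induced edges.
rng = random.Random(1)
edges = [frozenset((0, 1)), frozenset((1, 2)), frozenset((0, 2))]
def sample_fn():
    Vs = set(rng.sample([0, 1, 2], 2))
    return Vs, {e for e in edges if e <= Vs}

alpha, lam = estimate_norms(sample_fn, N=300)
```

Since an edge can only appear when both endpoints appear, every estimated α lies in (0, 1]; the λ values estimate per-node inclusion probabilities.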
Then we set $\alpha_{u,v} = \frac{C_{u,v}}{C_v}$ (noting $C_{u,v} = C_{v,u}$) and $\lambda_v = \frac{C_v}{N}$. The subgraphs $\mathcal{G}_s \in \mathbb{G}$ can all be reused as minibatches during training. Thus, the overhead of pre-processing is small.

4.2.3 Reducing the Sampling Variance

We derive samplers for variance reduction. Let $e$ be the edge connecting $u$ and $v$, and $\mathbf{b}^{(\ell)}_e = \hat{A}_{v,u}\tilde{x}^{(\ell-1)}_u + \hat{A}_{u,v}\tilde{x}^{(\ell-1)}_v$. It is desirable that the variance of all estimators $\zeta^{(\ell)}_v$ be small. With this objective, we define:

$$\zeta = \sum_\ell \sum_{v\in\mathcal{G}_s} \frac{\zeta^{(\ell)}_v}{p_v} = \sum_\ell \sum_{v,u} \frac{\hat{A}_{v,u}}{p_v\,\alpha_{u,v}}\,\tilde{x}^{(\ell-1)}_u\,\mathbb{1}_v\,\mathbb{1}_{u|v} = \sum_\ell \sum_e \frac{\mathbf{b}^{(\ell)}_e}{p_e}\,\mathbb{1}^{(\ell)}_e, \qquad (4.5)$$

where $\mathbb{1}_e = 1$ if $e\in\mathcal{E}_s$ and $\mathbb{1}_e = 0$ if $e\notin\mathcal{E}_s$, and $\mathbb{1}_v = 1$ if $v\in\mathcal{V}_s$ and $\mathbb{1}_v = 0$ if $v\notin\mathcal{V}_s$. The factor $p_v$ in the first equality is present so that $\zeta$ is an unbiased estimator of the sum of all node aggregations at all layers: $\mathbb{E}(\zeta) = \sum_\ell \sum_{v\in\mathcal{V}} \mathbb{E}\big(\zeta^{(\ell)}_v\big)$. Note that $\mathbb{1}^{(\ell)}_e = \mathbb{1}_e$ for all ℓ, since once an edge is present in the sampled graph, it is present in all layers of our GCN. We define the optimal edge sampler as the one minimizing the variance of every dimension of $\zeta$. We restrict ourselves to independent edge sampling: for each $e\in\mathcal{E}$, we make an independent decision on whether it should be in $\mathcal{G}_s$ or not, and the probability of including $e$ is $p_e$. We further constrain $\sum p_e = m$, so that the expected number of sampled edges equals $m$. The budget $m$ is a given sampling parameter.

Theorem 2. Under independent edge sampling with budget $m$, the optimal edge probabilities minimizing the sum of the variances of $\zeta$'s dimensions are given by:

$$p_e = \frac{m}{\sum_{e'} \big\lVert \sum_\ell \mathbf{b}^{(\ell)}_{e'} \big\rVert}\, \Big\lVert \sum_\ell \mathbf{b}^{(\ell)}_e \Big\rVert.$$

Proof. In the following, we use $\operatorname{Cov}(\cdot)$ to denote covariance and $\operatorname{Var}(\cdot)$ to denote variance. For independent edge sampling, $\operatorname{Cov}\big(\mathbb{1}^{(\ell_1)}_{e_1}, \mathbb{1}^{(\ell_2)}_{e_2}\big) = 0$ for all $e_1 \ne e_2$. And for a full GNN on the subgraph, $\operatorname{Cov}\big(\mathbb{1}^{(\ell_1)}_{e}, \mathbb{1}^{(\ell_2)}_{e}\big) = p_e - p_e^2$. To start the proof, we first assume that $\mathbf{b}^{(\ell)}_e$ is one-dimensional (i.e., a scalar) and denote it by $b^{(\ell)}_e$. Now,

$$\begin{aligned}
\operatorname{Var}(\zeta) &= \sum_{e,\ell} \left(\frac{b^{(\ell)}_e}{p_e}\right)^{\!2} \operatorname{Var}\big(\mathbb{1}^{(\ell)}_e\big) + 2\sum_{e,\,\ell_1<\ell_2} \frac{b^{(\ell_1)}_e b^{(\ell_2)}_e}{p_e^2}\,\operatorname{Cov}\big(\mathbb{1}^{(\ell_1)}_e, \mathbb{1}^{(\ell_2)}_e\big) \\
&= \sum_{e,\ell} \frac{\big(b^{(\ell)}_e\big)^2}{p_e} - \sum_{e,\ell} \big(b^{(\ell)}_e\big)^2 + 2\sum_{e,\,\ell_1<\ell_2} \frac{b^{(\ell_1)}_e b^{(\ell_2)}_e}{p_e^2}\,\big(p_e - p_e^2\big) \\
&= \sum_e \frac{\big(\sum_\ell b^{(\ell)}_e\big)^2}{p_e} - \sum_e \Big(\sum_\ell b^{(\ell)}_e\Big)^{\!2}
\end{aligned} \qquad (4.6)$$

Let the given constant $m = \sum_e p_e$ be the expected number of sampled edges. By the Cauchy-Schwarz inequality:

$$\left(\sum_e \frac{\big(\sum_\ell b^{(\ell)}_e\big)^2}{p_e}\right) m = \left(\sum_e \left(\frac{\sum_\ell b^{(\ell)}_e}{\sqrt{p_e}}\right)^{\!2}\right)\left(\sum_e \big(\sqrt{p_e}\big)^2\right) \ge \left(\sum_{e,\ell} b^{(\ell)}_e\right)^{\!2}.$$

The equality is achieved when $\frac{\sum_\ell b^{(\ell)}_e}{\sqrt{p_e}} \propto \sqrt{p_e}$, i.e., the variance is minimized when $p_e \propto \sum_\ell b^{(\ell)}_e$. It directly follows that:

$$p_e = \frac{m}{\sum_{e'} \sum_\ell b^{(\ell)}_{e'}}\, \sum_\ell b^{(\ell)}_e.$$

For the multi-dimensional case of $\mathbf{b}^{(\ell)}_e$, following similar steps as above, it is easy to show that the optimal edge probability to minimize $\sum_i \operatorname{Var}(\zeta_i)$ (where $i$ indexes $\zeta$'s dimensions) is:

$$p_e = \frac{m}{\sum_{e'} \big\lVert \sum_\ell \mathbf{b}^{(\ell)}_{e'} \big\rVert}\, \Big\lVert \sum_\ell \mathbf{b}^{(\ell)}_e \Big\rVert. \qquad \square$$

Note that calculating $\mathbf{b}^{(\ell)}_e$ requires computing $\tilde{x}^{(\ell-1)}_v$, which increases the complexity of sampling. As a reasonable simplification, we ignore $\tilde{x}^{(\ell)}_v$ to make the edge probability depend on the graph topology only. Therefore, we choose $p_e \propto \hat{A}_{v,u} + \hat{A}_{u,v} = \frac{1}{\deg(u)} + \frac{1}{\deg(v)}$. The derived optimal edge sampler agrees with the intuition in Section 4.2.1: if two nodes $u$ and $v$ are connected and have few neighbors, then $u$ and $v$ are likely to be influential to each other. In this case, the edge probability $p_{u,v} = p_{v,u}$ should be high. The above analysis of edge samplers also inspires us to design other samplers, which are presented in Section 4.2.4.

Remark. We can also apply the above edge sampler to perform layer sampling (see Section 4.1.2). Under the independent layer sampling assumption of [18], one would sample a connection $\big(u^{(\ell)}, v^{(\ell+1)}\big)$ with probability $p^{(\ell)}_{u,v} \propto \frac{1}{\deg(u)} + \frac{1}{\deg(v)}$. For simplicity, assume a graph of uniform degree $d$. Then $p^{(\ell)}_e = p$. For an already sampled $u^{(\ell)}$ to connect to layer $\ell+1$, at least one of its edges has to be selected by the layer-$(\ell+1)$ sampler.
Clearly, the probability of an input layer node "surviving" the $L$ independent sampling processes is $\big(1-(1-p)^{d}\big)^{L-1}$. Such a layer sampler potentially returns an overly sparse minibatch for $L > 1$. On the other hand, the connectivity within a GraphSAINT minibatch never drops with GNN depth: if an edge is present in layer ℓ, it is present in all layers.

4.2.4 General Framework

Based on the above variance analysis, we present several light-weight and efficient samplers that GraphSAINT has integrated. Algorithm 2 shows all the sampling algorithms.

Random node sampler. We sample $|\mathcal{V}_s|$ nodes from $\mathcal{V}$ randomly, according to a node probability distribution $P(u) \propto \lVert \hat{A}_{:,u} \rVert^2$. This sampling distribution is inspired by that of the layer sampler of [18].

Random edge sampler. We perform edge sampling as described in Section 4.2.3.

Algorithm 2 Graph sampling algorithms by GraphSAINT
Input: Training graph G(V, E); sampling parameters: node budget n; edge budget m; number of roots r; random walk length h
Output: Sampled graph G_s(V_s, E_s)
 1: function NODE(G, n)  ▷ Node sampler
 2:     P(v) := ‖Â_{:,v}‖² / Σ_{v'∈V} ‖Â_{:,v'}‖²
 3:     V_s ← n nodes randomly sampled (with replacement) from V according to P
 4:     G_s ← node-induced subgraph of G from V_s
 5: end function
 6: function EDGE(G, m)  ▷ Edge sampler (approximate version)
 7:     P((u,v)) := (1/deg(u) + 1/deg(v)) / Σ_{(u',v')∈E} (1/deg(u') + 1/deg(v'))
 8:     E_s ← m edges randomly sampled (with replacement) from E according to P
 9:     V_s ← set of nodes that are end-points of edges in E_s
10:     G_s ← node-induced subgraph of G from V_s
11: end function
12: function RW(G, r, h)  ▷ Random walk sampler
13:     V_root ← r root nodes sampled uniformly at random (with replacement) from V
14:     V_s ← V_root
15:     for v ∈ V_root do
16:         u ← v
17:         for d = 1 to h do
18:             u ← node sampled uniformly at random from u's neighbors
19:             V_s ← V_s ∪ {u}
20:         end for
21:     end for
22:     G_s ← node-induced subgraph of G from V_s
23: end function
24: function MRW(G, n, r)  ▷ Multi-dimensional random walk sampler
25:     V_FS ← r root nodes sampled uniformly at random (with replacement) from V
26:     V_s ← V_FS
27:     for i = r+1 to n do
28:         Select u ∈ V_FS with probability deg(u) / Σ_{v∈V_FS} deg(v)
29:         u' ← node randomly sampled from u's neighbors
30:         V_FS ← (V_FS \ {u}) ∪ {u'}
31:         V_s ← V_s ∪ {u}
32:     end for
33:     G_s ← node-induced subgraph of G from V_s
34: end function

Random walk based samplers. Another way to analyze graph-sampling-based multi-layer GNNs is to ignore the activations. In such a case, the $L$ layers can be represented as a single layer with edge weights given by $B = \hat{A}^L$. Following a similar approach as in Section 4.2.3, if it were possible to pick pairs of nodes (whether or not they are directly connected in the original $\hat{A}$) independently, then we would set $p_{u,v} \propto B_{u,v} + B_{v,u}$, where $B_{u,v}$ can be interpreted as the probability of a random walk starting at $u$ and ending at $v$ in $L$ hops (and $B_{v,u}$ vice-versa). Even though it is not possible to sample a subgraph where such pairs of nodes are independently selected, we still consider a random walk sampler with walk length $L$ to be a good candidate for $L$-layer GNNs. There are numerous random walk based samplers proposed in the literature [72, 56, 40, 61]. In our experiments, we implement a regular random walk sampler (with $r$ root nodes selected uniformly at random, each walker going $h$ hops), and also the multi-dimensional random walk sampler defined in [72]. For all the above samplers, we return the subgraph induced from the sampled nodes. The induction step adds more connections into the subgraph, and empirically helps improve convergence.

4.2.5 Discussions and Extensions

Extensions. GraphSAINT admits two orthogonal extensions. First, GraphSAINT can seamlessly integrate other graph samplers. Second, the idea of training by graph sampling is applicable to many GNN architecture variants: 1. Jumping knowledge [89]: since the GCNs we construct during training are complete, applying skip connections to GraphSAINT is straightforward.
On the other hand, for some layer sampling methods [18, 44], extra modification to their samplers is required, since the jumping knowledge architecture requires the layer-ℓ samples to be a subset of the layer-(ℓ−1) samples.¹ 2. Attention [78, 28, 102]: while explicit variance reduction is hard due to the dynamically updated attention values, it is reasonable to apply attention within the subgraphs, which are considered representatives of the full graph. Our loss and aggregator normalizations are also applicable.² 3. Others: To support high-order layers [113, 54, 6] or even more complicated networks for the task of graph classification [93], we replace the full adjacency matrix A with the (normalized) one of the subgraph, A_s, to perform layer propagation.

¹ The skip-connection design proposed by [44] does not have such a "subset" requirement, and thus is compatible with both graph sampling and layer sampling based methods.
² When applying GraphSAINT to GAT [78], we remove the softmax step which normalizes attention values within the same neighborhood, as suggested by [44].

Comparison. GraphSAINT enjoys: 1. high scalability and efficiency, 2. high accuracy, and 3. low training complexity. Point (1) is due to the significantly reduced neighborhood size compared with [37, 92, 19]. Point (2) is due to the better inter-layer connectivity compared with [18], and the unbiased minibatch estimator compared with [21]. Point (3) is due to the simple and trivially parallelizable pre-processing, compared with the sampling of [44] and the clustering of [21].

4.3 Subgraph Transformation for Computation Redundancy Reduction

4.3.1 Redundancy in Neighbor Aggregation

[Figure 4.2 shows three panels: (a) the original subgraph G_s on nodes 0-3; (b) the aggregation graph G_a and matching M_a; (c) the transformed subgraph G'_s.]

Figure 4.2: Example on reducing redundancy in feature propagation, using a graph theoretical approach. (a): the original subgraph. (b): the (weighted) aggregation graph and the maximum weight matching.
(c): the transformed subgraph with virtual nodes representing aggregated common neighbors.

While subgraph-based minibatch training has already resolved "neighborhood explosion", redundancy can be further reduced in subgraph-level neighbor propagation. To see the existence of such redundancy, take Figure 4.2a as an example. Suppose the graph sampler returns a subgraph G_s as the minibatch. In each GNN layer, nodes 0, 1, 2 and 3 aggregate neighbor features, with node 0 calculating $\frac{1}{3}(x_1+x_2+x_3)$, node 1 calculating $\frac{1}{2}(x_0+x_2)$, node 2 calculating $\frac{1}{3}(x_0+x_1+x_3)$ and node 3 calculating $\frac{1}{2}(x_0+x_2)$ (omitting the superscript (ℓ) for simplicity). This corresponds to 6 vector additions and 10 vector reads in total. Observe that the vector pair $(x_0, x_2)$ appears in the aggregation of both nodes 1 and 3. Similarly, the pair $(x_1, x_3)$ is aggregated by both nodes 0 and 2. Thus, we can perform pre-processing to compute the partial sums $x_0+x_2$ and $x_1+x_3$. The aggregation of the four nodes after pre-processing then requires only 2 additions and 4 reads.

4.3.2 Reconstructing the Subgraph from the Aggregation Graph

In general, even accounting for the pre-processing cost, we can still significantly reduce both the computation and the communication of feature aggregation. The key to reducing redundancy is to identify sets of nodes appearing frequently in the neighbor lists of v ∈ V_s. To simplify the problem, we aim at finding such sets of size 2, i.e., we identify common pairs of neighbors. We first construct an undirected aggregation graph G_a from G_s by Algorithm 3. Each edge (u,w) in G_a represents potential vector sum operations between u and w. The attribute of (u,w) consists of a set of nodes {v_1, ..., v_n}, meaning that the neighbor list of each v_i (1 ≤ i ≤ n) contains both u and w. The weight of an edge is simply the size of its attribute set (we remove edges of weight 1). In Algorithm 3, W_a and D_a can be implemented as hash tables.
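A minimal sketch of Algorithm 3's pair counting, using Python dicts as the hash tables and applied to the Figure 4.2 example (neighbor lists as in Section 4.3.1):

```python
from collections import defaultdict
from itertools import combinations

def build_aggregation_graph(neigh):
    """Count, for every unordered pair (u, w), how many neighbor lists contain
    both (the edge weight W), and record which nodes those are (the attribute
    D). Weight-1 pairs carry no reusable partial sum and are dropped."""
    W, D = defaultdict(int), defaultdict(set)
    for v, ns in neigh.items():
        for u, w in combinations(sorted(ns), 2):
            W[(u, w)] += 1
            D[(u, w)].add(v)
    for key in [k for k, c in W.items() if c < 2]:
        del W[key]
        del D[key]
    return dict(W), dict(D)

# Neighbor lists of the Figure 4.2a example subgraph.
neigh = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
W, D = build_aggregation_graph(neigh)
```

On this example the surviving weight-2 edges are exactly (0, 2) and (1, 3), matching the aggregation graph of Figure 4.2b.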
W_a[(u,w)] (or D_a[(u,w)]) returns the value corresponding to the key (u,w). Figure 4.2b shows the aggregation graph G_a for our example above, where both edges have weight 2. Intuitively, the next step upon obtaining G_a is to extract the edges with the largest weights, so that pre-computing the corresponding vector sums removes the most redundancy. However, there is one subtlety to notice. Suppose two edges of G_a, (u_i, u_j) and (u_j, u_k), have large weights, and {v_1, ..., v_m}, the intersection of their attributes, is non-empty. Consider the aggregation of nodes {v_1, ..., v_m} after pre-computing x' = x_i + x_j and x'' = x_j + x_k. By replacing the pair (x_i, x_j) with x', the other pair (x_j, x_k) disappears from the aggregation, so x'' does not help reduce redundancy. To avoid such useless pre-computation of x'', a good solution is to find a maximum weight matching of G_a, so that the selected pairs imply high redundancy and share no common nodes.

Theorem 3. For the feature aggregation of each layer, the number of vector reads and additions can decrease by at least $\sum_{e\in M^*_a}(W_a[e]-2)$ and $\sum_{e\in M^*_a}(W_a[e]-1)$ respectively, where $M^*_a$ is the maximum weight matching of $\mathcal{G}_a$.

Proof. We first consider the reduction in the number of reads.
Since edges in a matching are disjoint, for each (u,v) ∈ M*_a, accessing the precomputed x_u + x_v instead of x_u and x_v saves (2−1)·W_a[(u,v)] reads. The pre-computation of x_u + x_v itself consumes 2 reads. In sum, the total reduction is $\sum_{e\in M^*_a}(W_a[e]-2)$. The proof of the reduction in the number of additions can be derived similarly. □

Algorithm 3 Construction of the aggregation graph G_a
Input: Subgraph G_s(V_s, E_s)
Output: Aggregation graph G_a
 1: V_a ← V_s
 2: W_a ← ∅  ▷ Key-value pairs mapping edges to edge weights
 3: D_a ← ∅  ▷ Key-value pairs mapping edges to edge attributes
 4: for v ∈ V_a do
 5:     for u ∈ neigh(v) do
 6:         for w ∈ neigh(v) \ {u} do
 7:             if key (u,w) not in W_a then
 8:                 Add key-value pair ((u,w), 1) to W_a
 9:             else
10:                 W_a[(u,w)] ← W_a[(u,w)] + 1
11:             end if
12:             if key (u,w) not in D_a then
13:                 Add key-value pair ((u,w), {v}) to D_a
14:             else
15:                 D_a[(u,w)] ← D_a[(u,w)] ∪ {v}
16:             end if
17:         end for
18:     end for
19: end for
20: Remove key-value pairs of weight-1 edges from W_a and D_a
21: G_a ← undirected graph based on W_a and D_a

Note that Theorem 3 considers the total cost, including any pre-computation cost. Also, although a polynomial-complexity algorithm [66] exists for computing the maximum weight matching, it is still too expensive in practice. Thus, we propose a greedy approach (Algorithm 4) to quickly compute a good matching M_a. The next step after obtaining M_a is to update the original G_s to G'_s (following Algorithm 5), so that the pre-computed sums can propagate in G'_s. The idea is to merge each pair of nodes in M_a into a new node, whose feature vector is returned by pre-computation.
Compared with $\mathcal{G}_s$, the updated $\mathcal{G}'_s$ has more nodes ($|\mathcal{V}'_s| = |\mathcal{V}_s| + |M_a|$) but fewer edges. Note that $\mathcal{G}'_s$ is directed, even when $\mathcal{G}_s$ is undirected. The example $\mathcal{G}'_s$ is shown in Figure 4.2c.

Algorithm 4 Greedy approach to find a good matching $M_a$
Input: Aggregation graph $\mathcal{G}_a$; threshold $\theta$
Output: Matching $M_a$
1: $M_a \leftarrow \emptyset$
2: $H \leftarrow$ max-heap of edges $\mathcal{E}_a$, ordered by edge weights
3: $S \leftarrow$ length-$|\mathcal{V}_a|$ vector of values True
4: while $\mathrm{maxWeight}(H) > \theta$ do
5:   $(u,v) \leftarrow \mathrm{extractMaxEdge}(H)$
6:   if $\neg(S_u \wedge S_v)$ then  ▷ $(u,v)$ violates edge disjointness
7:     continue
8:   end if
9:   $S_u \leftarrow$ False; $S_v \leftarrow$ False
10:  $M_a \leftarrow M_a \cup \{(u,v)\}$
11: end while

Algorithm 5 Construction of the updated subgraph $\mathcal{G}'_s$
Input: Original subgraph $\mathcal{G}_s$; matching $M_a$; edge attributes $D_a$
Output: Updated (directed) subgraph $\mathcal{G}'_s(\mathcal{V}'_s, \mathcal{E}'_s)$
1: $\mathcal{V}'_s \leftarrow \mathcal{V}_s$; $\mathcal{E}'_s \leftarrow \mathcal{E}_s$
2: for $(u,v) \in M_a$ do
3:   Assign a new node $v'$ corresponding to the edge $(u,v)$
4:   $\mathcal{V}'_s \leftarrow \mathcal{V}'_s \cup \{v'\}$
5:   for $w \in D_a[(u,v)]$ do
6:     Remove $(u,w)$ from $\mathcal{E}'_s$
7:     Replace $(v,w)$ with $(v',w)$ in $\mathcal{E}'_s$
8:   end for
9: end for

4.3.3 Iterative Redundancy Reduction

After obtaining $\mathcal{G}'_s$, redundancy still exists when aggregating features of $\mathcal{G}'_s$. Viewing $\mathcal{G}'_s$ as the new $\mathcal{G}_s$, we can again apply Algorithms 3, 4 and 5 to obtain a $\mathcal{G}''_s$. This process can continue until few edges can be reduced (i.e., the matching is small). Note that although the training graph (and thus $\mathcal{G}_s$) is often undirected, Algorithms 3, 4 and 5 still apply when $\mathcal{G}_s$ is directed. Therefore, iterative redundancy reduction is feasible.

For the sake of notation, define one round of redundancy reduction as one invocation of Algorithms 3, 4 and 5. Define the subgraph output by the last round as $\mathcal{G}^{\#}_s$, and its adjacency matrix as $A^{\#}_s$. Define the set of matchings in all rounds as $\mathbb{M}_a = \{M_a, M'_a, M''_a, \dots\}$.
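Algorithm 4 can likewise be sketched with the standard-library heap (a max-heap emulated by negated weights). The dictionary form of $W_a$ is an assumption carried over from the sketch of Algorithm 3:

```python
import heapq

def greedy_matching(W_a, theta):
    """Sketch of Algorithm 4: repeatedly extract the heaviest remaining
    edge of G_a and keep it only if both endpoints are still unmatched.

    W_a: dict mapping edges (u, v) to weights; theta: weight threshold.
    """
    heap = [(-wt, e) for e, wt in W_a.items()]  # negate weights: max-heap
    heapq.heapify(heap)
    matched, M_a = set(), []
    while heap and -heap[0][0] > theta:         # maxWeight(H) > theta
        _, (u, v) = heapq.heappop(heap)
        if u in matched or v in matched:        # violates disjointness
            continue
        matched.update((u, v))
        M_a.append((u, v))
    return M_a
```

With the threshold $\theta = 2$ used in Section 4.4.6, only edges of weight at least 3 are kept, i.e., exactly those that Theorem 3 guarantees to save at least one read each.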
Define $\gamma_{\mathrm{add}} := \big(\mathrm{numAdd}(\mathcal{G}^{\#}_s) + \sum_{M\in\mathbb{M}_a}|M|\big) \,/\, \mathrm{numAdd}(\mathcal{G}_s)$ as the redundancy reduction rate for additions, where $\mathrm{numAdd}(\mathcal{G}) = \sum_{v\in\mathcal{V},\,\deg(v)\geq 1}(\deg(v)-1)$ denotes the number of additions to aggregate by traversing $\mathcal{G}$'s neighbor lists. Define $\gamma_{\mathrm{read}} := \big(\mathrm{numRead}(\mathcal{G}^{\#}_s) + \sum_{M\in\mathbb{M}_a}2|M|\big) \,/\, \mathrm{numRead}(\mathcal{G}_s)$ as the redundancy reduction rate for memory reads, where $\mathrm{numRead}(\mathcal{G}) = \sum_{v\in\mathcal{V}}\deg(v) = \bar{d}\cdot|\mathcal{V}|$ denotes the number of reads to aggregate features.

4.3.4 Complexity Analysis

The complexity of Algorithms 3, 4 and 5 is low compared with that of feature aggregation. The complexity of Algorithm 3 is $O(|\mathcal{E}_a|) = O\big(\sum_{v\in\mathcal{V}_s}\deg(v)^2\big)$. Although in the worst case $O(|\mathcal{E}_a|) = O(|\mathcal{V}_s|\cdot d_{\max}^2)$, for graphs in practice we may assume $O(|\mathcal{E}_a|) = O(|\mathcal{V}_s|\cdot \bar{d}^{\,2}) = O(|\mathcal{E}_s|\cdot \bar{d})$, where $d_{\max}$ and $\bar{d}$ are the maximum and average degree of $\mathcal{G}_s$. The complexity of Algorithm 4 is $O(|\mathcal{E}_a| + N\log|\mathcal{E}_a|)$, where the first term accounts for line 2 and the second term for the loop starting at line 4. The number of times a max-weight edge is extracted from the heap, $N$, depends on the threshold $\theta$; typically $N \ll |\mathcal{E}_a|$, $|\mathcal{V}_s| < 5000$, and $\bar{d} < 20$. The overall complexity of Algorithms 3 and 4 is thus much less than the complexity of a single layer of feature aggregation (i.e., $O(|\mathcal{E}_s|\cdot f)$, where the feature length $f$ is on the order of $10^2$ to $10^3$). For Algorithm 5, lines 6 and 7 each take at most $\deg(w)$ operations if we simply scan the neighbor list of $w$. In return, we save one vector sum operation for $u$ and $v$. Considering $\bar{d} \ll f$, the overhead of Algorithm 5 is negligible compared with the benefit of redundancy reduction.

It is worth noting that 1. a one-time transformation from $\mathcal{G}_s$ to $\mathcal{G}'_s$ benefits $3L$ feature aggregations in an $L$-layer GNN, and 2. such a transformation, which involves only integer operations, reduces floating point arithmetic during feature aggregation. These two observations further justify the cost of Algorithms 3, 4 and 5.

4.4 Experiments

4.4.1 Datasets and Learning Tasks

Experiments are under the inductive and supervised learning setting.
We evaluate GraphSAINT on the following tasks: 1. classifying protein functions based on the interactions of human tissue proteins (PPI), 2. categorizing types of images based on the descriptions and common properties of online images (Flickr), 3. predicting communities of online posts based on user comments (Reddit), 4. categorizing types of businesses based on customer reviews and friendships (Yelp), and 5. classifying product categories based on buyer reviews and interactions (Amazon). For PPI, we use the small version for the two-layer convergence comparison (Table 4.2 and Figure 4.4), since GraphSAGE [37] and VRGCN [19] report accuracy for this version in their original papers. We use the large version for the additional comparison with [21] to be consistent with its reported accuracy. All datasets follow "fixed-partition" splits. Figure 4.3 shows the degree distribution of the five graphs. A point $(k, p)$ in the plot means the probability of a node having degree at least $k$ is $p$.

Table 4.1: Specification of the graph datasets ("m" stands for multi-class classification, and "s" for single-class.)

Dataset              Nodes      Edges        Degree  Feature  Classes  Train / Val / Test
PPI                  14,755     225,270      15      50       121 (m)  0.66 / 0.12 / 0.22
Flickr               89,250     899,756      10      500      7 (s)    0.50 / 0.25 / 0.25
Reddit               232,965    11,606,919   50      602      41 (s)   0.66 / 0.10 / 0.24
Yelp                 716,847    6,977,410    10      300      100 (m)  0.75 / 0.10 / 0.15
Amazon               1,598,960  132,169,734  83      200      107 (m)  0.85 / 0.05 / 0.10
PPI (large version)  56,944     818,716      14      50       121 (m)  0.79 / 0.11 / 0.10
OGB-product          2,449,029  61,859,140   25      100      47 (s)   0.10 / 0.02 / 0.88

Figure 4.3: Degree distribution of the evaluated graphs

We open source GraphSAINT³. We compare with six baselines: 1. vanilla GCN [50], 2. GraphSAGE [37], 3. FastGCN [18], 4. VRGCN [19], 5. AS-GCN [44], and 6. ClusterGCN [21]. All baselines are executed with their officially released code.

³ Open-sourced code: https://github.com/GraphSAINT/GraphSAINT
Baselines and GraphSAINT are all implemented in TensorFlow with Python 3. We run our experiments on a single machine with dual Intel Xeon CPUs (E5-2698 v4 @ 2.2GHz), one NVIDIA Tesla P100 GPU (16GB of HBM2 memory) and 512GB DDR4 memory. The code is written in Python 3.6.8 (where the sampling part is written in Cython 0.29.2). We use TensorFlow 1.12.0 on CUDA 9.2 with cuDNN 7.2.1 to train the model on GPU. Since the subgraphs are sampled independently, we run the sampler in parallel on 40 CPU cores.

4.4.2 Accuracy and Convergence Speed

Table 4.2 and Figure 4.4 show the accuracy and convergence comparison of the various methods. All results correspond to two-layer GNN models (for GraphSAGE, we use its mean aggregator). For a given dataset, we keep the hidden dimension the same across all methods. The mean and confidence interval of the accuracy values in Table 4.2 are measured over three runs under the same hyperparameters. The training time in Figure 4.4 excludes the time for data loading, pre-processing, validation set evaluation and model saving. Our pre-processing incurs little overhead in training time; see Section 4.4.3 for the cost of graph sampling. For GraphSAINT, we implement the graph samplers described in Section 4.2.4. In Table 4.2, "Node" stands for the random node sampler; "Edge" stands for the random edge sampler; "RW" stands for the random walk sampler; "MRW" stands for the multi-dimensional random walk sampler.
Table 4.2: Comparison on test set F1-micro score between variants of GraphSAINT (4 sampling algorithms) and the state-of-the-art methods

Method           PPI            Flickr         Reddit         Yelp           Amazon
GCN              0.515 ± 0.006  0.492 ± 0.003  0.933 ± 0.000  0.378 ± 0.001  0.281 ± 0.005
GraphSAGE        0.637 ± 0.006  0.501 ± 0.013  0.953 ± 0.001  0.634 ± 0.006  0.758 ± 0.002
FastGCN          0.513 ± 0.032  0.504 ± 0.001  0.924 ± 0.001  0.265 ± 0.053  0.174 ± 0.021
VRGCN            0.963 ± 0.010  0.482 ± 0.003  0.964 ± 0.001  0.640 ± 0.002  —³
AS-GCN           0.687 ± 0.012  0.504 ± 0.002  0.958 ± 0.001  —³             —³
ClusterGCN       0.875 ± 0.004  0.481 ± 0.005  0.954 ± 0.001  0.609 ± 0.005  0.759 ± 0.008
GraphSAINT-Node  0.960 ± 0.001  0.507 ± 0.001  0.962 ± 0.001  0.641 ± 0.000  0.782 ± 0.004
GraphSAINT-Edge  0.981 ± 0.007  0.510 ± 0.002  0.966 ± 0.001  0.653 ± 0.003  0.807 ± 0.001
GraphSAINT-RW    0.981 ± 0.004  0.511 ± 0.001  0.966 ± 0.001  0.653 ± 0.003  0.815 ± 0.001
GraphSAINT-MRW   0.980 ± 0.006  0.510 ± 0.001  0.964 ± 0.000  0.652 ± 0.001  0.809 ± 0.001

³ The code throws a runtime error on the large datasets (Yelp or Amazon).

Table 4.3: Comparison with ClusterGCN on test set F1-micro score (the depth and hidden dimensions follow the configurations of the original ClusterGCN paper)

             PPI (large version)             Reddit
             2 × 512        5 × 2048        2 × 128        4 × 128
ClusterGCN   0.903 ± 0.002  0.994 ± 0.000  0.954 ± 0.001  0.966 ± 0.001
GraphSAINT   0.941 ± 0.003  0.995 ± 0.000  0.966 ± 0.001  0.970 ± 0.001

Clearly, with appropriate graph samplers, GraphSAINT achieves significantly higher accuracy on all datasets. For GraphSAINT-Node, we use the same node probability as FastGCN; thus, the accuracy improvement is mainly due to the switch from layer sampling to graph sampling (see "Remark" in Section 4.2.3). Compared with AS-GCN, GraphSAINT is significantly faster. The
sampler of AS-GCN is expensive to execute, making its overall training time even longer than that of vanilla GCN. For VRGCN on Reddit, it achieves similar accuracy to GraphSAINT, at the cost of over 9× longer training time. The released code of FastGCN only supports CPU execution, so its convergence curve is dashed.

Figure 4.4: Convergence curves of 2-layer models on GraphSAINT and baselines (averaged over 3 runs; validation F1-micro vs. training time in seconds on PPI, Flickr, Reddit, Yelp and Amazon; *: CPU execution time for FastGCN)

Table 4.3 presents an additional comparison with ClusterGCN. We use $L\times f$ to specify the architecture, where $L$ and $f$ denote GCN depth and hidden dimension, respectively. The four architectures are the ones used in the original paper [21]. Again, GraphSAINT achieves significant accuracy improvement. Training models with $L > 2$ often requires additional architectural tweaks. ClusterGCN uses its diagonal enhancement technique for the 5-layer PPI and 4-layer Reddit models. GraphSAINT uses the jumping knowledge connection [89] for 4-layer Reddit.

4.4.3 Cost of Sampling and Pre-Processing

Cost of graph samplers of GraphSAINT. Graph sampling introduces little training overhead. Let $t_s$ be the average time to sample one subgraph on a multi-core machine. Let $t_t$ be the average time to perform the forward and backward propagation on one minibatch on GPU. Figure 4.6 shows the ratio $t_s/t_t$ for the various datasets. The parameters of the samplers are the same as in Table 4.2. For the Node, Edge and RW samplers, we observe that the time to sample one subgraph is in most cases less than 25% of the training time. The MRW sampler is more expensive to execute.
Regarding the complete pre-processing procedure, we repeatedly run the sampler $N = 50\cdot|\mathcal{V}|\,/\,\overline{|\mathcal{V}_s|}$ times before training to estimate the node and edge probabilities as discussed in Section 4.2.2 (where $\overline{|\mathcal{V}_s|}$ is the average subgraph size). These sampled subgraphs are reused as training minibatches. Thus, if training runs for more than $N$ iterations, the pre-processing is nearly zero-cost. Under the setting of Table 4.2, pre-processing on PPI, Yelp and Amazon does not incur any overhead in training time. Pre-processing on Flickr and Reddit (with the RW sampler) takes less than 40% and 15% of their corresponding total training time, respectively.

Cost of the layer sampler of AS-GCN. AS-GCN uses an additional neural network to estimate the conditional sampling probability for the previous layer. For a node $v$ already sampled in layer $\ell$, the layer-$(\ell-1)$ features of all of $v$'s neighbors need to be fed to the sampling neural network to obtain the node probabilities. For the sake of analysis, assume the sampling network is a single-layer MLP, whose weight $W_{\mathrm{MLP}}$ has the same shape as the GCN weight $W^{(\ell)}$. Then we can show that, for an $L$-layer GCN on a degree-$\bar{d}$ graph, the per-epoch training complexity of AS-GCN is approximately $\gamma = (\bar{d}\cdot L)\,/\,\sum_{\ell=0}^{L-1}\bar{d}^{\,\ell}$ times that of vanilla GCN. For $L = 2$, we have $\gamma \approx 2$. This explains the observation that AS-GCN is slower than vanilla GCN in Figure 4.4. Additionally, Table 4.4 shows the training time breakdown for AS-GCN. Clearly, its sampler is much more expensive than the graph samplers of GraphSAINT.
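As a sanity check, the ratio $\gamma$ can be evaluated numerically; the helper below is a back-of-envelope illustration, not part of the released code:

```python
def asgcn_epoch_ratio(d, L):
    """Approximate per-epoch cost of AS-GCN relative to vanilla GCN:
    gamma = (d * L) / sum_{l=0}^{L-1} d**l, with d the average degree."""
    return d * L / sum(d ** l for l in range(L))

# For L = 2, gamma = 2d / (1 + d), which is close to 2 for realistic degrees.
```

For instance, with $\bar{d} = 20$ and $L = 2$, the ratio is $40/21 \approx 1.9$, matching the $\gamma \approx 2$ estimate above.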
Table 4.4: Per-epoch training time breakdown for AS-GCN

Dataset  Sampling time (sec)  Forward / backward propagation time (sec)
PPI      1.1                  0.2
Flickr   5.3                  1.1
Reddit   20.7                 3.5

Figure 4.5: Comparison of training efficiency when increasing the GNN depth (GraphSAINT: linear complexity; VRGCN: exponential complexity due to neighborhood explosion; normalized training time vs. GNN depth on Reddit and Yelp)

Figure 4.6: Cost of various subgraph sampling algorithms (Node, Edge, RW, MRW) measured by the fraction of total training time. The expensive MRW algorithm can be accelerated by our parallelization techniques in Section 6.3

Cost of clustering of ClusterGCN. ClusterGCN uses the highly optimized METIS software⁴ to perform clustering. Table 4.5 summarizes the time to obtain the clusters for the five graphs. On the large and dense Amazon graph, the cost of clustering increases dramatically. The pre-processing time of ClusterGCN on Amazon is more than 4× the total training time. On the other hand, the sampling cost of GraphSAINT does not increase significantly for large graphs (see Figure 4.6).

⁴ http://glaros.dtc.umn.edu/gkhome/metis/metis/download

Table 4.5: Clustering time of ClusterGCN via executing the METIS algorithm

            PPI  Flickr  Reddit  Yelp   Amazon
Time (sec)  2.2  11.6    40.0    106.7  2254.2

Taking the pre-processing time, sampling time and training time altogether into account, we summarize the total convergence time of GraphSAINT and ClusterGCN in Table 4.6 (corresponding to the configuration of Table 4.2). On graphs that are large and dense (e.g., Amazon), GraphSAINT achieves significantly faster convergence. Note that both the sampling of GraphSAINT and the clustering of ClusterGCN can be performed offline.
Table 4.6: Comparison between GraphSAINT and ClusterGCN on total convergence time (pre-processing + sampling + training; unit: second)

                 PPI    Flickr  Reddit  Yelp   Amazon
GraphSAINT-Edge  91.0   7.0     16.6    273.9  401.0
GraphSAINT-RW    103.6  7.5     17.2    310.1  425.6
ClusterGCN       163.2  12.9    55.3    256.0  2804.8

4.4.4 Effect of Graph Sampling Algorithms

From Table 4.2, the random edge and random walk based samplers achieve higher accuracy than the random node sampler. Figure 4.7 presents a sensitivity analysis on the parameters of "RW". We use the same hyperparameters (except the sampling parameters) and network architecture as those of the "RW" entries in Table 4.2. We fix the length of each walk to 2 (i.e., the GCN depth), and vary the number of roots $r$ from 250 to 2250. For PPI, increasing $r$ from 250 to 750 significantly improves accuracy. Overall, for all datasets, accuracy stabilizes beyond $r = 750$.

4.4.5 GNN Architecture Variants and Deeper Models

In Figure 4.8, we train a 2-layer and a 4-layer model of GAT [78] and JK-net [89], using minibatches of GraphSAGE and GraphSAINT respectively. On the two 4-layer architectures, GraphSAINT achieves two orders of magnitude speedup over GraphSAGE, indicating much better scalability on deep models. From the accuracy perspective, 4-layer GAT-SAGE and JK-SAGE do not outperform the corresponding 2-layer versions, potentially due to the smoothing effect caused by the massive neighborhood size. On the other hand, with minibatches returned by our edge sampler, increasing the model depth of JK-SAINT leads to noticeable accuracy improvement (from 0.966 at 2 layers to 0.970 at 4 layers).
54 0 500 1,000 1,500 2,000 2,500 0.4 0.6 0.8 1 Number of walkers Test F1-micro PPI Flickr Reddit Yelp Amazon Figure 4.7: Sensitivity of test accuracy on sampling parameters (number of walkers for the random walk sampler) 4.4.6 Subgraph-level Redundancy Reduction After transformingG s toG # s following Section 4.3.3, redundancy in computation and communica- tion are reduced at the cost of extra memory consumption to store the vector sums ofM a . Figure 4.9 shows such tradeoff, as well as the effectiveness of redundancy reduction. Each “× ” marker corresponds to one round of redundancy reduction, and we run for 5 rounds (θ = 2, Algorithm 4). We observe 1. number of reads and additions can be reduced to as low as 60%, at the cost of a buffer less than 2|V s |, 2. amount of redundancy depends on the graph topology – while our algorithm is effective on many graphs (e.g., PPI, Yelp), exceptions exist (e.g., Reddit). By Figure 4.9, in our implementation, we set a budgetB such that P M∈Ma |M|L. A normal GNN is closely related to a SHADOW-GNN . Under the normal setup, an L-layer GNN operates on the fullG and propagates the influence from all the neighbors up to L hops away fromv. Such a GNN is equivalent to a model whereEXTRACT returns the fullL-hop subgraph and L ′ =L. We theoretical demonstrate how SHADOW-GNN improves expressivity from three different angles. On SHADOW-GCN (Section 5.2.2), we come from the graph signal processing perspec- tive. The GCN propagation can be interpreted as applying filtering on the node signals [84]. Deep models correspond to low-pass filters. Filtering the local graph G [v] preserves richer information than the globalG. On SHADOW-SAGE (Section 5.2.3), we view the GNN as a function approxi- mator. We construct a target function and study how decoupling reduces the approximation error. On SHADOW-GIN (Section 5.2.4), we focus on learning topological information. 
We show that decoupling helps capture local graph structure that the 1-dimensional Weisfeiler-Lehman test fails to capture.

5.2.2 Expressivity from the Graph Signal Processing Perspective

Recall the oversmoothing phenomenon for GCN discussed in Section 5.1.1. Suppose the original features $X$ reside in a high-dimensional space $\mathbb{R}^{|\mathcal{V}|\times d}$. Oversmoothing pushes $X$ towards a low-dimensional subspace $\mathbb{R}^{|\mathcal{V}|\times d'}$, where $d' < d$. The corresponding analysis comes from two perspectives: oversmoothing by a deep GCN, and oversmoothing by repeated GCN-style propagation. The former considers the full neural network with non-linear activation, weights and biases. The latter characterizes the aggregation matrix $M = \lim_{L\to\infty}\widetilde{A}^L X$. It has been shown that even with the vanilla architecture, a deep GCN with bias parameters does not oversmooth [43]. In addition, various tricks [108, 73, 20] can prevent oversmoothing from the neural network perspective. However, a deep GCN still suffers from accuracy drop, indicating that the GCN-style propagation (rather than other GCN components such as activation and bias) may be the fundamental cause of the difficulty in learning. Therefore, we study the asymptotic behavior of the aggregation matrix $M$ under the normal and SHADOW designs. In other words, here in Section 5.2.2, we ignore the non-linear activation and bias parameters. Such a setup is consistent with much of the existing literature, such as [60, 63, 20, 108].

Proposition 4. An infinite number of feature propagations by SHADOW-GCN leads to

$m_{[v]} = \big[e_{[v]}\big]_v \cdot e_{[v]}^T X_{[v]}$  (5.2)

where $e_{[v]}$ is defined by $\big[e_{[v]}\big]_u = \sqrt{\delta_{[v]}(u)\,/\,\sum_{w\in\mathcal{V}_{[v]}}\delta_{[v]}(w)}$, and $\delta_{[v]}(u)$ returns the degree of $u$ in $\mathcal{G}_{[v]}$ plus 1.

Proof. The GCN model performs symmetric normalization on the adjacency matrix. SHADOW-GCN normalizes the subgraph adjacency matrix in the same way:

$\widetilde{A}_{[v]} = \big(D_{[v]} + I_{[v]}\big)^{-\frac{1}{2}} \cdot \big(A_{[v]} + I_{[v]}\big) \cdot \big(D_{[v]} + I_{[v]}\big)^{-\frac{1}{2}}$  (5.3)

where $A_{[v]}\in\mathbb{R}^{N\times N}$ is the binary adjacency matrix of $\mathcal{G}_{[v]}$.
$\widetilde{A}_{[v]}$ is a real symmetric matrix and has a largest eigenvalue of 1. Since EXTRACT ensures the subgraph $\mathcal{G}_{[v]}$ is connected, the multiplicity of the largest eigenvalue is 1. By Theorem 1 of [43], we bound the eigenvalues $\lambda_i$ by $1 = \lambda_1 > \lambda_2 \geq \dots \geq \lambda_N > -1$. Performing eigendecomposition on $\widetilde{A}_{[v]}$, we have

$\widetilde{A}_{[v]} = E_{[v]} \Lambda E_{[v]}^{-1} = E_{[v]} \Lambda E_{[v]}^T$  (5.4)

where $\Lambda$ is a diagonal matrix with $\Lambda_{i,i} = \lambda_i$ and the matrix $E_{[v]}$ consists of all the normalized eigenvectors. We have:

$\widetilde{A}_{[v]}^L = E_{[v]} \Lambda^L E_{[v]}^T$  (5.5)

Since $|\lambda_i| < 1$ for $i \neq 1$, we have $\lim_{L\to\infty}\widetilde{A}_{[v]}^L = e_{[v]} e_{[v]}^T$, where $e_{[v]}$ is the eigenvector corresponding to $\lambda_1$. It is easy to see that $\big[e_{[v]}\big]_u \propto \sqrt{\delta_{[v]}(u)}$ [43]. After normalization, $\big[e_{[v]}\big]_u = \sqrt{\delta_{[v]}(u)\,/\,\sum_{w\in\mathcal{V}_{[v]}}\delta_{[v]}(w)}$. It directly follows that $m_{[v]} = \big[e_{[v]}\big]_v \cdot e_{[v]}^T X_{[v]}$, with the value of $e_{[v]}$ defined above.

Oversmoothing by normal GCN propagation. With a large enough $L$, the full $L$-hop neighborhood becomes $\mathcal{V}$ (assuming a connected $\mathcal{G}$). So for all $u, v$, we have $\mathcal{G}_{[u]} = \mathcal{G}_{[v]} = \mathcal{G}$, implying $e_{[u]} = e_{[v]}$ and $X_{[u]} = X_{[v]} = X$. From Proposition 4, the aggregation converges to a point where no feature information and little structural information of the target is preserved. The only information in $m_{[v]}$ is $v$'s degree.

Local smoothing by SHADOW-GCN propagation. With a fixed subgraph, no matter how many times we aggregate using $\widetilde{A}_{[v]}$, the layers will not include faraway irrelevant nodes. From Proposition 4, $m_{[v]}$ is a linear combination of the neighbor features $X_{[v]}$. Increasing the number of layers only pushes the coefficients of each neighbor's features to their stationary values. The domain $X_{[v]}$ of such a linear transformation is solely determined by EXTRACT and is independent of the model depth. Intuitively, if EXTRACT picks non-identical subgraphs for two nodes $u$ and $v$, the aggregations should differ due to the different domains of the linear transformation. Therefore, SHADOW-GCN preserves local feature information whereas a normal GCN preserves none.
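Proposition 4 can be verified numerically on a toy subgraph. The NumPy sketch below (illustrative only; not part of the released code) measures how far $\widetilde{A}_{[v]}^L$ is from the rank-1 limit $e_{[v]} e_{[v]}^T$:

```python
import numpy as np

def shadow_gcn_limit_gap(A, L=200):
    """Max-abs gap between A_tilde^L and the limit e e^T of Proposition 4.

    A: binary adjacency matrix of a connected subgraph G_[v].
    A_tilde is the symmetrically normalized adjacency with self-loops,
    and e_u is proportional to sqrt(deg(u) + 1)."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                       # add self-loops
    delta = A_hat.sum(axis=1)                   # delta(u) = deg(u) + 1
    d_is = 1.0 / np.sqrt(delta)
    A_tilde = d_is[:, None] * A_hat * d_is[None, :]
    e = np.sqrt(delta / delta.sum())            # eigenvector for lambda_1 = 1
    return np.abs(np.linalg.matrix_power(A_tilde, L) - np.outer(e, e)).max()
```

For a small connected subgraph (e.g., a 4-node path), the gap shrinks as $L$ grows, confirming that deep propagation retains only $e_{[v]}$, i.e., only the subgraph's node set and degree information.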
For the structural information in $m_{[v]}$, note that $e_{[v]}$ is a normalized degree distribution of the subgraph around $v$, and $\big[e_{[v]}\big]_v$ indicates the role of the target node in the subgraph. By simply letting EXTRACT return the 1-hop subgraph, $\big[e_{[v]}\big]_v$ alone already contains all the information preserved by a normal GCN, which is $v$'s degree in $\mathcal{G}$. For a general EXTRACT, $e_{[v]}$ additionally reflects $v$'s ego-net structure. Thus, a deep SHADOW-GCN preserves more structural information than a deep GCN.

Theorem 5. Let $\overline{m}_{[v]} = \phi_{\mathcal{G}}(v)\cdot m_{[v]}$, where $\phi_{\mathcal{G}}$ is any non-zero function depending only on the structural properties of $v$. Let $\mathcal{M} = \{\overline{m}_{[v]} \mid v\in\mathcal{V}\}$. Given $\mathcal{G}$, EXTRACT and some continuous probability distribution in $\mathbb{R}^{|\mathcal{V}|\times d}$ to generate $X$, then $\overline{m}_{[v]} \neq \overline{m}_{[u]}$ if $\mathcal{V}_{[u]} \neq \mathcal{V}_{[v]}$, almost surely.

Proof. We first prove the case of $\overline{m}_{[v]} = m_{[v]}$, i.e., $\phi_{\mathcal{G}}(v) = 1$. According to Proposition 4, the aggregation for each target node equals $m_{[v]} = \big[e_{[v]}\big]_v e_{[v]}^T X_{[v]}$. Let $N = |\mathcal{V}|$. Let $\bar{\bar{e}}_{[v]}\in\mathbb{R}^{N\times 1}$ be a "padded" vector from $e_{[v]}$, such that the $u$-th element is $\big[e_{[v]}\big]_u$ if $u\in\mathcal{V}_{[v]}$, and 0 otherwise. Therefore, $m_{[v]} = \big[e_{[v]}\big]_v \bar{\bar{e}}_{[v]}^T X$. Then, the difference in aggregation of two nodes $u$ and $v$ is given by

$m_{[v]} - m_{[u]} = \big[e_{[v]}\big]_v \bar{\bar{e}}_{[v]}^T X - \big[e_{[u]}\big]_u \bar{\bar{e}}_{[u]}^T X$  (5.6)
$= \epsilon^T X$,  (5.7)

where $\epsilon = \big[e_{[v]}\big]_v \cdot \bar{\bar{e}}_{[v]} - \big[e_{[u]}\big]_u \cdot \bar{\bar{e}}_{[u]}$. When two nodes $u$ and $v$ have identical neighborhoods, $\mathcal{V}_{[u]} = \mathcal{V}_{[v]}$, the aggregation vectors are identical since $\epsilon = 0$. However, when two nodes have different neighborhoods, we claim that they almost surely have different aggregations. Assume the contrary, i.e., for some $v$ and $u$ with $\mathcal{V}_{[v]} \neq \mathcal{V}_{[u]}$, their aggregations are the same: $m_{[v]} = m_{[u]}$. Then we must have $\epsilon^T X = 0$. Note that, given $\mathcal{G}$, EXTRACT and some continuous distribution to generate $X$, there are only finitely many values for $\epsilon$. The reasons are that 1. $\mathcal{G}$ is finite, so there are only finitely many possible subgraphs, and 2. even though $X$ can take infinitely many values, $\epsilon$ does not depend on $X$.
Each such $\epsilon \neq 0$ defines a hyperplane in $\mathbb{R}^N$ by $\epsilon\cdot x = 0$ (where $x\in\mathbb{R}^N$). Let $H$ be the finite set of all such hyperplanes. In other words, for each $i$, $X_{:,i}$ would have to fall on one of the hyperplanes in $H$. However, since $X$ is generated from a continuous distribution in $\mathbb{R}^{N\times d}$, $X_{:,i}$ almost surely does not fall on any of the hyperplanes in $H$. Therefore, for any $v$ and $u$ such that $\mathcal{V}_{[v]} \neq \mathcal{V}_{[u]}$, we have $m_{[v]} \neq m_{[u]}$ almost surely. For a more general $\phi_{\mathcal{G}}(v)$ applied to $m_{[v]}$, since $\phi_{\mathcal{G}}$ does not depend on $X$, the proof follows exactly the same steps as above.

Corollary 5.1. Consider EXTRACT$_1$, where $\forall v\in\mathcal{V}$, $|\mathcal{V}_{[v]}| \leq n$. Then $|\mathcal{M}| \geq \lceil |\mathcal{V}|/n \rceil$ almost surely.

Corollary 5.2. Consider EXTRACT$_2$, where $\forall u,v\in\mathcal{V}$, $\mathcal{V}_{[v]} \neq \mathcal{V}_{[u]}$. Then $|\mathcal{M}| = |\mathcal{V}|$ almost surely.

Proof of Corollary 5.1. Note that any subgraph contains the target node itself, i.e., $\forall u\in\mathcal{V}$, $u\in\mathcal{V}_{[u]}$. Therefore, for any node $v$, there are at most $n-1$ other nodes in $\mathcal{V}$ with the same neighborhood as $\mathcal{V}_{[v]}$; such nodes must lie in $\mathcal{V}_{[v]}$. By Theorem 5, $\forall v\in\mathcal{V}$, there are at most $n-1$ other nodes in $\mathcal{V}$ having the same aggregation as $\overline{m}_{[v]}$. Equivalently, the total number of possible aggregations is at least $\lceil |\mathcal{V}|/n \rceil$.

Proof of Corollary 5.2. By definition of EXTRACT$_2$, any pair of nodes has non-identical neighborhoods. By Theorem 5, any pair of nodes has non-identical aggregations. Equivalently, all nodes have different aggregations and $|\mathcal{M}| = |\mathcal{V}|$.

Figure 5.1: Example $\mathcal{G}'_{[v]}$ and $\mathcal{G}''_{[v]}$ showing that the normal GraphSAGE model cannot approximate the linear function $\tau$. The graph on the left shows that GraphSAGE can only perform 2 rounds of neighbor propagation to exclude the influence of $v'$. The graph on the right shows that a 2-layer GraphSAGE cannot generate different representations for $v$ in $\mathcal{G}'_{[v]}$ and in $\mathcal{G}''_{[v]}$.

Theorem 5 proves that SHADOW-GCN does not oversmooth: 1. A normal GCN pushes the aggregations of same-degree nodes to the same point, while SHADOW-GCN with EXTRACT$_2$ ensures any
two nodes (even with the same degree) have different aggregations. 2. A normal GCN wipes out all information in $X$ after many rounds of aggregation, while SHADOW-GCN always preserves feature information. In particular, with $\phi_{\mathcal{G}}(v) = \delta_{[v]}(v)^{-1/2}$, a normal GCN generates only one unique value of $\overline{m}$ for all $v$. By contrast, SHADOW-GNN generates $|\mathcal{V}|$ different values for any $\phi_{\mathcal{G}}$ function.

5.2.3 Expressivity from the Function Approximator Perspective

We compare expressivity by showing that 1. SHADOW-SAGE can express all functions GraphSAGE can, and 2. SHADOW-SAGE can express some functions GraphSAGE cannot. Recall that a GraphSAGE layer performs the following: $x_v^{(\ell)} = \sigma\big( \big(W_1^{(\ell)}\big)^T x_v^{(\ell-1)} + \big(W_2^{(\ell)}\big)^T \frac{1}{|\mathcal{N}_v|}\sum_{u\in\mathcal{N}_v} x_u^{(\ell-1)} \big)$. We can prove point 1 by making an $L'$-layer SHADOW-SAGE identical to an $L$-layer GraphSAGE with the following steps: 1. let EXTRACT return the full $L$-hop neighborhood, and 2. set $W_1^{(\ell)} = I$, $W_2^{(\ell)} = 0$ for $L+1 \leq \ell \leq L'$. For point 2, we consider a target function $\tau\big(X, \mathcal{G}_{[v]}\big) = C\cdot\sum_{u\in\mathcal{V}_{[v]}}\delta_{[v]}(u)\cdot x_u$ for some neighborhood $\mathcal{G}_{[v]}$, scaling constant $C$, and $\delta_{[v]}(u)$ as defined in Proposition 4. An expressive model should be able to learn this simple linear function $\tau$ well.

GraphSAGE cannot learn $\tau$ accurately, while SHADOW-SAGE can. We first show the GraphSAGE case. Let the depth of $\mathcal{G}_{[v]}$ be $L$. First, we need GraphSAGE to perform message passing exactly $L$ times (such a model can be implemented by, e.g., $L$ layers, or $L'$ layers with $W_2 = 0$ for $L'-L$ of the layers). Otherwise, the extra $L'-L$ message passings would propagate influence from nodes $v'\not\in\mathcal{V}_{[v]}$, violating the condition that $\tau$ is independent of $v'$. Next, suppose GraphSAGE can learn a function $\zeta$ such that on some $\mathcal{G}'_{[v]}$, we have $\zeta\big(\mathcal{G}'_{[v]}\big) = \tau\big(\mathcal{G}'_{[v]}\big)$. We construct another $\mathcal{G}''_{[v]}$ by adding an extra edge $e$ connecting two depth-$L$ nodes in $\mathcal{G}'_{[v]}$. Edge $e$ changes the degree distribution $\delta_{[v]}(\cdot)$, and thus $\tau\big(\mathcal{G}'_{[v]}\big) \neq \tau\big(\mathcal{G}''_{[v]}\big)$.
On the other hand, there is no way for GraphSAGE to propagate the influence of edge $e$ to the target $v$, unless the model performs at least $L+1$ message passings. So $\zeta\big(\mathcal{G}'_{[v]}\big) = \zeta\big(\mathcal{G}''_{[v]}\big)$ regardless of the activation function and weight parameters. Therefore, $\zeta \neq \tau$.

For SHADOW-SAGE, let EXTRACT return $\mathcal{G}_{[v]}$. Then the model can output $\zeta' = \big[\widehat{A}_{[v]}^{L'} X\big]_{v,:}$ after we 1. set $W_1^{(\ell)} = 0$ and $W_2^{(\ell)} = I$ for all layers, and 2. either remove the non-linear activation or bypass ReLU by shifting $X$ with bias. With known results in the Markov chain convergence theorem [59], we derive the following theorem by analyzing the convergence of $\widehat{A}_{[v]}^{L'}$ as $L'\to\infty$.

Theorem 6. SHADOW-SAGE can approximate $\tau$ with error decaying exponentially with depth.

Proof. SHADOW-SAGE follows the GraphSAGE way of normalizing the adjacency matrix; thus, each row of $\widehat{A}_{[v]}$ sums to 1. Such normalization enables us to view $\widehat{A}_{[v]}$ as the transition matrix of a Markov chain. The coefficients of the linear function $\tau$ (i.e., $\delta_{[v]}(\cdot)$) equal the limiting distribution of such a Markov chain. Therefore, we can use the convergence theorem of Markov chains to characterize the convergence of $\widehat{A}_{[v]}^{L'}$ towards $\tau$'s coefficients. We can also use the mixing time of the Markov chain to derive the error bound of SHADOW-SAGE. Our proof is built on the existing theoretical results in [59].

We rewrite $\tau$ as $\tau = C\cdot\delta X$, where $\delta$ is a length-$|\mathcal{V}|$ vector with $|\mathcal{V}_{[v]}|$ non-zero elements, each corresponding to $\delta_{[v]}(u)$ of the neighborhood $\mathcal{G}_{[v]}$. Further, denote by $\widehat{\delta}$ the normalized $\delta$ vector, i.e., each non-zero element of $\widehat{\delta}$ equals $\delta_{[v]}(u)\,/\,\sum_{w\in\mathcal{V}_{[v]}}\delta_{[v]}(w)$. So $\tau = C'\cdot\widehat{\delta}X$ with some scaling factor $C'$. For ease of discussion, we ignore $C'$, as any scaling factor can easily be expressed by the model weight parameters. SHADOW-SAGE can express $\big[\widehat{A}_{[v]}^{L'} X\big]_{v,:} = \big[\widehat{A}_{[v]}^{L'}\big]_{v,:} X$. So now we need to show how $\big[\widehat{A}_{[v]}^{L'}\big]_{v,:}$ converges to $\widehat{\delta}$ as $L'\to\infty$.
We establish the following correspondence:

$\widehat{A}_{[v]}$ as the Markov transition matrix. This is easily seen since each row of $\widehat{A}_{[v]}$ sums to 1. Further, if we add self-loops in $\mathcal{V}_{[v]}$ and consider $\mathcal{G}_{[v]}$ as undirected and connected (as can be guaranteed by EXTRACT), then $\widehat{A}_{[v]}$ is irreducible, aperiodic and reversible. Irreducibility is guaranteed by $\mathcal{G}_{[v]}$ being connected; aperiodicity is guaranteed by the self-loops. For reversibility, we can prove it via the stationary distribution of $\widehat{A}_{[v]}$. As shown in the next paragraph, $\widehat{\delta}$ is the stationary distribution of $\widehat{A}_{[v]}$. So we need to show that

$\big[\widehat{\delta}\big]_u \big[\widehat{A}_{[v]}\big]_{u,w} = \big[\widehat{\delta}\big]_w \big[\widehat{A}_{[v]}\big]_{w,u}$  (5.8)

Consider two cases. If $(u,w)\not\in\mathcal{E}_{[v]}$, then $(w,u)\not\in\mathcal{E}_{[v]}$ and both sides of Equation 5.8 equal 0. If $(u,w)\in\mathcal{E}_{[v]}$, then $\big[\widehat{\delta}\big]_u = \delta_{[v]}(u)\,/\,\sum_{x\in\mathcal{V}_{[v]}}\delta_{[v]}(x)$ and $\big[\widehat{A}_{[v]}\big]_{u,w} = 1/\delta_{[v]}(u)$. So both sides of Equation 5.8 equal $1\,/\,\sum_{x\in\mathcal{V}_{[v]}}\delta_{[v]}(x)$. Thus, Equation 5.8 holds and $\widehat{A}_{[v]}$ is reversible.

$\widehat{\delta}$ as the limiting distribution. By definition, the stationary distribution $\pi$ satisfies

$\pi = \pi\widehat{A}_{[v]}$  (5.9)

It is easy to see that setting $\pi = \widehat{\delta}^T$ satisfies Equation 5.9, so $\widehat{\delta}$ is the stationary distribution. For an irreducible and aperiodic Markov chain, the stationary distribution is also the limiting distribution, and thus

$\lim_{L'\to\infty}\big[\widehat{A}_{[v]}^{L'}\big]_{v,:} = \widehat{\delta}$  (5.10)

So far, we have already shown that when the SHADOW-SAGE model is deep enough (i.e., with large $L'$), the model output approximates $\tau$:

$\lim_{L'\to\infty}\big[\widehat{A}_{[v]}^{L'} X\big]_{v,:} = \tau$  (5.11)

Desired model depth as the mixing time. Next, we determine how many layers $L'$ are required for the SHADOW-SAGE model to reach a given error bound. First, directly applying Theorem 4.9 of [59], we know that the error of SHADOW-SAGE approximating $\tau$ reduces exponentially with the model depth $L'$. Then, from Equation 4.36 of [59], we can directly relate the mixing time of the Markov chain to the SHADOW-SAGE depth required to reach any error $\epsilon$.
Finally, by Theorem 12.3 of [59], the mixing time can be bounded by the "absolute spectral gap" of the transition matrix $\widehat{A}_{[v]}$. Note that Theorem 12.3 applies when the Markov chain is reversible, a condition satisfied by $\widehat{A}_{[v]}$ as discussed above. The absolute spectral gap is calculated from the eigenvalues of the transition matrix $\widehat{A}_{[v]}$. In summary, SHADOW-SAGE can approximate $\tau$ with error decaying exponentially with the depth $L'$.

We draw the following conclusions from the above: 1. SHADOW-SAGE is more expressive than GraphSAGE. 2. An appropriate EXTRACT function improves SHADOW-GNN expressivity. 3. There exist cases where it may be desirable to set the SHADOW-GNN depth much larger than the subgraph depth.

5.2.4 Expressivity from the Topology Learning Perspective

While GCN and GraphSAGE are popular architectures in practice, they are not the theoretically most discriminative ones. The work in [88] establishes the relation in discriminativeness between GNNs and the 1-dimensional Weisfeiler-Lehman test (i.e., 1-WL), and GIN [88] is an example architecture achieving the same discriminativeness as 1-WL. We show that applying the decoupling principle can further improve the discriminativeness of such GNNs, making them more powerful than 1-WL.

1-WL is a graph isomorphism test aimed at distinguishing graphs of different structures. A GNN as expressive as 1-WL thus captures the topological properties of the target node well. While 1-WL is already very powerful, it may still fail in some cases; e.g., it cannot distinguish certain non-isomorphic regular graphs.

Figure 5.2: An example 3-regular graph and the 1-hop subgraphs of the target nodes

Figure 5.3: An example 2-regular graph with two connected components (CC)

To understand why SHADOW-GNN works, we first need to understand why 1-WL fails.
In a regular graph, all nodes have the same degree, and thus the "regular" property describes a global topological symmetry among nodes. Unfortunately, 1-WL (and the corresponding normal GNN) also operates globally on $\mathcal{G}$. Intuitively, on two different regular graphs, there is no way for 1-WL (and the normal GNN) to assign different labels by breaking such symmetry. On the other hand, SHADOW-GNN can break such symmetry by applying decoupling. At the beginning of Section 5.2, we have discussed how SHADOW-GNN is built from the local perspective on the full graph. The key property benefiting SHADOW-GNN is that a subgraph of a regular graph may not be regular. Thus, SHADOW-GNN can distinguish nodes in a regular graph with the non-regular subgraphs as the scope.

We illustrate the intuition with the example in Figure 5.2. The graph $\mathcal{G}$ is 3-regular and we assume all nodes have identical features. Our goal is to discriminate nodes $u$ and $v$ since their neighborhood structures are different. No matter how many iterations 1-WL runs, or how many layers the normal GNN has, they cannot distinguish $u$ and $v$. On the other hand, a SHADOW-GNN with 1-hop EXTRACT and at least 2 layers can discriminate $u$ and $v$.

Theorem 7. Consider GNNs whose layer function is defined by

    $x_v^{(\ell)} = f_1^{(\ell)}\Bigl( x_v^{(\ell-1)},\; \sum_{u \in \mathcal{N}_v} f_2^{(\ell)}\bigl( x_v^{(\ell-1)}, x_u^{(\ell-1)} \bigr) \Bigr)$    (5.12)

where $f_1^{(\ell)}$ and $f_2^{(\ell)}$ are the update and message functions of layer $\ell$, implemented as MLPs. Then, such a SHADOW-GNN is more discriminative than the 1-dimensional Weisfeiler-Lehman test.

Proof of Theorem 7. We consider GNNs whose layer function is defined by Equation 5.12. Define $\mathcal{G}^L_{[v]}$ as the subgraph induced from all of $v$'s $\ell$-hop neighbors in $\mathcal{G}$, where $1 \le \ell \le L$. First, we show that a SHADOW-GNN following Equation 5.12 is at least as discriminative as the 1-dimensional Weisfeiler-Lehman test (1-WL). We first prove that, given any $\mathcal{G}^L_{[v]}$, an $L'$-layer GNN can have exactly the same output as an $L$-layer GNN, where $L' > L$.
We note that for the target node $v$, the only difference between these two architectures is that an $L$-layer GNN performs exactly $L$ message passing iterations to propagate node information from at most $L$ hops away, while an $L'$-layer GNN has $L' - L$ more message passing iterations before performing the remaining $L$ message passings. Thanks to the universal approximation theorem [39], we can let the MLPs implementing $f_1$ and $f_2$ learn the following GNN layer function:

    $x_v^{(\ell)} = f_1^{(\ell)}\Bigl( x_v^{(\ell-1)},\; \sum_{u \in \mathcal{N}_v} f_2^{(\ell)}\bigl( x_v^{(\ell-1)}, x_u^{(\ell-1)} \bigr) \Bigr) = x_v^{(\ell-1)}, \quad \forall\, 1 \le \ell \le L' - L$

This means $x_v^{(0)} = x_v^{(L'-L)}$. Then the $L'$-layer GNN will have the same output as the $L$-layer GNN. According to [88], the normal GNN (i.e., $L$-layer on $\mathcal{G}^L_{[v]}$) following Equation 5.12 is as discriminative as 1-WL. Thus, SHADOW-GNN (i.e., $L'$-layer on $\mathcal{G}^L_{[v]}$) following Equation 5.12 is at least as discriminative as 1-WL.

Next, we show that there exists a graph where SHADOW-GNN can discriminate topologically different nodes while 1-WL cannot. The example graph in Figure 5.2 is one such graph. $\mathcal{G}$ is connected and 3-regular. The nodes $u$ and $v$ marked in red and blue have different topological roles, and thus an ideal model / algorithm should assign different labels to them. Suppose all nodes in $\mathcal{G}$ have identical features. 1-WL will assign the same label to all nodes (including $u$ and $v$) no matter how many iterations it runs. For SHADOW-GNN, if we let EXTRACT return $\mathcal{G}^1_{[v]}$ and $\mathcal{G}^1_{[u]}$ (i.e., 1-hop based EXTRACT), and set $L' > 1$, then our model can assign different labels to $u$ and $v$. To see why, note that $\mathcal{G}^1_{[u]}$ and $\mathcal{G}^1_{[v]}$ are non-isomorphic, and more importantly, non-regular. So if we run the "SHADOW" version of 1-WL (i.e., 1-WL on $\mathcal{G}^1_{[v]}$ and $\mathcal{G}^1_{[u]}$ rather than on $\mathcal{G}$), the nodes $u$ and $v$ will be assigned different labels after two iterations. Equivalently, SHADOW-GNN can discriminate $u$ and $v$ with at least 2 layers.
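The two halves of the argument (1-WL stuck on a regular graph, but decisive on the extracted non-regular subgraphs) can be reproduced with a small color-refinement routine. The graph below is a hypothetical 2-regular example in the spirit of Figure 5.3 (a triangle plus a square), since the exact graphs of Figures 5.2 and 5.3 are not reproduced in this text:

```python
def wl_colors(adj, iters=10):
    """1-WL color refinement. adj maps node -> list of neighbors.
    All nodes start with the same color (identical features)."""
    color = {v: 0 for v in adj}
    for _ in range(iters):
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v]))) for v in adj}
        palette = {}
        color = {v: palette.setdefault(sig[v], len(palette)) for v in adj}
    return color

# A 2-regular graph G: a triangle (target u = 0) and a square (target v = 3).
G = {0: [1, 2], 1: [0, 2], 2: [0, 1],
     3: [4, 6], 4: [3, 5], 5: [4, 6], 6: [5, 3]}
c = wl_colors(G)
assert c[0] == c[3]   # 1-WL cannot separate u and v on the regular full graph

# 1-hop induced subgraphs: G_1[u] is the triangle itself, G_1[v] is the path 4-3-6.
# Running WL on their disjoint union makes the colors comparable across subgraphs.
union = {("u", 0): [("u", 1), ("u", 2)], ("u", 1): [("u", 0), ("u", 2)],
         ("u", 2): [("u", 0), ("u", 1)],
         ("v", 3): [("v", 4), ("v", 6)], ("v", 4): [("v", 3)], ("v", 6): [("v", 3)]}
cu = wl_colors(union)
assert cu[("u", 0)] != cu[("v", 3)]   # WL on the extracted subgraphs separates u and v
```

The subgraph around the square's node is a path rather than a cycle, so its degree sequence already breaks the symmetry after one refinement round, exactly the mechanism the proof exploits.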
We can further generalize to construct more such example graphs (although such generalization is not required by the proof). The guideline we follow is that the full graph $\mathcal{G}$ should be regular, yet the subgraphs around topologically different nodes (e.g., $\mathcal{G}^k_{[v]}$) should be non-isomorphic and non-regular. The graph in Figure 5.3 is another example, a 2-regular graph where nodes $u$ and $v$ can only be differentiated by decoupling the GIN.

Finally, combining the above two parts, SHADOW-GNN following Equation 5.12 is more discriminative than the 1-dimensional Weisfeiler-Lehman test. The theorem also implies that SHADOW-GIN is more discriminative than a normal GIN, due to the correspondence between GIN and 1-WL.

5.2.5 Neural Architecture Extensions

Subgraph pooling For a normal GNN performing node classification, the multi-layer message passing follows a "tree structure". The nodes at level $L$ of the tree correspond to the $L$-hop neighborhood, and the tree root outputs the final embedding of the target node. Thus, there is no way to apply subgraph pooling or READOUT on the final layer output, since the "pool" only contains a single vector. For a SHADOW-GNN, since we decouple the $L$th layer from the $L$-hop neighborhood, it is natural to let each layer (including the final layer) output embeddings for all subgraph nodes. This leads to the design of applying READOUT to all the subgraph node embeddings to produce the target node embedding.

We can understand the pooling for SHADOW-GNN from another perspective. In a normal GNN, the target node at the final layer receives messages from all neighbors, but two neighbor nodes may not have a chance to exchange any message with each other (e.g., two nodes $L$ hops away from the target may be $2L$ hops away from each other). In our design, a SHADOW-GNN can pass messages between any pair of neighbors when the model depth is large enough. Therefore, all the subgraph node embeddings at the final layer capture meaningful information of the neighborhood.
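The READOUT variants discussed here (listed in Table 5.1) all reduce the matrix of last-layer subgraph embeddings to a single target embedding. A minimal numpy sketch follows; the "sort" variant omits the trailing MLP, and the descending sort direction is our assumption:

```python
import numpy as np

def readout(H, target_idx, mode="center", s=3):
    """Subgraph READOUT over last-layer embeddings H (|V_[v]| x f).
    'sort' keeps the top-s rows ordered by the last feature channel."""
    if mode == "center":
        return H[target_idx]
    if mode == "sum":
        return H.sum(axis=0)
    if mode == "mean":
        return H.mean(axis=0)
    if mode == "max":
        return H.max(axis=0)           # elementwise max over subgraph nodes
    if mode == "sort":
        order = np.argsort(H[:, -1])[::-1]
        return H[order[:s]].reshape(-1)  # flattened top-s rows (MLP omitted)
    raise ValueError(mode)

H = np.array([[1., 0.], [0., 2.], [3., 1.]])   # toy 3-node subgraph, target node 0
assert np.allclose(readout(H, 0, "center"), [1., 0.])  # "discard all neighbors"
assert np.allclose(readout(H, 0, "sum"), [4., 3.])
assert np.allclose(readout(H, 0, "max"), [3., 2.])
```

Center pooling recovers the vanilla SHADOW-GNN behavior, while the other modes aggregate the whole subgraph, which is only meaningful because every subgraph node carries a useful final-layer embedding.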
In summary, the power of the decoupling principle lies in that it establishes the connection between the node- / link-level task and the graph-level task; e.g., to classify a node is seen as to classify the subgraph surrounding the node. From the neural architecture perspective, we can apply any subgraph pooling / READOUT operation originally designed for graph classification (e.g., [105, 53, 14]) to enhance the node classification / link prediction of SHADOW-GNN. In particular, in the vanilla SHADOW-GNN, we can implement a trivial READOUT as "discarding all neighbor embeddings", corresponding to performing center pooling. Table 5.1 specifies some example pooling functions implemented by SHADOW-GNN.

Table 5.1: READOUT(·) function for different pooling operations, where $H_{[v]}$ is the matrix consisting of all the last-layer representation vectors of nodes in $\mathcal{G}_{[v]}$.

    Name     READOUT(·)
    Center   $[H_{[v]}]_{v,:}$
    Sum      $\sum_{u \in \mathcal{V}_{[v]}} [H_{[v]}]_{u,:}$
    Mean     $\frac{1}{|\mathcal{V}_{[v]}|} \sum_{u \in \mathcal{V}_{[v]}} [H_{[v]}]_{u,:}$
    Max      $\max_{u \in \mathcal{V}_{[v]}} [H_{[v]}]_{u,:}$
    Sort     $\mathrm{MLP}\bigl( [H_{[v]}]_{\operatorname{argsort}([H_{[v]}]_{:,-1})[:s],\,:} \bigr)$

Subgraph ensemble It may be challenging in practice to design a single EXTRACT capturing all meaningful characteristics of the neighborhood. We can use multiple EXTRACT to jointly define the receptive field, and then ensemble multiple SHADOW-GNN at the subgraph level. Consider $R$ candidates $\{\text{EXTRACT}_i\}$, each returning $\mathcal{G}^i_{[v]}$. To generate $v$'s embedding, we first use $R$ branches of $L'$-layer GNN to obtain intermediate embeddings for each $\mathcal{G}^i_{[v]}$, and then aggregate the $R$ embeddings by some learnable function $g$. In practice, we design $g$ as an attention based aggregation function. Subgraph ensemble is useful both when $\{\text{EXTRACT}_i\}$ consists of different algorithms and when each $\text{EXTRACT}_i$ performs the same algorithm under different parameters.

CASE STUDY Consider PPR-based $\text{EXTRACT}_i$ (see Section 5.3) with different thresholds $\theta_i$ on the neighbor PPR score. A SHADOW-GNN-ensemble can approximate PPRGo [16].
PPRGo generates the embedding as $\xi_v = \sum_{u \in \mathcal{V}_{[v]}} \pi_u h_u$, where $\pi_u$ is $u$'s PPR score and $h_u = \mathrm{MLP}(x_u)$. We can partition $\mathcal{V}_{[v]} = \bigcup_{i=1}^{R} \mathcal{V}^i_{[v]}$ s.t. nodes in $\mathcal{V}^i_{[v]}$ have similar PPR scores denoted by $\tilde{\pi}_i$, and $\tilde{\pi}_i \le \tilde{\pi}_{i+1}$. So $\xi_v = \sum_{i=1}^{R} \rho_i \sum_{u \in \mathcal{V}'_i} h_u$, where $\rho_i = \tilde{\pi}_i - \tilde{\pi}_{i-1}$ (with $\tilde{\pi}_0 = 0$) and $\mathcal{V}'_i = \bigcup_{j=i}^{R} \mathcal{V}^j_{[v]}$.

Our PPR-based EXTRACT proceeds in three steps: 1. compute the approximate PPR vector $\pi_v$ for the target node $v$; 2. select the top-$p$ neighbors whose PPR score is above a threshold $\theta$; 3. return the subgraph induced on the selected nodes. Then $p$, $\theta$ are hyperparameters. For step 3, notice that our PPR sampler only uses PPR scores as a node filtering condition. The original graph structure is still preserved among $\mathcal{V}_{[v]}$, due to the induced subgraph step. In comparison, other related works [52, 16] do not preserve the original graph structure.

We also empirically highlight that the approximate PPR is both scalable and efficient. The number of nodes we need to visit to obtain the approximate $\pi_v$ is much smaller than the graph size $|\mathcal{V}|$. In addition, each visit only involves a scalar update, which is orders of magnitude cheaper than the cost of neighbor propagation in a GNN. We profile three datasets: Reddit, Yelp and ogbn-products. As shown in Table 5.2, for each target node on average, the number of nodes touched by the PPR computation is comparable to the full 2-hop neighborhood size. This also indicates that faraway neighbors do not have much topological significance. We further show the empirical execution time measurement on multi-core CPU machines in Figure 5.6.

Table 5.2: Average number of nodes touched by the approximate PPR computation on 3 large graphs

    Dataset        Avg. 2-hop size    Avg. # nodes touched by PPR
    Reddit         11093              27751
    Yelp           2472               5575
    ogbn-products  3961               5405

5.3.2 Other Methods Based on Modeling and Learning

Model based We can understand the original graph $\mathcal{G}$ as the union of the $\mathcal{G}_{[v]}$, and thus if $\mathcal{G}$ is associated with a graph generation process, EXTRACT describes the reverse of such a process. This links to the graph generation literature [58]; e.g., a forest-fire generation model would correspond to EXTRACT being a forest-fire sampler [57].
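Before moving on, the PPR-based EXTRACT described above can be sketched end-to-end: a local forward-push PPR approximation (in the style of Andersen et al.), score-based filtering, then the induced subgraph. The push rule and the parameter names (alpha, eps) are generic textbook choices, not necessarily the exact variant used in SHADOW-GNN:

```python
def approx_ppr(adj, v, alpha=0.85, eps=1e-5):
    """Forward-push approximate PPR from target v.
    Only nodes near v are ever touched, which keeps EXTRACT local and cheap."""
    p, r = {}, {v: 1.0}                 # estimates and residuals
    frontier = [v]
    while frontier:
        u = frontier.pop()
        ru = r.get(u, 0.0)
        if ru < eps * len(adj[u]):      # residual too small: nothing to push
            continue
        r[u] = 0.0
        p[u] = p.get(u, 0.0) + (1 - alpha) * ru   # keep the restart mass
        share = alpha * ru / len(adj[u])          # spread the rest to neighbors
        for w in adj[u]:
            r[w] = r.get(w, 0.0) + share
            frontier.append(w)
    return p

def ppr_extract(adj, v, budget=3, theta=0.0):
    """Keep the top-`budget` nodes with PPR score above theta, then return the
    induced subgraph so the original edges among kept nodes are preserved."""
    scores = approx_ppr(adj, v)
    keep = set(sorted((u for u in scores if scores[u] > theta),
                      key=lambda u: -scores[u])[:budget]) | {v}
    return {u: [w for w in adj[u] if w in keep] for u in keep}

# toy graph: a star 0-{1,2,3} with a tail 3-4
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0, 4], 4: [3]}
sub = ppr_extract(adj, 0, budget=3)
assert 0 in sub and all(w in sub for u in sub for w in sub[u])
```

Each push is a scalar update on a residual, matching the observation that a PPR visit is far cheaper than a GNN neighbor propagation, and the residual threshold guarantees that only a local region around the target is ever visited.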
More specifically, to extract the subgraph around a node $v$, we can imagine a process of adding $v$ into the partial graph $\mathcal{G}'$ consisting of $\mathcal{V} \setminus \{v\}$. Then the nodes selected by EXTRACT would correspond to the nodes touched by such an imaginary process of adding $v$.

Learning based We may treat the design of EXTRACT as part of GNN training. However, due to the combinatorial nature of subgraph extraction, simultaneously learning EXTRACT and the SHADOW-GNN layer parameters is challenging. The learning of EXTRACT can be made possible with appropriate approximations and relaxations. e.g., we may use the techniques proposed in [91] to design a two-phase learning process. In Phase 1, we train a normal $L$-layer GNN, and then use [91] to identify the important neighbors among the full $L$-hop neighborhood. In Phase 2, we use the neighborhood identified in Phase 1 as the subgraph returned by EXTRACT, and train an $L'$-layer SHADOW-GNN on top of such a neighborhood.

5.3.3 Practical Implementation

We now discuss the practical implementation of the decoupled GNN, SHADOW-GNN. As the name suggests, in SHADOW-GNN, the scope is a shallow subgraph (i.e., with depth often set to 2 or 3). In many realistic scenarios (e.g., citation networks, social networks, product recommendation graphs), a shallow neighborhood is both necessary and sufficient for the GNN to learn well. On "sufficiency", we consider the social network example: the friend of a friend of a friend may share little commonality with you, and close friends may be at most 2 hops away. Formally, by the $\gamma$-decaying theorem [104], a shallow neighborhood is sufficient to accurately estimate various graph metrics. On "necessity", since the neighborhood size may grow exponentially with hops, a deep neighborhood would be dominated by nodes irrelevant to the target. The corresponding GNN would first need to differentiate the many useless nodes from the very few useful ones before it can extract meaningful features from the useful nodes.
Finally, a shallow subgraph ensures scalability by avoiding "neighborhood explosion".

Remark on decoupling. So far we have defined a decoupled model as having the model depth $L'$ larger than the subgraph depth $L$. Strictly speaking, a decoupled model also admits $L' = L$. For example, suppose in the full $L$-hop neighborhood, 70% of the nodes are $L$ hops away. Applying decoupling, the EXTRACT excludes most of the $L$-hop neighbors, and the resulting subgraph $\mathcal{G}_{[v]}$ contains only 20% of nodes $L$ hops away. Then it is reasonable to consider an $L$-layer model on such a depth-$L$ subgraph as also a decoupled model. Compared with an $L$-layer model on the full $L$-hop neighborhood, an $L$-layer model on such a depth-$L$ $\mathcal{G}_{[v]}$ propagates much less information from nodes $L$ hops away. So the $L$ message passings are indeed decoupled from the full $L$-hop neighborhood.

Remark on neighborhood. The "sufficiency" and "necessity" of a shallow neighborhood are not universal. In many other applications, long-range dependencies can be critical, as studied in [9]. In such cases, our practical implementation of SHADOW-GNN would incur accuracy loss. However, our decoupling principle in general may still be beneficial: a "shallow subgraph" is a practical guideline rather than a theoretical requirement. We leave the study of such applications as future work.

5.4 Experiments

5.4.1 Datasets, Learning Tasks and Experimental Setup

We evaluate SHADOW-GNN on seven graphs. Six of them are for the node classification task: Flickr [97], Reddit [37], Yelp [97], ogbn-arxiv, ogbn-products and ogbn-papers100M [41]. The remaining one is for the link prediction task: ogbl-collab [41]. The sizes of the seven graphs range from 89K nodes (Flickr) to 110M nodes (ogbn-papers100M). Flickr, Reddit and Yelp are under the inductive setting. ogbn-arxiv, ogbn-products and ogbn-papers100M are transductive.
Consistent with the original papers, for the graphs on node classification, we measure the "F1-micro" score for Yelp and "accuracy" for the remaining five graphs. For the link prediction task, we use "Hits@50" as the metric.

Table 5.3: Key statistics of the graphs we evaluate. ogbl-collab is used for link prediction and all the other graphs are for node classification.

    Dataset          Setting       Nodes        Edges          Degree  Feature  Classes  Train / Val / Test
    Flickr           Inductive     89,250       899,756        10      500      7        0.50 / 0.25 / 0.25
    Reddit           Inductive     232,965      11,606,919     50      602      41       0.66 / 0.10 / 0.24
    Yelp             Inductive     716,847      6,977,410      10      300      100      0.75 / 0.10 / 0.15
    ogbn-arxiv       Transductive  169,343      1,166,243      7       128      40       0.54 / 0.18 / 0.29
    ogbn-products    Transductive  2,449,029    61,859,140     25      100      47       0.10 / 0.02 / 0.88
    ogbn-papers100M  Transductive  111,059,956  1,615,685,872  29      128      172      0.78 / 0.08 / 0.14
    ogbl-collab      –             235,868      1,285,465      8       128      –        0.92 / 0.04 / 0.04

Table 5.3 summarizes the key statistics of the graphs we evaluate. For ogbn-papers100M, only around 1% of the nodes are associated with ground truth labels. The training / validation / test sets are split only among those labelled nodes.

We construct SHADOW with six backbone models: GCN [50], GraphSAGE [37], GAT [78], JK-Net [89], GIN [88] and SGC [84]. The first five are representatives of the state-of-the-art GNN architectures, jointly covering various message aggregation functions as well as the skip connection design. SGC simplifies the normal GCN by moving all the neighbor propagation to the pre-processing step; therefore, SGC is suitable for evaluating oversmoothing. The non-SHADOW models are trained with both full-batch and GraphSAINT-minibatch [97]. Due to the massive size of the full $L$-hop neighborhood, we need to perform sampling when training normal GNNs in order to make the computation time tractable. GraphSAINT is suitable for our purpose since 1. it is the state-of-the-art minibatch method which achieves high accuracy, and 2.
it supports various GNN architectures and scales well to large graphs. On the other hand, for SHADOW-GNN, both training and inference are always executed in minibatches. One advantage of SHADOW-GNN is that the decoupling enables straightforward minibatch construction: each target node just independently extracts its own small subgraph.

We implement the two EXTRACT described in Section 5.3: 1. "PPR", where we set the node budget $K$ as $\{200, 400\}$ for the largest ogbn-papers100M and $K \le 200$ for all other graphs; 2. "$L$-hop", where we set the depth as $\{1, 2\}$. We implement two subgraph pooling functions: "mean" and "max". For the model depth, since $L' = 3$ is the standard setting in the literature (e.g., see the benchmarking in OGB [41]), we start from $L' = 3$ and further evaluate a deeper model of $L' = 5$. All accuracy values are measured over 5 runs without fixing random seeds.

To evaluate the feasibility of SHADOW-GNN for various realistic applications, we run SHADOW-GNN under various hardware configurations, ranging from low-end desktops to high-end servers. We observe that the training and inference of SHADOW-GNN can be easily adapted to the amount of available hardware resources by adjusting the batch size.

• MACHINE 1: a low-end desktop machine with a 4-core Intel Core i7-6700 CPU @3.4GHz, 16GB RAM and one NVIDIA GeForce GTX 1650 GPU with 4GB memory.

• MACHINE 2: a low-end server with a 28-core Intel Xeon Gold 5120 CPU @2.2GHz, 128GB RAM and one NVIDIA Titan Xp GPU with 12GB memory.

• MACHINE 3: a more powerful server with a 64-core AMD Ryzen Threadripper 3990x CPU @2.9GHz, 256GB RAM and three NVIDIA GeForce RTX 3090 GPUs with 24GB memory each.

From the GPU perspective, the low-end GTX 1650 GPU can support the SHADOW-GNN computation on all the graphs (including ogbn-papers100M).
However, the limited RAM size of MACHINE 1 limits its usage to only Flickr, Reddit, ogbn-arxiv and ogbl-collab. The other two servers, MACHINE 2 and MACHINE 3, are used to train all of the seven graphs. Table 5.4 summarizes our recommended minimum hardware resources to run SHADOW-GNN. Note that regardless of the model, larger graphs inevitably require larger RAM due to the growth of the raw features (the raw data files of ogbn-papers100M already take 70GB). However, larger graphs do not correspond to higher GPU requirements for SHADOW-GNN, since the GPU memory consumption is controlled by the batch size parameter.

Figure 5.4: Difference in neighborhood composition between the normal GNN and SHADOW-GNN. The PPR-based EXTRACT filters out most distant neighbors in the original graph and thus the resulting subgraph is dominated by nearby neighbors.

Table 5.4: Recommended minimum hardware resources for SHADOW-GNN

    Dataset          |V|     CPU cores  RAM    GPU memory
    ogbn-arxiv       0.2M    4          8GB    4GB
    ogbn-products    2.4M    4          32GB   4GB
    ogbn-papers100M  111.1M  4          128GB  4GB

Figure 5.5: Oversmoothing of SGC and local smoothing of SHADOW-SGC. The "power on $\tilde{A}$ or $\tilde{A}_{[v]}$" is equivalent to the depth of a GCN without non-linear activation.

5.4.2 Characteristic of the GNN Scope

SHADOW-GNN neighborhood For both normal and SHADOW GNNs, Figure 5.4 shows on average how many neighbors are $L$ hops away from the target. For a normal GNN, the size of the neighborhood grows rapidly with respect to $L$, and the nodes 4 hops away dominate the neighborhood. For SHADOW-GNN using the Table 5.5 EXTRACT, most neighbors concentrate within 2 hops. A small number of nodes are 3 hops away.
Almost no nodes are 4 or more hops away. Importantly, not only does the composition of the two kinds of neighborhood differ significantly, but the size of the SHADOW-GNN scope is also much smaller (see also Table 5.5). Such small subgraphs are essential to high computation efficiency. Finally, we can ignore the very few distant neighbors ($L \ge 4$) and regard the (effective) depth of the SHADOW-GNN subgraph as $L = 2$ (or at most 3). Under such practical values of $L$, a model with $L' \ge 3$ is indeed a SHADOW-GNN (see "Remark on decoupling" in Section 5.3.3).

Cost of neighborhood extraction We evaluate the PPR EXTRACT in terms of its execution time overhead and accuracy-time tradeoff. In Figure 5.6, we parallelize EXTRACT using half of the available CPU cores of MACHINE 3 (i.e., 32 cores) and execute the GNN computation on the RTX 3090 GPU. Clearly, the PPR EXTRACT is lightweight: the sampling time is lower than the GNN computation time in most cases. In addition, the sampling time per node does not grow with the full graph size. This shows that SHADOW-GNN is scalable to massive-scale graphs. By the discussion in Section 5.3.1, the approximate PPR computation achieves efficiency and scalability by only traversing a local region around each target node.

Figure 5.6: Measured execution time for the PPR EXTRACT and the GNN model computation.

5.4.3 Accuracy and Inference Speed for Node Classification

Comparison with baselines Table 5.5 shows the performance comparison of SHADOW-GNN with the normal GNNs. All models on all datasets have a uniform hidden dimension of 256. We define the metric "inference cost" as the average amount of computation to generate the prediction for one test node. Inference cost is a measure of computation complexity and is independent of hardware / implementation factors such as parallelization strategy, batch processing, etc.
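The gap in inference cost can be illustrated with a back-of-envelope operation count that only models the dense weight transformations, assuming an idealized $d$-regular neighborhood with no repeated nodes. The degree, feature dimension, and subgraph size below are illustrative values, not measurements from this chapter:

```python
def normal_gnn_cost(d, L, f):
    """Multiply-accumulate count for an L-layer GNN on the full L-hop
    neighborhood of one target: layer ell must transform every node
    within (L - ell) hops, and the hop sizes grow as d**k."""
    cost = 0
    for ell in range(1, L + 1):
        n_nodes = sum(d ** k for k in range(L - ell + 1))
        cost += n_nodes * f * f
    return cost

def shadow_cost(subgraph_size, L_prime, f):
    """SHADOW-GNN re-uses one fixed small subgraph at every layer."""
    return L_prime * subgraph_size * f * f

# e.g. degree 25, hidden dim 256, 200-node PPR subgraph, 5 layers
ratio = normal_gnn_cost(25, 5, 256) / shadow_cost(200, 5, 256)
assert ratio > 100   # a two-to-three orders-of-magnitude gap
```

Under these assumptions the 5-layer full-neighborhood model is several hundred times more expensive per target than the 5-layer SHADOW model, the same "E3 vs. E0" scale of gap reported in Table 5.5.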
For the cost of SHADOW-GNN, we do not include the overhead of computing EXTRACT, since it is hard to calculate such cost analytically. Empirically, subgraph extraction is much cheaper than model computation (see Figures 5.6 and 5.7 for time evaluation on CPU and GPU). During training, we apply DropEdge [73] to both the baseline and SHADOW models. DropEdge helps improve the baseline accuracy by alleviating oversmoothing, and benefits SHADOW-GNN due to its regularization effects.

Table 5.5: Comparison with the state-of-the-art on test accuracy / F1-micro score and inference cost (both SHADOW-GNN and the baselines are tuned with DropEdge to improve the training quality of deep models).

    Method               Layers  Flickr                  Reddit                  Yelp                    ogbn-arxiv              ogbn-products
                                 Accuracy        Cost    Accuracy        Cost    F1-micro        Cost    Accuracy        Cost    Accuracy        Cost
    GCN                  3       0.5159±0.0017   2E0     0.9532±0.0003   6E1     0.4028±0.0019   2E1     0.7170±0.0026   1E1     0.7567±0.0018   5E0
                         5       0.5217±0.0016   2E2     0.9495±0.0012   1E3     OOM             1E3     0.7186±0.0017   1E3     OOM             9E2
    GCN-SAINT            3       0.5155±0.0027   2E0     0.9523±0.0003   6E1     0.5110±0.0012   2E1     0.7093±0.0003   1E1     0.8003±0.0024   5E0
                         5       0.5165±0.0026   2E2     0.9523±0.0012   1E3     0.5012±0.0021   1E3     0.7039±0.0020   1E3     0.7992±0.0021   9E2
    SHADOW-GCN (PPR)     3       0.5234±0.0009   (1)     0.9576±0.0005   (1)     0.5291±0.0020   (1)     0.7180±0.0024   (1)     0.7742±0.0037   (1)
                         5       0.5268±0.0008   1E0     0.9564±0.0004   1E0     0.5323±0.0020   2E0     0.7206±0.0025   2E0     0.7821±0.0043   2E0
      +Pooling           3/5     0.5286±0.0013   1E0     0.9624±0.0002   1E0     0.5393±0.0036   2E0     0.7223±0.0018   2E0     0.7914±0.0044   2E0
    GraphSAGE            3       0.5140±0.0014   3E0     0.9653±0.0002   5E1     0.6178±0.0033   2E1     0.7192±0.0027   1E1     0.7858±0.0013   4E0
                         5       0.5154±0.0052   2E2     0.9626±0.0004   1E3     OOM             2E3     0.7193±0.0037   1E3     OOM             1E3
    SAGE-SAINT           3       0.5176±0.0032   3E0     0.9671±0.0003   5E1     0.6453±0.0011   2E1     0.7107±0.0003   1E1     0.7923±0.0023   4E0
                         5       0.5201±0.0032   2E2     0.9670±0.0010   1E3     0.6394±0.0003   2E3     0.7013±0.0021   1E3     0.7964±0.0022   1E3
    SHADOW-SAGE (2-hop)  3       0.5312±0.0019   1E0     0.9672±0.0003   1E0     0.6542±0.0002   1E0     0.7163±0.0028   1E0     0.7935±0.0031   1E0
                         5       0.5335±0.0015   2E0     0.9675±0.0005   2E0     0.6525±0.0003   2E0     0.7180±0.0030   2E0     0.7995±0.0022   2E0
    SHADOW-SAGE (PPR)    3       0.5356±0.0013   (1)     0.9688±0.0002   (1)     0.6538±0.0003   (1)     0.7227±0.0012   (1)     0.7905±0.0026   (1)
                         5       0.5417±0.0006   2E0     0.9692±0.0007   2E0     0.6518±0.0002   2E0     0.7238±0.0007   2E0     0.8005±0.0040   2E0
      +Pooling           3/5     0.5395±0.0013   2E0     0.9703±0.0003   2E0     0.6564±0.0004   1E0     0.7258±0.0017   2E0     0.8067±0.0037   1E0
    GAT                  3       0.5070±0.0032   2E1     OOM             3E2     OOM             2E2     0.7201±0.0011   1E2     OOM             3E1
                         5       0.5164±0.0033   2E2     OOM             2E3     OOM             2E3     OOM             3E3     OOM             2E3
    GAT-SAINT            3       0.5225±0.0053   2E1     0.9671±0.0003   3E2     0.6459±0.0002   2E2     0.6977±0.0003   1E2     0.8027±0.0028   3E1
                         5       0.5153±0.0034   2E2     0.9651±0.0024   2E3     0.6478±0.0012   2E3     0.6954±0.0013   3E3     0.7990±0.0072   2E3
    SHADOW-GAT (PPR)     3       0.5349±0.0023   (1)     0.9707±0.0004   (1)     0.6575±0.0004   (1)     0.7235±0.0020   (1)     0.8006±0.0014   (1)
                         5       0.5352±0.0028   2E0     0.9713±0.0004   2E0     0.6559±0.0002   2E0     0.7274±0.0022   2E0     0.8071±0.0004   2E0
      +Pooling           3/5     0.5364±0.0026   1E0     0.9710±0.0004   2E0     0.6566±0.0005   1E0     0.7265±0.0028   2E0     0.8142±0.0031   1E0

ACCURACY We aim at answering the following questions: 1. Can we improve accuracy by decoupling a baseline model? 2. What architecture components can we tune to improve the accuracy of a decoupled model? 3. What EXTRACT can we tune to improve the accuracy of a decoupled model?

To answer Q1, we fix the backbone architecture and remove pooling. Then we inspect "3-layer normal GNN vs. 3-layer SHADOW-GNN" and "5-layer normal GNN vs. 5-layer SHADOW-GNN". Clearly, SHADOW-GNNs (with scope size no more than 200) in general achieve significantly higher accuracy than the normal GNNs.
This indicates that a shallow neighborhood contains sufficient information, and customizing the scope can benefit accuracy even without architecture changes (from Figure 5.4, a depth-3 $\mathcal{G}_{[v]}$ differs significantly from the 3-hop neighborhood). To answer Q2, we focus on the PPR EXTRACT and thus compare the rows in blue background. We use the 3-layer SHADOW-GNN without pooling as the baseline and analyze the effects of 1. increasing the GNN depth without expanding the scope, and 2. adding subgraph pooling. Comparing among the rows in light blue background, we observe that in many cases, simply increasing the depth from 3 to 5 leads to significant accuracy gains. Comparing the light blue rows with the dark blue rows, we observe that sometimes pooling can further improve the accuracy of a SHADOW-GNN. In conclusion, both types of architecture tuning are effective ways of optimizing a SHADOW-GNN. Finally, to answer Q3, we compare the light blue rows with the light yellow rows. In general, the PPR EXTRACT leads to higher accuracy than the 2-hop EXTRACT, demonstrating the importance of designing a good EXTRACT.

Table 5.6: Comparing SHADOW-GNN with the top-performing methods on the public OGB leaderboard for the link prediction dataset ogbl-collab

    Method       Test Hits@50     Val Hits@50
    SEAL         0.5371±0.0047    0.6389±0.0049
    DeeperGCN    0.5273±0.0047    0.6187±0.0045
    LRGA+GCN     0.5221±0.0072    0.6088±0.0059
    SHADOW-SAGE  0.5492±0.0022    0.6524±0.0017

INFERENCE COST The inference cost of SHADOW-GNN is orders of magnitude lower than that of the normal GNNs (a 5-layer SHADOW-GNN is still much cheaper than a 3-layer normal GNN). The high cost of the baselines is due to the "neighborhood explosion" with respect to more layers. SHADOW-GNN is efficient and scalable, as the cost only grows linearly with the model depth. Note that GraphSAINT only improves efficiency during training, since its inference operates on the full $L$-hop neighborhood.
5.4.4 Extension to Link Prediction

We further show that SHADOW-GNN is general and can be extended to the link prediction task. There are two settings of ogbl-collab. We follow the one where validation edges cannot be used in training updates. This is the setting which most leaderboard methods follow. Table 5.6 shows the comparison with the top GNN models under the same setting. SHADOW-SAGE outperforms the rank-1 model with a significant margin.

5.4.5 Scaling to Graphs of 100M Nodes

Table 5.7: Comparing SHADOW-GNN with the top-performing methods on the public OGB leaderboard for the largest public graph ogbn-papers100M

    Method           Test accuracy    Val accuracy     Neigh size
    GraphSAGE+incep  0.6706±0.0017    0.7032±0.0011    4E5
    SIGN-XL          0.6606±0.0019    0.6984±0.0006    > 4E5
    SGC              0.6329±0.0019    0.6648±0.0020    > 4E5
    SHADOW-GAT_200   0.6681±0.0016    0.7019±0.0011    2E2
    SHADOW-GAT_400   0.6708±0.0017    0.7073±0.0011    3E2

Table 5.8: Memory consumption of the ogbn-papers100M leaderboard methods

    Method           CPU RAM    GPU memory
    GraphSAGE+incep  80GB       16GB
    SIGN-XL          >682GB     4GB
    SGC              >137GB     4GB
    SHADOW-GAT       80GB       4GB

We further scale SHADOW-GNN to ogbn-papers100M, one of the largest public datasets. Even though the full graph size is at least two orders of magnitude larger than the graphs in Table 5.5, the localized scope of SHADOW-GNN barely needs to increase. Since SHADOW-GNN performs minibatch computation, a low-end GPU with limited memory capacity can compute SHADOW-GNN on ogbn-papers100M efficiently. Table 5.8 further shows that we can train and run inference on our model with as little as 4GB of GPU memory consumption. This is infeasible using normal GNNs. Table 5.7 summarizes our comparison with the top leaderboard methods [81, 30, 84]. We only include those methods that do not use node labels as the model input (i.e., the most standard setup). We achieve at least 3 orders of magnitude reduction in neighborhood size without sacrificing accuracy.
For SIGN-XL and SGC, their neighborhood is too large to count the exact size. Also, their preprocessing consumes 5× more CPU memory than SHADOW-GNN. Note that our machines do not support the training of SIGN (due to the RAM size constraint), and thus we only show the lower bound of SIGN's RAM consumption. On the other hand, SHADOW-GNN can flexibly adjust its batch size based on the available memory. Even for ogbn-papers100M, a typical low-end server with 4GB of GPU memory and 128GB of RAM is sufficient to train the 5-layer SHADOW-GAT. Increasing the batch size of SHADOW-GNN may further lead to higher GPU utilization on more powerful machines. A 5-layer SHADOW-GAT only requires around 5GB of GPU memory to saturate the computation resources of powerful GPU cards such as the NVIDIA RTX 3090.

5.4.6 Accuracy-Speed Tradeoff During Inference

To evaluate the accuracy-time tradeoff, we take the 5-layer models of Table 5.5 as the pretrained models. Then for SHADOW-GNN, we vary the PPR budget $p$ from 50 to 200 with stride 50. In Figure 5.7, the inference time of SHADOW-GNN already includes the PPR sampling time. Firstly, consistent with Table 5.5, inference of SHADOW-GNN achieves higher accuracy than the normal GNNs, with orders of magnitude speedup as well. In addition, based on the application requirements (e.g., latency constraints), SHADOW-GNNs have the flexibility of adjusting the sampling size without the need for retraining. For example, on Reddit and ogbn-arxiv, directly reducing the subgraph size from 200 to 50 reduces the inference latency by 2× to 4× at the cost of less than 1% accuracy drop.

Figure 5.7: Inference performance tradeoff between accuracy and time.
For each SHADOW-GNN architecture and graph, we pre-train a single 5-layer model and apply it to perform inference on subgraphs of various sizes without re-training.

Chapter 6 Parallelization and Acceleration Techniques

We improve the execution speed of the subgraph-based GNN computation via parallelization and hardware acceleration techniques. By constraining the GNN computation within a small subgraph, the frequently accessed data can be fully stored in the fast memory of the devices (e.g., CPU cache or FPGA on-chip BRAM). Thus, the hardware utilization can be significantly improved compared with executing GNNs directly on the large full graph.

6.1 Intensive Computation, Irregular Memory Accesses and Load Balance

From the computation perspective, there are three challenges in GNN computation.

Intensive computation The propagation along GNN layers involves computationally intensive dense tensor operations on the model weights and graph node features (e.g., $X \cdot W$ in Equations 2.3, 2.8 and 2.10). Fast computation of this step requires high consumption of hardware arithmetic resources. This challenge is also similarly seen when computing other deep learning models, such as CNNs on images [94, 95, 71] and LSTMs on time-series data [82, 35].

Irregular memory accesses The neighbor propagation in a large and sparse graph incurs a high volume of irregular memory accesses, both on-chip and off-chip. The memory challenge is unique to the GNN computation. While CNN accelerators [64, 83, 99, 103, 107] achieve high data reuse and regular accesses by tiling and sliding windows on input tensors, such tensor operations do not generalize to sparse graphs. In addition, while graph processing accelerators [23, 24, 48, 101, 112] optimize data flow and layout for the propagation of node labels, such optimizations hardly lead to performance gain when the labels are replaced with long feature vectors.
Specifically, when computing GraphSAINT and SHADOW-GNN, the memory accesses for subgraph-level neighbor propagation are still irregular. However, all accesses can be constrained within a small range of memory addresses if we store the subgraph data in contiguous memory. This creates the opportunity to address the memory access challenge, since the CPU cache and FPGA BRAM can both support random accesses without latency penalties.

Load-balance  Degree imbalance of graph nodes can significantly degrade the performance of feature propagation. Consequently, the overall layer computation can be load-imbalanced. It is challenging to design on-chip computation modules and the corresponding scheduler to ensure load-balance for arbitrary graphs when the neighbors propagate in parallel. Our solution to ensure load-balance in GraphACT is to serialize the neighbor propagation and then pipeline it with the dense computation on layer weights.

6.2 Computation Kernels

First of all, subgraph sampling is one key computation kernel in our subgraph-based GNN computation. In addition to sampling, the computation within each GNN layer can also be expressed by a few other computation kernels, as discussed in the following.

After obtaining the subgraphs as minibatches, the GNN computation mainly involves forward and backward propagation along the layers. We first analyze in detail the backward propagation computation for the GraphSAGE model [37]. Then we show that all four GNN variants presented in Section 2.1 share the same set of key computation operations, and thus the parallelization strategy can be generally applied to all the models. For the forward propagation, Equations 2.3, 2.5 and 2.7 have already defined all the operations required for the various layers. Next, we derive the equations for calculating gradients.

Starting from the minibatch loss L_s, we first compute the gradient with respect to the classifier output on the subgraph nodes, (X^{MLP})_s.
Then, using the chain rule, we compute the gradients with respect to the variables of the MLP layer and the GNN layers (from layer L back to layer 1). Assume the ReLU activation function. For the layer with cross-entropy loss, the gradients are computed by:

$\nabla_{(X^{\mathrm{MLP}})_s} L_s = \frac{1}{|V_s|} \left( \hat{Y}_s - Y_s \right)$   (6.1)

For the MLP layer, the gradients are computed by:

$\nabla_{W^{(\mathrm{MLP})}} L_s = \left( X_s^{(L)} \right)^{\top} \cdot \operatorname{mask}\left( \nabla_{(X^{\mathrm{MLP}})_s} L_s \right)$   (6.2)

$\nabla_{X_s^{(L)}} L_s = \operatorname{mask}\left( \nabla_{(X^{\mathrm{MLP}})_s} L_s \right) \cdot \left( W^{(\mathrm{MLP})} \right)^{\top}$   (6.3)

For each GNN layer ℓ, the gradients are computed by:

$\nabla_{W_{\circ}^{(\ell)}} L_s = \left( X_s^{(\ell-1)} \right)^{\top} \cdot \operatorname{mask}\left( \left[ \nabla_{X_s^{(\ell)}} L_s \right]_{:,\,0:\frac{1}{2}f^{(\ell)}} \right)$   (6.4)

$\nabla_{W_{\star}^{(\ell)}} L_s = \left( \hat{A}_s X_s^{(\ell-1)} \right)^{\top} \cdot \operatorname{mask}\left( \left[ \nabla_{X_s^{(\ell)}} L_s \right]_{:,\,\frac{1}{2}f^{(\ell)}:f^{(\ell)}} \right)$   (6.5)

$\nabla_{X_s^{(\ell-1)}} L_s = \operatorname{mask}\left( \left[ \nabla_{X_s^{(\ell)}} L_s \right]_{:,\,0:\frac{1}{2}f^{(\ell)}} \right) \cdot \left( W_{\circ}^{(\ell)} \right)^{\top} + \hat{A}_s^{\top} \cdot \operatorname{mask}\left( \left[ \nabla_{X_s^{(\ell)}} L_s \right]_{:,\,\frac{1}{2}f^{(\ell)}:f^{(\ell)}} \right) \cdot \left( W_{\star}^{(\ell)} \right)^{\top}$   (6.6)

From the equations of forward and backward propagation, we observe that the GraphSAGE computation consists of three kernels:

• Feature / gradient propagation in the sparse subgraph, e.g., Â_s X_s^{(ℓ)};
• Dense weight transformation on the feature / gradient, e.g., X_s^{(ℓ−1)} W_◦^{(ℓ)};
• Sparse adjacency matrix transpose, i.e., Â_s^⊤.

In fact, the above three are also the key operations for GCN [50], MixHop [6] and GAT [78]. For GCN [50], the forward propagation only contains one path as compared to the two paths in GraphSAGE (i.e., the two paths being concatenated by the "∥" operation). Therefore, in the backward propagation, we replace Â_s with Ã_s and only keep the terms containing Ã_s in Equation 6.4. For example, we have $\nabla_{X_s^{(\ell-1)}} L_s = \tilde{A}_s^{\top} \cdot \operatorname{mask}\left( \nabla_{X_s^{(\ell)}} L_s \right) \cdot \left( W^{(\ell)} \right)^{\top}$. For MixHop [6], each layer in the forward propagation consists of K paths as compared to the two paths in GraphSAGE. Therefore, we need to introduce the Ã_s^k terms (where 1 ≤ k ≤ K) to Equation 6.4 in the backward pass. For example, we need Ã_s^2 X_s^{(ℓ−1)} to compute ∇_{W_2^{(ℓ)}} L_s. Further note that Ã_s^2 X_s^{(ℓ−1)} = Ã_s · (Ã_s X_s^{(ℓ−1)}).
Even though Ã_s is sparse, the product Ã_s X_s^{(ℓ−1)} is again a dense matrix. So the forward and backward propagation for MixHop does not involve sparse-sparse matrix multiplication, and the MixHop computation can still be covered by the three key operations listed above.

For GAT [78], in the forward pass, we need to compute the attention values for each element in the subgraph adjacency matrix. Such computation, according to Equation 2.11, only involves dense algebra. After obtaining the attention adjacency matrix, the rest of the propagation by Equation 2.10 is the same as that of GCN. In the backward pass, according to the chain rule, we can still break down the computation steps following the logic of the forward pass. For example, to obtain the gradient with respect to the attention parameters a, we first obtain the gradients with respect to the attention matrix A_att^{(ℓ−1)} by a series of dense matrix operations on X^{(ℓ−1)}, ∇_{X^{(ℓ)}} L_s and W. Then we obtain the gradient with respect to a based on the gradient with respect to A_att^{(ℓ−1)}. Even though the mathematical expression for the GAT gradient computation is more complicated, it is easy to see that all the operations involved are again covered by the three key operations listed above.

In summary, if we can efficiently parallelize the three operations listed above, we are automatically able to execute the full forward and backward propagation for the four GNNs. In the following, we use Â_s to represent the subgraph adjacency matrix used in each GNN layer. For different models, Â_s may be replaced by Ã_s or A_att.
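To make the three-kernel decomposition concrete, the following is a minimal pure-Python sketch (hypothetical helper names; dense nested lists stand in for the sparse CSR matrices, for brevity) of one GraphSAGE-style layer forward pass expressed via the propagation and weight-transformation kernels:

```python
# Minimal sketch of a GraphSAGE-style layer forward pass using the two
# forward kernels named in the text: sparse propagation (A_hat @ X) and
# dense weight transformation (X @ W), with the two paths concatenated.
# All names here are illustrative, not the thesis implementation.

def matmul(A, B):
    """Dense matrix product on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def sage_layer_forward(A_hat, X, W_self, W_neigh):
    # Kernel 1: feature propagation within the subgraph (A_hat @ X).
    prop = matmul(A_hat, X)
    # Kernel 2: dense weight transformation on both paths.
    self_path = matmul(X, W_self)
    neigh_path = matmul(prop, W_neigh)
    # Concatenate the two paths along the feature dimension (the "||" op).
    return [s + n for s, n in zip(self_path, neigh_path)]
```

The backward pass of Equations 6.4 to 6.6 reuses the same two kernels plus the transpose kernel, which is why parallelizing the three operations suffices for the full training loop.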
6.3 ParaGNN: Parallel GNN Training on Multi-Core CPUs

In this section, we describe our parallelization strategy for all the computation kernels: transpose of the sparse adjacency matrix (Section 6.3.1), parallelization strategies for the subgraph sampling algorithms, especially the frontier sampling algorithm (Sections 6.3.2, 6.3.3 and 6.3.4), parallelization of subgraph feature / gradient propagation (Section 6.3.5) and parallelization of dense weight transformation (Section 6.3.5). We conclude with the overall framework in Section 6.3.6.

6.3.1 Transpose of the sparse adjacency matrix

Since we assume the training graph and the sampled subgraphs are undirected, the transpose of the subgraph adjacency matrix Â_s^⊤ can be performed efficiently with low computation and space complexity. We first discuss the serial implementation before moving on to the parallel version.

Suppose the original adjacency matrix Â_s is represented in the CSR format, consisting of a size-(|V_s|+1) index pointer array (INDPTR), a size-|E_s| indices array (INDICES) and a size-|E_s| data array (DATA). For an undirected graph, if edge (u,v) ∈ E_s, then (v,u) ∈ E_s. Therefore, the index pointer and indices arrays of Â_s are identical to those of Â_s^⊤. To transpose Â_s thus means to generate a new data array by permuting the original DATA of the CSR of Â_s.

We propose to generate the permuted data array for Â_s^⊤ by a single pass over INDPTR and INDICES of Â_s. Our algorithm relies on a weak assumption on INDICES of Â_s: for any node v, we assume its neighbor IDs in the indices array, INDICES[INDPTR[v] : INDPTR[v+1]], are sorted in ascending order. The transpose operation is shown in Algorithm 6. The correctness of the algorithm can be reasoned as follows. Suppose a column v of the original adjacency matrix has n non-zeros denoted as [Â_s]_{u_i, v} = a_i, where 1 ≤ i ≤ n and the node IDs satisfy u_i < u_j for i < j.
When we traverse the CSR of Â_s (lines 4 to 5), we read a_i before a_j if the node IDs satisfy u_i < u_j. After the transpose, the neighbor data a_1, ..., a_n should be placed in a contiguous subarray DATA[INDPTR[v] : INDPTR[v+1]] of Â_s^⊤. In addition, a_i should be located to the left of a_j if u_i < u_j. Therefore, once we read a_i of the edge (u_i, v) from Â_s, we can simply append a_i to v's data subarray of the transposed CSR.

Algorithm 6 Transpose of the subgraph adjacency matrix
Input: Original adjacency matrix Â_s represented in the CSR format
Output: Transposed adjacency matrix Â_s^⊤ represented in the CSR format
1: INDPTR, INDICES, DATA ← CSR arrays of Â_s
2: DATATRANS ← array of size |E_s| initialized to INV
3: PTRDATA ← array of size |V_s| initialized to INDPTR[: |V_s|]
4: for v from 0 to |V_s| − 1 do
5:   for j from INDPTR[v] to INDPTR[v+1] do
6:     u ← INDICES[j]; a ← DATA[j];
7:     DATATRANS[PTRDATA[u]] ← a
8:     Increment PTRDATA[u] by 1  ▷ Record the next position to append
9:   end for
10: end for
11: return Transposed matrix Â_s^⊤ from INDPTR, INDICES, DATATRANS

The computation and space complexities of Algorithm 6 are O(|V_s| + |E_s|) and O(|E_s|) respectively, which are low compared with the other operations in training. For GraphSAINT training, we parallelize the adjacency matrix transpose at the subgraph level. During sampling, each of the p_inter processors samples one subgraph and permutes the corresponding DATA array by Algorithm 6. The information of the original and transposed subgraphs is all stored in the pool {G_i}, to be later consumed by the GNN layer propagation.

6.3.2 Data Structure for Parallel Subgraph Sampling

The frontier sampling (i.e., multi-dimensional random walk sampling, MRW) algorithm proceeds as follows. Throughout the sampling process, the sampler maintains a constant-size frontier set FS consisting of m vertices in G.
In each iteration, the sampler randomly pops out a node v in FS according to a degree-based probability distribution, and replaces v in FS with a randomly selected neighbor of v. The popped-out v is added to the node set V_s of G_s. The sampler repeats the above update process on the frontier set FS until the size of V_s reaches the desired budget n. Algorithm 7 shows the details. According to [72], a good empirical value of m is around 1000.

Algorithm 7 Frontier sampling algorithm
Input: Full graph G(V, E); Frontier size m; Node budget n
Output: Induced subgraph G_s(V_s, E_s)
1: FS ← Set of m nodes selected uniformly at random from V
2: V_s ← FS
3: for i = 0 to n − m − 1 do
4:   Select u ∈ FS with probability deg(u) / Σ_{v∈FS} deg(v)
5:   Select u′ from the neighbors {w | (u, w) ∈ E} uniformly at random
6:   FS ← (FS \ {u}) ∪ {u′}
7:   V_s ← V_s ∪ {u}
8: end for
9: G_s ← Subgraph of G induced by V_s
10: return G_s(V_s, E_s)

In our sequential implementation of training, we notice that about half of the time is spent in the sampling phase. This motivates us to parallelize the graph sampler. The challenges are as follows.

1. While sampling from a discrete distribution is a well-researched problem, we focus on fast parallel sampling from a dynamic probability distribution. Such dynamism is due to the addition/deletion of nodes in the frontier. Existing methods for fast sampling, such as aliasing [80] (which can output a sample in O(1) time after linear preprocessing), cannot be easily modified for our problem. It is non-trivial to select a node from the evolving FS with low complexity. A straightforward implementation by partitioning the total probability of 1 into m intervals would require O(m) work to update the intervals for each replacement in FS. Given m = 1000 as recommended by the authors of the original paper [72], the O(m · n) complexity to sample a single G_s is too expensive.

2. The sampling is inherently sequential, as the nodes in the frontier set should be popped out one at a time.
Otherwise, G_s may not preserve the characteristics of the original graph well enough.

To address the above challenges, we first propose a novel data structure that lowers the complexity of the frontier sampler and allows thread-safe parallelization (Section 6.3.3). We then propose a training scheduler that exploits parallelization within and across sampler instances.

Since nodes in the frontier set are replaced only one at a time, an efficient implementation should allow incremental updates of the probability distribution over the m nodes. To achieve this goal, we propose a "Dashboard" table to store the status of current and historical frontier nodes (a node becomes historical after it gets popped out of the frontier set). The next node to pop out is selected by probing the Dashboard using randomly generated indices. In the following, we formally describe the data structure and operations in the Dashboard-based sampler. The implementation involves two arrays:

• Dashboard DB ∈ R^{η·m·d}: A vector maintaining the status and sampling probabilities of the current and historical frontier nodes. If a node v is in the frontier, we "pin" a "tile" of v to the "dashboard". Here a tile is a small data structure storing the meta-data of v, and a pin is an address pointer to the tile. One entry of DB corresponds to one pin. A node v will have deg(v) pins allocated contiguously in DB, each pointing to the same tile belonging to v. If v is popped out of the frontier, we invalidate all its pins to NULL. The optimal value of the parameter η is explained later.

• Index array IA ∈ R^{2×(η·m·d+1)}: An auxiliary array to help clean up DB upon table overflow. The j-th column of IA has 2 slots: the first slot records the starting index of the DB pins corresponding to v, where v is the j-th node added into DB. The second slot is a flag, which is True when v is a current frontier node, and False when v is a historical one.
The symbols related to the design and analysis of the Dashboard data structure are summarized in Table 6.1.

Table 6.1: Summary of symbols related to the data structure and parallelization of the frontier sampling algorithm

Name — Meaning
Dashboard (DB) — Data structure consisting of "pins" and "tiles" to support fast dynamic update of the probability distribution
tile — Data structure storing meta-information of frontier nodes
pin — Pointer pointing to a tile; all pins belonging to the same node point to a shared tile
Index array (IA) — Data structure helping the cleanup of DB when it is full
m — Number of nodes in the frontier set
n — Total number of nodes to be sampled in the subgraph
d — Average degree of frontier nodes
η — Enlargement factor controlling the computation-storage tradeoff; larger η: larger DB and less frequent cleanup

Since the probability of popping out a frontier node is proportional to its degree, we allocate deg(v_i) contiguous entries in DB for each v_i currently in the frontier set. This way, the sampler only needs to probe DB uniformly at random to achieve line 4 of Algorithm 7. Clearly, DB should contain at least m·d entries, where d is the average degree of the frontier nodes. For the sake of incremental updates, we append the entries for the new node and invalidate the entries of the popped-out node, instead of changing the values in place and shifting the trailing entries. The invalidated entries become historical. To accommodate the append operation, we introduce an enlargement factor η (where η > 1) and set the length of DB to be η·m·d. As an approximation, we set d as the average degree of the training graph G. As the sampling proceeds, eventually all of the η·m·d entries in DB may be filled up by the information of current and historical frontier nodes. In this case, we free up the space occupied by historical nodes before resuming the sampler.
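The probe-and-append idea can be illustrated with a deliberately simplified toy sketch (hypothetical names; it omits the tiles, the IA array, the cleanup phase and all vectorization of the real design, and scans DB linearly where the real design jumps directly to a node's pin range via its tile):

```python
import random

# Toy illustration of the Dashboard idea: a frontier node v owns deg(v)
# entries in DB, so a uniform random probe into DB selects v with
# probability proportional to deg(v). Popped nodes leave None (invalid)
# entries behind instead of shifting the array.

def probe_pop(db, rng):
    """Probe DB until a valid entry is hit; invalidate and return that node."""
    while True:
        idx = rng.randrange(len(db))
        if db[idx] is not None:
            v = db[idx]
            # Real design: jump straight to v's contiguous pin range via its
            # tile. Toy version: scan the whole table to invalidate v's pins.
            for i in range(len(db)):
                if db[i] == v:
                    db[i] = None
            return v

def append_node(db, v, deg_v):
    """Append deg(v) entries for a node newly added to the frontier."""
    db.extend([v] * deg_v)
```

For example, with `db = [0, 0, 1, 1, 1]` (deg(0) = 2, deg(1) = 3), a probe selects node 1 with probability 3/5, matching line 4 of Algorithm 7 without ever recomputing the full distribution.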
Although cleanup of the Dashboard is expensive, due to the factor η, such a scenario does not happen frequently (see the complexity analysis in Section 6.3.3). Using the information in IA, the cleanup phase does not need to traverse all of the η·m·d entries in DB to locate the space to be freed. When DB is full, the entries in DB can correspond to at most η·m·d vertices. Thus, we safely set the capacity of IA to be η·m·d + 1. Slot 1 of the last entry of IA contains the current number of used DB entries.

6.3.3 Intra- and Inter-subgraph Parallelization

Since our subgraph-based GNN training requires independently sampling multiple subgraphs, we can sample different subgraphs on different processors in parallel. We can also further parallelize within each sampling instance by exploiting the parallelism in the probing, book-keeping and cleanup of DB. Algorithm 8 shows the details of Dashboard-based parallel frontier sampling, where all arrays are zero-based. Considering the main loop (lines 20 to 30), we analyze the complexity of the three functions in Algorithm 9. Denote COST_rand and COST_mem as the cost to generate one random number and to perform one memory access, respectively.

pardo_POP_FRONTIER: At any time during sampling, on average, the ratio of valid DB entries (those occupied by current frontier vertices) to the total number of DB entries is 1/η, so the probability of one probe falling on a valid entry equals 1/η. The expected number of rounds for p processors to generate at least one valid probe can be shown to be $1 / \left(1 - (1 - 1/\eta)^p\right)$, where one round refers to one repetition of lines 5 to 7 of Algorithm 9. After the selection of v_pop, deg(v_pop) slots need to be updated to the invalid value INV. Since this operation occurs (n − m) times, the pardo_POP_FRONTIER function incurs a cost of $(n-m) \left( \frac{1}{1-(1-1/\eta)^p} \cdot \mathrm{COST}_{\mathrm{rand}} + \frac{d}{p} \cdot \mathrm{COST}_{\mathrm{mem}} \right)$.
pardo_CLEANUP: Each time a cleanup of DB happens, we need one traversal of IA to calculate the cumulative sum of the indices (slot 1) masked by the status (slot 2), so as to obtain the new location of each valid entry in DB. On expectation, only η·m entries of IA are filled, so this step costs η·m. Afterwards, only the valid entries in DB are moved to the new, empty DB based on the accumulated shift amounts, which translates to m·d memory operations. The pardo_CLEANUP function is fully parallelized. The cleanup happens only when DB is full, i.e., $\frac{n-m}{(\eta-1)m}$ times throughout sampling. Thus, the cost is $\frac{n-m}{(\eta-1) \cdot m} \cdot \frac{m \cdot d}{p} \cdot \mathrm{COST}_{\mathrm{mem}}$. We ignore the cost of computing the cumulative sum as η·m ≪ m·d.

pardo_ADD_TO_FRONTIER: Adding a new frontier node v_new to DB requires appending deg(v_new) new entries to DB. This costs $(n-m) \cdot \frac{d}{p} \cdot \mathrm{COST}_{\mathrm{mem}}$.

Algorithm 8 Parallel Dashboard based frontier sampling
Input: Original graph G(V, E); Frontier size m; Budget n; Enlargement factor η; Number of processors p
Output: Induced subgraph G_s(V_s, E_s)
1: d ← |E| / |V|
2: DB ← Array of R^{1×(η·m·d)} with value NULL
3: IA ← Array of R^{2×(η·m·d+1)} with value INV  ▷ INValid
4: FS ← Set of m nodes selected uniformly at random from V
5: V_s ← FS
6: Convert the set FS to an indexable list of nodes
7: IA[0,0] ← 0; IA[1,0] ← True;
8: for i = 1 to m do  ▷ Initialize IA from FS
9:   IA[0,i] ← IA[0,i−1] + deg(FS[i−1])
10:   IA[1,i] ← True
11: end for
12: IA[1,m] ← False
13: for i = 0 to m−1 pardo  ▷ Initialize DB from FS
14:   pin ← Address of a tile of 4-tuple (FS[i], IA[0,i], IA[0,i+1], i)
15:   for k = IA[0,i] to IA[0,i+1]−1 do
16:     DB[k] ← pin
17:   end for
18: end for
19: cnt ← m; V_s ← ∅;
20: for i = m to n−1 do  ▷ Main loop of sampling
21:   v_pop, pin ← pardo_POP_FRONTIER(DB, p)
22:   v_new ← Node randomly sampled from v_pop's neighbors
23:   if deg(v_new) > η·m·d − IA[0, s] + 1 then
24:     DB ← pardo_CLEANUP(DB, IA, p)
25:     cnt ← m − 1
26:   end if
27:   pardo_ADD_TO_FRONTIER(v_new, pin, cnt, DB, IA, p)
28:   V_s ← V_s ∪ {v_new}
29:   cnt ←
cnt + 1
30: end for
31: G_s ← Subgraph of G induced by V_s
32: return G_s(V_s, E_s)

Algorithm 9 Functions in the Dashboard based sampler
1: function PARDO_POP_FRONTIER(DB, p)
2:   idx_pop ← INV  ▷ Shared variable
3:   for j = 0 to p−1 pardo
4:     while idx_pop == INV do  ▷ Probing DB
5:       idx_p ← Index generated uniformly at random
6:       if DB[idx_p] ≠ NULL then
7:         idx_pop ← idx_p
8:       end if
9:     end while
10:   end for
11:   pin_pop ← DB[idx_pop]
12:   v_pop, i_pinStart, i_pinEnd, i_IA ← data of the tile pointed to by pin_pop
13:   for j = 0 to p−1 pardo
14:     Update the DB entries to NULL from index i_pinStart to i_pinEnd
15:   end for
16:   IA[1, i_IA] ← False  ▷ Update IA
17:   return v_pop, pin_pop
18: end function
19: function PARDO_CLEANUP(DB, IA, p)
20:   DB_new ← New, empty dashboard
21:   k ← Cumulative sum of IA[0,:] masked by IA[1,:]
22:   for i = 0 to p−1 pardo
23:     Move entries from DB to DB_new by the offsets in k
24:   end for
25:   for i = 0 to p−1 pardo
26:     Re-index IA based on DB_new
27:   end for
28:   return DB_new
29: end function
30: function PARDO_ADD_TO_FRONTIER(v_new, pin, i, DB, IA, p)
31:   IA[0, i+1] ← IA[0, i] + deg(v_new); IA[1, i] ← True;
32:   Assign the values (v_new, IA[0, i], IA[0, i+1], i) to the tile pointed to by pin
33:   for j = 0 to p−1 pardo
34:     Update the DB entries to pin from index IA[0, i] to IA[0, i+1]
35:   end for
36: end function

Considering all the operations in pardo_POP_FRONTIER, pardo_CLEANUP and pardo_ADD_TO_FRONTIER, the overall cost to sample one subgraph on p processors equals:

$\left( \frac{1}{1-(1-1/\eta)^p} \cdot \mathrm{COST}_{\mathrm{rand}} + \left(2 + \frac{1}{\eta-1}\right) \frac{d}{p} \cdot \mathrm{COST}_{\mathrm{mem}} \right) \cdot (n-m)$   (6.7)

Assuming COST_mem = COST_rand, we have the following scalability bound:

Theorem 8. For any given ϵ > 0, Algorithm 7 guarantees a speedup of at least $\frac{p}{1+\epsilon}$, for all $p \le \epsilon d \left(2 + \frac{1}{\eta-1}\right) - \eta$.

Proof. Note that $\frac{1}{1-(1-1/\eta)^p} \le \frac{1}{1-\exp(-p/\eta)} \le \frac{\eta+p}{p}$. This follows from $\frac{1}{1-e^{-x}} = \frac{1}{1-\frac{1}{e^x}} \le \frac{1}{1-\frac{1}{1+x}} = \frac{x+1}{x}$. Further, since $p \le \epsilon d \cdot \left(2 + \frac{1}{\eta-1}\right) - \eta$, we have $\frac{\eta+p}{p} \le \frac{\epsilon d \cdot \left(2 + \frac{1}{\eta-1}\right)}{p}$.
Now, the speedup obtained by Algorithm 7 compared to a serial implementation (p = 1) is

$\frac{\left(\eta + d\left(\frac{1}{\eta-1}+2\right)\right)(n-m)}{\left(\frac{1}{1-(1-1/\eta)^p} + \frac{d}{p}\left(\frac{1}{\eta-1}+2\right)\right)(n-m)} \ge \frac{d\left(\frac{1}{\eta-1}+2\right)}{\frac{\epsilon d}{p}\left(\frac{1}{\eta-1}+2\right) + \frac{d}{p}\left(\frac{1}{\eta-1}+2\right)} \ge \frac{p}{1+\epsilon}.$

Setting ϵ = 0.5, for any value of η, Theorem 8 guarantees good scalability (speedup of p/1.5) for at least p = d − η processors. As we will see later in this section, we exploit the intra-sampler parallelism via AVX instructions, so we do not require p to scale to a large number in practice.

Note that the above performance analysis always holds as long as we know the expected node degree in the subgraphs. During the sampling process, when the sampler enters a well-connected local region of the original graph, cleanup may happen more frequently, since the frontier contains more high-degree nodes. However, the sampler would eventually replace those high-degree frontier nodes with low-degree ones, so that the overall subgraph degree is similar to that of the original graph. Also, note that for graphs with skewed degree distributions, it is possible that the next node to be added into the frontier set has a very high degree. Such a node may even require more slots than are available in total in DB. In this case, we clean up DB and allocate all the remaining slots to that node, without further expanding the size of DB. This only slightly alters the sampling distribution, since the higher the node degree is, the sooner the node would be popped out of the frontier.

While the scalability can be high for dense graphs, it is challenging to scale the sampler to a massive number of processors on sparse graphs: the feasible parallelism is bound by the graph degree. In summary, the parallel Dashboard based frontier sampling algorithm 1. achieves lower serial complexity by incremental updates of the probability distribution, and 2. scales well up to p = O(d) processors.
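The guarantee of Theorem 8 can be sanity-checked numerically. The sketch below uses illustrative parameter values (η = 2, d = 25, n = 8000, m = 1000, ϵ = 0.5) and the simplifying assumption COST_rand = COST_mem = 1 from the analysis; it evaluates the modeled cost of Equation 6.7 and checks both the probing bound used in the proof and the p/(1+ϵ) speedup guarantee:

```python
# Numeric sanity check of Theorem 8 under illustrative values
# (eta = 2, d = 25, n = 8000, m = 1000, eps = 0.5), with
# COST_rand = COST_mem = 1 as assumed in the analysis.

def cost(p, eta, d, n, m):
    """Modeled sampling cost of Equation 6.7 on p processors."""
    probe = 1.0 / (1.0 - (1.0 - 1.0 / eta) ** p)
    return (probe + (2.0 + 1.0 / (eta - 1.0)) * d / p) * (n - m)

eta, d, n, m, eps = 2.0, 25.0, 8000, 1000, 0.5
p_max = int(eps * d * (2.0 + 1.0 / (eta - 1.0)) - eta)  # largest p covered
for p in range(1, p_max + 1):
    # Probing bound used in the proof: 1/(1-(1-1/eta)^p) <= (eta + p)/p.
    assert 1.0 / (1.0 - (1.0 - 1.0 / eta) ** p) <= (eta + p) / p + 1e-9
    # Speedup guarantee: cost(1)/cost(p) >= p/(1+eps).
    speedup = cost(1, eta, d, n, m) / cost(p, eta, d, n, m)
    assert speedup >= p / (1.0 + eps) - 1e-9
```

With these values, the theorem covers up to p_max = 35 processors, consistent with the p = d − η rule of thumb for ϵ = 0.5.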
To further scale the graph sampling step, we exploit task parallelism across multiple sampler instances. Since the topology of the training graph G is fixed over the training iterations, sampling and GNN computation can proceed in an interleaved fashion, without any dependency constraints. The detailed scheduling algorithm for the sampling phase and the GNN computation phase is described in Section 6.3.6. The general idea is that, during training, we maintain a pool of sampled subgraphs {G_i}. When {G_i} is empty, the scheduler launches p_inter frontier samplers in parallel and fills the pool with subgraphs independently sampled from the full graph G. Each of the p_inter sampler instances runs on p_intra processing units. Thus, the scheduler exploits both intra- and inter-subgraph parallelism. In each training iteration, we remove a subgraph G_s from {G_i} and build a complete GNN upon G_s, following Algorithm 1.

When filling the pool of subgraphs, the total amount of parallelism p_intra · p_inter is fixed on the target platform. We should choose the values of p_intra and p_inter carefully based on the trade-off between the two levels of parallelism. Note that the operations on DB mostly involve a chunk of memory with contiguous addresses. This indicates that intra-subgraph parallelism can be well exploited at the instruction level using vector instructions (e.g., AVX). In addition, since most of the memory traffic going into DB is random, it is desirable to have DB stored in cache. As a coarse estimation, with m = 1000, η = 2 and d = 25, the memory consumption of one DB is 400KB.¹ This indicates that DB mostly fits into the private L2 cache (size 256KB) of modern shared-memory parallel machines. Therefore, during sampling, we bind one sampler to one processor core, and use AVX instructions to parallelize within a single sampler. For example, on a 40-core machine with AVX2, p_intra = 8 and p_inter = 40.

¹ Assume 8-byte addresses pointing to the tiles.
So the size of DB is 2 · 1000 · 25 · 8 bytes, and the size of the tiles is 1000 · 4 · 4 bytes.

Finally, note that the size of DB is determined by the number of frontier nodes, m, rather than the number of subgraph nodes, n. While it is true that we may need to increase n when the original training graph G grows, the size of m would not need to change. The authors of [72] interpret m as the dimensionality of the random walk: frontier sampling on G is equivalent to a single random walk on G raised to the m-th Cartesian power. With this understanding, the authors of [72] use a fixed value of m = 1000 in all experiments, ranging from small graphs to large ones.

6.3.4 Extension to Other Graph Sampling Algorithms

By Section 4.2.4, it is reasonable to use other graph sampling algorithms to perform minibatch GNN training. Here we evaluate two sampling algorithms: random edge sampling ("Edge") and unbiased random walk sampling ("RW"), as described in Section 4.2.4. The "Edge" sampler assigns the probability of picking an edge (u,v) as p_{u,v} ∝ 1/deg(u) + 1/deg(v), and can be understood as a special case of the "RW" algorithm with the walk length set to 1. The "Edge" and "RW" samplers are static algorithms, since the sampling probability does not change during the sampling process. Therefore, their computation complexity is much lower than that of frontier sampling. It is easy to show that both have computation complexity of O(|V_s| + |E_s|) (we can use the alias method [80] for "Edge" sampling to achieve such complexity). For the "Edge" and "RW" samplers, we thus only apply inter-sampler parallelism to achieve scalability. We can use exactly the same inter-sampler parallelization strategy discussed above. The only difference is that each subgraph in the pool {G_i} is now obtained by a serial "Edge" or "RW" sampler.

To further improve the training accuracy, we integrate the aggregator normalization and loss normalization techniques of GraphSAINT. Such normalization requires two minor modifications to our training algorithm:

• Pre-processing: Before training, we need to independently sample a given number of subgraphs to estimate the probability of each v ∈ V and e ∈ E being picked by the sampling algorithm. This pre-processing can be parallelized by the strategies discussed above.

• Applying the normalization coefficients: With aggregator normalization, the feature aggregation (i.e., Â_s · X_s) is based on a re-normalized adjacency matrix. With loss normalization, the loss L_s is computed as a weighted sum over the minibatch nodes. Therefore, the two normalization steps do not change the computation pattern.

6.3.5 Parallel Algorithm for Minibatch Training

We discuss in detail the parallel algorithm for feature propagation within each sampled subgraph. During training, each node in graph convolution layer ℓ pulls features from its neighbors along the layer edges. Essentially, the operation Â_s X_s^{(ℓ−1)} can be viewed as feature propagation within the subgraph G_s.

A similar problem, label propagation within graphs, has been extensively studied in the literature. State-of-the-art methods based on the vertex-centric [13] and edge-centric [74] paradigms perform node partitioning on graphs so that processors can work independently in parallel. The work in [106] also performs label partitioning along with graph partitioning when the label size is large. In our case, we borrow the above ideas to allow two-dimensional partitioning along the graph as well as the feature dimensions. However, we also realize that the aforementioned techniques may lead to sub-optimal performance in our GNN-based feature propagation, for two reasons:

• The propagated data from each node is a long feature vector (consisting of hundreds of elements) rather than a small scalar label.
• Our graph sizes are small after graph sampling, so partitioning the graph may not lead to a significant advantage.

In the following, we analyze the computation and communication costs of feature propagation after graph and feature partitioning. We temporarily ignore load-imbalance and partitioning overhead, and address them later on.

Suppose we partition the subgraph into Q_v disjoint node partitions {V_s^i | 0 ≤ i ≤ Q_v − 1}. Let the set of nodes that send features to V_s^i be V_src^i = {u | (u,v) ∈ E_s ∧ v ∈ V_s^i}. Note that V_s^i ⊆ V_src^i if we follow the design in [37, 21, 20] to add a self-connection to each node. We further partition the feature vector x_v ∈ R^f of each node v into Q_f equal parts {x_v^i | 0 ≤ i ≤ Q_f − 1}. Each processor is responsible for the propagation of X_s^{i,j} = {x_v^j | v ∈ V_src^i}, flowing from V_src^i into V_s^i (where 0 ≤ i ≤ Q_v − 1 and 0 ≤ j ≤ Q_f − 1). Define γ_v = |V_src^i| / |V_s| as a metric reflecting the graph partitioning quality. While γ_v depends on the partitioning algorithm, it is always bounded by 1/Q_v ≤ γ_v ≤ 1. Let n = |V_s| and f = |x_v|, so |V_s^i| = n/Q_v and |x_v^i| = f/Q_f.

In our performance model, we assume p processors operating in parallel. Each processor is associated with a private fast memory (i.e., cache), and the p processors share a slow memory (i.e., DRAM). Our objective in partitioning is to minimize the overall processing time in the parallel system. After partitioning, each processor owns Q_v · Q_f / p of the X_s^{i,j}, and propagates its X_s^{i,j} into V_s^i. Due to the irregularity of the graph edge connections, accesses into X_s^{i,j} are random. On the other hand, using the CSR format, the neighbor lists of the nodes in V_s^i can be streamed into the processor, without the need to stay in cache.
In summary, an optimal partitioning scheme should:

• Let each X_s^{i,j} fit into the fast memory;
• Utilize all of the available parallelism in the system;
• Minimize the total computation workload;
• Minimize the total slow-to-fast memory traffic;
• Balance the computation and communication load among the processors.

Each round of feature propagation incurs $\frac{n}{Q_v} \cdot d \cdot \frac{f}{Q_f}$ computation and $2 \cdot \frac{n}{Q_v} \cdot d + 8 \cdot n \cdot \gamma_v \cdot \frac{f}{Q_f}$ communication (in bytes).² The computation and communication over the Q_v · Q_f rounds are:

$g_{\mathrm{comp}}(Q_v, Q_f) = n \cdot d \cdot f$   (6.8)

$g_{\mathrm{comm}}(Q_v, Q_f) = 2 \cdot Q_f \cdot n \cdot d + 8 \cdot Q_v \cdot n \cdot f \cdot \gamma_v$   (6.9)

Note that g_comp(Q_v, Q_f) is not affected by the partitioning scheme. We thus formulate the following communication minimization problem:

minimize over Q_v, Q_f:  $g_{\mathrm{comm}}(Q_v, Q_f) = 2 Q_f \cdot n d + 8 Q_v \cdot n f \gamma_v$   (6.10)

subject to:  $Q_v Q_f \ge p$;  $\frac{8 n f \gamma_v}{Q_f} \le S_{\mathrm{cache}}$;  $Q_v, Q_f \in \mathbb{Z}^+$   (6.11)

Next, we prove that, without any graph partitioning, we can obtain a 2-approximation for this optimization problem on small subgraphs.

Theorem 9. $Q_v = 1$, $Q_f = \max\left\{p, \frac{8nf}{S_{\mathrm{cache}}}\right\}$ results in a 2-approximation of the communication minimization problem (Equation 6.10), for $p \le \frac{4f}{d}$ and $2nd \le S_{\mathrm{cache}}$, irrespective of the partitioning algorithm.

Proof. Note that since $Q_v, Q_f \ge 1$ and $\gamma_v \ge 1/Q_v$, for all $Q_v, Q_f$:

$g_{\mathrm{comm}}(Q_v, Q_f) \ge 2 Q_f n d + 8 Q_v n f \frac{1}{Q_v} \ge 8nf.$

Set $Q_v = 1$ and $Q_f = \max\left\{p, \frac{8nf}{S_{\mathrm{cache}}}\right\}$. Clearly, $\gamma_v = 1$.

Case 1, $p \ge \frac{8nf}{S_{\mathrm{cache}}}$: In this case, $Q_f = p \ge 8nf/S_{\mathrm{cache}}$, so both constraints are satisfied. And,

$g_{\mathrm{comm}}(1, p) = 2ndp + 8nf = 8nf\left(\frac{pd}{4f} + 1\right) \le 8nf \cdot (1+1) = 16nf$

due to $p \le 4f/d$.

Case 2, $p \le \frac{8nf}{S_{\mathrm{cache}}}$: In this case, $Q_f = 8nf/S_{\mathrm{cache}}$ is a feasible solution. And,

$g_{\mathrm{comm}}\left(1, \frac{8nf}{S_{\mathrm{cache}}}\right) = 2nd\,\frac{8nf}{S_{\mathrm{cache}}} + 8nf = 8nf\left(\frac{2nd}{S_{\mathrm{cache}}} + 1\right) \le 8nf \cdot (1+1) = 16nf$

due to $2nd \le S_{\mathrm{cache}}$.

² Given that the sampled graphs are small, we use INT16 to represent the node indices and DOUBLE to represent each feature value.
In both cases, the approximation ratio of our solution is ensured to be:

$\frac{g_{comm}\left(1, \max\{p, 8nf/S_{cache}\}\right)}{\min g_{comm}(Q_v, Q_f)} \le \frac{16nf}{8nf} = 2$

Note that this holds for $S_{cache} \ge 2nd$. So for a cache size of 256KB, the number of edges in the subgraph (i.e., $nd$) can be up to 128K. Such an upper bound on $|\mathcal{E}_s|$ can be met by the subgraphs in consideration. Also, since $f \gg d$, the condition $p \le 4f/d$ holds for most of the shared memory platforms in the market. Note that the above theorem is derived by simply lower bounding the ratio $\gamma_v$ for any (including the optimal) partitioning scheme. However, finding such an optimal partitioning is computationally infeasible even on small subgraphs, since there is an exponential number of possible partitionings. We thus do not provide experimental evaluation of this theorem.

Using typical values $n \le 8000$, $f = 512$ and $d = 15$, then for up to $p \le \frac{4f}{d} = 136$ cores³, the total slow-to-fast memory traffic under feature-only partitioning is less than 2 times the optimal. Recall the two properties (see the beginning of this section) that differentiate our case from traditional label propagation. Because the graph size $n$ is small enough, we can find a feasible $Q_f \in \mathbb{Z}^+$ satisfying the cache constraint $\frac{8nf}{Q_f} \le S_{cache}$. Because the value $f$ is large enough, we can find enough feature partitions such that $Q_f \ge p$. Algorithm 10 specifies our feature propagation.
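Before the algorithm listing, Theorem 9 can be sanity-checked numerically. The sketch below plugs the illustrative values above into Equation 6.9 (the processor count $p = 40$ is an assumption satisfying $p \le 4f/d$, not a value from the dissertation):

```python
import math

def g_comm(Qv, Qf, n, d, f, gamma_v):
    # Equation (6.9): slow-to-fast memory traffic in bytes over Qv*Qf
    # rounds (INT16 edge indices -> 2 bytes, DOUBLE features -> 8 bytes)
    return 2 * Qf * n * d + 8 * Qv * n * f * gamma_v

# illustrative values consistent with the text: n <= 8000, f = 512, d = 15
n, f, d = 8000, 512, 15
S_cache = 256 * 1024                # 256KB fast memory, in bytes
p = 40                              # p <= 4f/d = 136 holds

Qf = max(p, math.ceil(8 * n * f / S_cache))
no_graph_partition = g_comm(1, Qf, n, d, f, gamma_v=1.0)
lower_bound = 8 * n * f             # holds for every partitioning scheme

assert 2 * n * d <= S_cache                      # precondition of Theorem 9
assert no_graph_partition <= 2 * lower_bound     # the 2-approximation
```

With these values $Q_f = 125$, and the no-graph-partition traffic stays within a factor of 2 of the $8nf$ lower bound, as the theorem guarantees.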
Algorithm 10 Feature propagation within sampled graph
Input: Subgraph $\mathcal{G}_s(\mathcal{V}_s, \mathcal{E}_s)$ with adjacency matrix $\hat{A}_s$; node feature matrix $X_s^{(\ell-1)}$; cache size $S_{cache}$; number of processors $p$
Output: Feature matrix $X_s^{(\ell)}$
1: $n \leftarrow |\mathcal{V}_s|$; $f \leftarrow$ length of the feature vector of a node
2: $Q_f \leftarrow \max\left\{p, \frac{8nf}{S_{cache}}\right\}$; $f' \leftarrow f / Q_f$
3: Column-partition $X_s^{(\ell-1)}$ into $Q_f$ equal-size parts $\left[X_s^{(\ell-1)}\right]_{:, \, i f' : (i+1) f'}$
4: for $r = 0$ to $Q_f / p - 1$ do
5:   for $j = 0$ to $p - 1$ pardo
6:     $i \leftarrow r + j \cdot Q_f / p$
7:     $\left[X_s^{(\ell)}\right]_{:, \, i f' : (i+1) f'} \leftarrow \hat{A}_s \cdot \left[X_s^{(\ell-1)}\right]_{:, \, i f' : (i+1) f'}$
8:   end for
9: end for
10: return $X_s^{(\ell)}$

Lastly, feature-only partitioning leads to two more important benefits. Since the graph is not partitioned, load-balancing (with respect to both computation and communication) is optimal across processors. Also, our partitioning incurs almost zero pre-processing overhead, since we only need to extract contiguous columns to form sub-matrices. In summary, the feature propagation in our graph sampling-based training achieves:
1. Minimal computation;
2. Optimal load-balancing;
3. Zero pre-processing cost;
4. Low communication volume.

³ Note that $d$ here refers to the average degree of the sampled graph rather than the training graph. Thus, the $d$ value here is set lower than that in Section 6.3.3.

6.3.6 Complete Framework

Computation order of layer operations  Both the forward and backward propagation of GNN layers (Equations 2.8, 2.3, 2.9, 2.10 and 6.4) involve multiplying a chain of three matrices. Given a chain of matrix multiplications, it is known that different orders of computing the chain lead to different computation complexity. In general, we can use dynamic programming to obtain the optimal order corresponding to the lowest computation complexity [2]. Specifically, for our training problem, we have a chain of three matrices whose sizes and densities are known once the subgraphs are sampled.
Consider a sparse matrix $A \in \mathbb{R}^{n \times n}$ (with density $\delta$), and two dense matrices $W_1 \in \mathbb{R}^{n \times f_1}$ and $W_2 \in \mathbb{R}^{f_1 \times f_2}$. To calculate $A W_1 W_2$, there are two possible computation orders. Order 1, $(A W_1) W_2$, computes the partial result $P = A W_1$ first and then computes $P W_2$. This order of computation requires $\delta \cdot n^2 \cdot f_1 + n \cdot f_1 \cdot f_2$ multiply-accumulate (MAC) operations. Order 2, $A (W_1 W_2)$, computes the partial result $P = W_1 W_2$ first and then computes $A P$. This order requires $\delta \cdot n^2 \cdot f_2 + n \cdot f_1 \cdot f_2$ MAC operations. Therefore, if $f_1 < f_2$, order 1 is better; otherwise, we should use order 2. Similarly, suppose $W_3 \in \mathbb{R}^{n \times f_3}$ and our target is $(W_1)^T A W_3$. Then order 1, $(A W_1)^T W_3$, is better than order 2, $(W_1)^T (A W_3)$, if and only if $f_1 < f_3$.

Consider a GraphSAGE layer $\ell$. If $f^{(\ell-1)} < f^{(\ell)}$, we should use order 1 to calculate the forward propagation of Equation 2.8, order 1 to calculate $\nabla_{W_\circ^{(\ell)}} L_s$ of Equation 6.4, and order 2 to calculate $\nabla_{X_s^{(\ell-1)}} L_s$ of Equation 6.4. Note that the decision of the scheduler relies only on the dimensions of the matrices, and thus can be made during runtime at almost no cost. In addition, the partitioning strategy presented in Section 6.3.5 does not rely on any specific computation order. In summary, the light-weight scheduling algorithm reduces computation complexity without affecting scalability.

Scheduling the feature partitions  After partitioning the feature matrix (Section 6.3.5), the question remains how to schedule these partitions for further performance optimization. Ideally, since the operations on the partitions are completely independent, any scheduling would lead to identical performance. However, in reality, the partitions may undesirably interact with each other due to "false sharing" of data in private caches.
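The order-selection rule for the matrix chain (order 1 iff $f_1 < f_2$) reduces to comparing the two MAC counts. A minimal sketch, with illustrative dimensions rather than the dissertation's actual scheduler code:

```python
def macs_order1(n, f1, f2, delta):
    # (A @ W1) @ W2: sparse-dense product first, then dense-dense
    return delta * n * n * f1 + n * f1 * f2

def macs_order2(n, f1, f2, delta):
    # A @ (W1 @ W2): dense-dense product first, then sparse-dense
    return delta * n * n * f2 + n * f1 * f2

def pick_order(n, f1, f2, delta):
    """Runtime decision of the scheduler: order 1 iff f1 < f2."""
    if macs_order1(n, f1, f2, delta) <= macs_order2(n, f1, f2, delta):
        return 1
    return 2

# a subgraph with n = 8000 nodes and average degree d = 15 -> delta = d/n
n, delta = 8000, 15 / 8000
assert pick_order(n, f1=256, f2=512, delta=delta) == 1  # f1 < f2 -> order 1
assert pick_order(n, f1=512, f2=256, delta=delta) == 2  # f1 > f2 -> order 2
```

Since only matrix dimensions and density enter the comparison, the decision costs a handful of integer operations per layer, consistent with the "almost no cost" claim above.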
If the size of each feature partition is not divisible by the cacheline size, then in the private cache of the processor owning partition $i$, there may be one cacheline containing data of both partitions $i$ and $i+1$, and another cacheline containing data of both partitions $i-1$ and $i$. Therefore, if partitions $i-1$, $i$ and $i+1$ are computed concurrently, there may be undesirable data evictions to keep the three caches coherent. So the scheduler should try not to dispatch adjacent partitions at the same time, and we follow the processing order specified by lines 5 and 6 of Algorithm 10 to achieve this goal. When the number of processors is large or the number of feature partitions is small (i.e., line 4 of Algorithm 10 finishes in one iteration), it is inevitable to process adjacent partitions in parallel.

On the other hand, note that if the partition size is divisible by the cacheline size, we can avoid "false sharing" regardless of the scheduling. The partition size equals $w \cdot |\mathcal{V}_s| \cdot f / Q_f$, where $w$ specifies the word length. Suppose the cacheline size is $S_{cline}$. Then our goal is to make $|\mathcal{V}_s|$ divisible by $S_{cline} / w$. For example, if we use double-precision floating point numbers in training and the cacheline size is 128 bytes, then we can clip the number of subgraph nodes to be divisible by 16. Considering that $|\mathcal{V}_s|$ is on the order of $10^3$, such clipping has negligible effect on the subgraph connectivity and the training accuracy. The node clipping can be performed before the induction step (line 9 of Algorithm 7) by randomly dropping nodes in $\mathcal{V}_s$. Therefore, the clipping step incurs almost zero cost.

Overall scheduler  Algorithm 11 presents the overall training scheduler. As discussed in Section 6.3.3, multiple samplers can be launched in parallel without any data dependency. This is shown in lines 4 to 8. After the GNN is constructed, the forward and backward propagation operations are parallelized by the techniques presented in Section 6.3.5.
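The cacheline-alignment clipping described above amounts to dropping a few random nodes. A sketch, assuming 128-byte cachelines and 8-byte (double-precision) words as in the example:

```python
import random

def clip_nodes(nodes, cacheline_bytes=128, word_bytes=8, seed=0):
    """Randomly drop nodes so that |V_s| is divisible by S_cline / w.

    With 128-byte cachelines and 8-byte words, the node count is
    clipped to a multiple of 16. A sketch of the clipping step
    described in the text, not the dissertation's actual code.
    """
    multiple = cacheline_bytes // word_bytes   # 16 in the example
    keep = len(nodes) - len(nodes) % multiple
    return random.Random(seed).sample(list(nodes), keep)

clipped = clip_nodes(range(8190))
assert len(clipped) == 8176 and len(clipped) % 16 == 0
```

Since $|\mathcal{V}_s|$ is on the order of $10^3$ to $10^4$, at most 15 nodes are dropped per subgraph, which is why the effect on connectivity and accuracy is negligible.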
The scheduler performs two decisions based on the sampled subgraphs. The first decision (during runtime) is to perform node clipping to improve cache performance. The second decision (statically performed before the actual training) is to determine the order of matrix chain multiplication in both forward and backward propagation, to reduce computation complexity.

Algorithm 11 Runtime scheduling (with frontier sampling as an example)
Input: Training graph $\mathcal{G}(\mathcal{V}, \mathcal{E}, X)$; ground truth labels $\overline{Y}$; $L$-layer GNN model; sampler parameters $m, n, \eta$; parallelization parameters $p_{inter}, p_{intra}$
Output: Trained GNN weights
1: $\{\mathcal{G}_i\} \leftarrow \emptyset$   ▷ Set of unused subgraphs
2: while not terminate do   ▷ Iterate over minibatches
3:   if $\{\mathcal{G}_i\}$ is empty then
4:     for $p = 0$ to $p_{inter} - 1$ pardo
5:       $\mathcal{G}_s \leftarrow$ SAMPLE($\mathcal{G}(\mathcal{V}, \mathcal{E})$) with $p_{intra}$; clip nodes by cacheline size
6:       Transpose $\mathcal{G}_s$ by permuting the DATA array of the CSR
7:       Add $\mathcal{G}_s$ and its transposed DATA array to the pool $\{\mathcal{G}_i\}$
8:     end for
9:   end if
10:  $\mathcal{G}_s \leftarrow$ subgraph popped from $\{\mathcal{G}_i\}$
11:  Construct GNN on $\mathcal{G}_s$
12:  Determine the order of matrix chain multiplication
13:  Parallel forward and backward propagation on GNN
14: end while
15: return Trained GNN weights

Note that our scheduler is a general one, in the sense that the training can replace the frontier sampler with any other graph sampling algorithm in a plug-and-play fashion. The processing by the scheduler has negligible overhead.

6.4 GraphACT: Accelerating GNN Training on CPU-FPGA Heterogeneous Platforms

6.4.1 Hardware-friendly Minibatch Training Algorithm

For efficient training on large graphs, the first step is to reduce external memory communication via minibatches. Ideally, a minibatch should be sampled such that all data required for gradient calculation (e.g., $X^{(\ell)}$ of the minibatch nodes) fits in BRAM. Among the numerous algorithms [37, 18, 19, 44, 97, 96, 98], some return minibatches not suitable for hardware execution.
We categorize these algorithms and analyze the hardware cost of external memory accesses.

Minibatch by sampling GNN layers [37, 18, 19, 44]  Sampling algorithms in this category traverse GNN layers backward from the layer-$L$ outputs to the layer-1 inputs to select minibatch nodes. Assume $b$ nodes of layer $\ell+1$ are selected. The sampler then takes $\alpha \cdot b$ nodes of layer $\ell$ based on the inter-layer connections and/or the features of the $d \cdot b$ neighbors in layer $\ell$ (where $d$ is the average node degree). Specifically, the sampling of [37] does not depend on neighbor features, and $10 \le \alpha \le 50$ in general. [19] uses the neighbor features as "historical activations" and $\alpha = 2$. [44] computes the node probability from the neighbor features, and $\alpha = 1$ on average. [18] does not require neighbor features, and $\alpha = 1$. In summary, suppose we sample $b_0$ output nodes of layer $L$. Then the sampler reads the neighbor features of each layer (if required) to eventually return $\alpha^L \cdot b_0$ input nodes in layer 1. During training, we read features of the $\alpha^L \cdot b_0$ input nodes of layer 1, and then perform forward propagation to compute output features of the layer-$\ell$ sampled nodes ($1 \le \ell \le L$).

Minibatch by sampling the training graph [97, 96, 98]  The algorithm samples from $\mathcal{G}$ instead of the GNN layers. Given a graph sampling algorithm (e.g., multi-dimensional random walk [72]), [96, 98] return a subgraph $\mathcal{G}_s(\mathcal{V}_s, \mathcal{E}_s)$ induced from $\mathcal{G}(\mathcal{V}, \mathcal{E})$ (where $|\mathcal{V}_s| \ll |\mathcal{V}|$). The minibatch contains the same set of nodes $\mathcal{V}_s$ for all the GNN layers. In other words, for each minibatch, [96, 98] construct a complete $L$-layer GNN from $\mathcal{G}_s$, with the layer nodes defined by $\mathcal{V}_s$ and the layer connections defined by $\mathcal{E}_s$. Note that unlike [19, 44], the graph sampler of [96, 98] requires no information from node features.

Suppose we were to implement the above minibatch training algorithms on FPGA. Since the full $X^{(\ell)}$ is too large to fit on-chip, we need to read the following data from external memory: 1.
layer-1 input features of the minibatch nodes, and 2. layer-$\ell$ input features of the minibatch neighbor nodes (if required). For ease of analysis, assume the feature sizes of all layers are the same, and let the cost of transferring one feature vector be 1. Also, let the cost of aggregation and of computing UPDATE(·) for one node be 1. Ignore the computation cost of feature aggregation. Table 6.2 summarizes the ratio between on-chip computation cost and off-chip communication cost, where for the "Value" row we set $L = 2$, $d = 15$ and $\alpha$ as 25, 1, 2, 1 for [37], [18], [19], [44] respectively. Algorithms of low computation-communication ratio (i.e., [37, 44, 19]) impede the development of efficient hardware accelerators. For the remaining two algorithms, the complexity of the sampler of [96, 98] is much lower than that of [18], and the training accuracy of [96, 98] is higher. Thus, we select [96, 98] as the target training algorithm for acceleration.

Table 6.2: Computation-communication ratio of various training algorithms. Let $B = \sum_{\ell=0}^{L-1} \alpha^\ell b_0$ and $B' = L \cdot b_0$.

             [37]                 [18]                 [19]                       [44]                       [97, 96, 98]
Expression   $B/(\alpha^L b_0)$   $B/(\alpha^L b_0)$   $B/(\alpha^L b_0 + dB)$    $B/(\alpha^L b_0 + dB)$    $B'/b_0$
Value        0.04                 2                    0.06                       0.06                       2

6.4.2 FPGA Architecture

We use GraphSAGE [37] as an example layer architecture to illustrate our accelerator design techniques. Setting the subgraph-based training as the target training algorithm, we further apply the redundancy reduction techniques in Section 4.3 to reduce the computation workload of the FPGA. Therefore, the FPGA executes neighbor propagation directly on the transformed subgraph $\mathcal{G}_s^\#$.

Figure 6.1: Overview of the processing pipeline on FPGA ($L = 2$)

We first partition the workload between FPGA and CPU. We let the FPGA execute the computation-intensive operations, and leave the communication-intensive parts to the CPU.
A natural partition is: the CPU performs graph sampling, and then converts $\mathcal{G}_s$ to $\mathcal{G}_s^\#$; the FPGA performs the forward and backward pass defined by Equations 2.3 and 6.4, where feature propagation of each layer is based on $\mathcal{G}_s^\#$. Notice that under supervised learning, the last GNN layer is followed by a node classifier (see Equations 2.5 and 2.6). The softmax layer at the end of the classifier and the cross-entropy loss require computation of exponential and logarithmic functions. Since softmax and loss contribute a negligible portion of the total workload and their accurate calculation requires complicated hardware [65, 25], we assign their computation to the CPU. Section 6.4.3 describes the scheduling.

To improve overall training throughput, we need to 1. reduce the overhead in external memory access, and 2. increase the utilization of the on-chip resources. The first goal can be achieved by setting the minibatch parameters so that the size of $\mathcal{G}_s$ is appropriate for the BRAM capacity. Ideally, once receiving the initial node features (i.e., $X_s^{(0)}$ of $\mathcal{V}_s$) and the connection information (i.e., $D_s$, $A_s^\#$, $M_a$), the FPGA should propagate forward from the first GNN layer to the last classifier MLP layer without accessing external memory. Similarly, once receiving the gradient with respect to softmax, the FPGA should propagate backward without accessing external memory.
Thus, the CPU needs to communicate:
• To FPGA: $X_s^{(0)}$, $D_s$, $A_s^\#$, $M_a$ and $\frac{\partial L}{\partial X_{MLP}^{out}}$
• From FPGA: $X_{MLP}^{out}$

And on-chip BRAM needs to store:
• Node features: $\bigcup_{0 \le \ell \le L} \left\{X_s^{(\ell)}\right\}$
• Subgraph topological data: $D_s$, $A_s^\#$ and $M_a$
• Pre-computed vector sums for pairs in $M_a$
• Intermediate feature aggregation results
• Model weights: $\bigcup_{1 \le \ell \le L} \left\{W_\circ^{(\ell)}, W_\star^{(\ell)}\right\}$ and $W_{MLP}$
• Gradient information: $\frac{\partial L}{\partial X_s^{(\ell)}}$ and $\frac{\partial L}{\partial X_s^{(\ell-1)}}$, for any $2 \le \ell \le L$
• Optimizer-specific auxiliary data
• Other data (i.e., tile buffers for the systolic arrays)

Note that we only need to store gradients with respect to activations of consecutive layers. When calculating $\frac{\partial L}{\partial W_\circ^{(\ell)}}$ and $\frac{\partial L}{\partial W_\star^{(\ell)}}$, the data $\frac{\partial L}{\partial X_s^{(\ell)}}$ and $\frac{\partial L}{\partial X_s^{(\ell+1)}}$ are stored on chip. When calculating $\frac{\partial L}{\partial X_s^{(\ell-1)}}$, we overwrite $\frac{\partial L}{\partial X_s^{(\ell+1)}}$ with the newly generated $\frac{\partial L}{\partial X_s^{(\ell-1)}}$. Also note that the optimizer-specific auxiliary data depends on the optimizer used. For vanilla gradient descent, no auxiliary data is needed. For gradient descent with momentum [75], we need to store the gradients with respect to the model weights for the previous minibatch. For the Adam optimizer [49], we need to store more gradient-related data (e.g., first moment estimate, second moment estimate, etc.). No matter which optimizer is used, the size of the auxiliary data is comparable to the size of the model weights.

Regarding the processing pipeline on chip, there are two main computation modules performing feature aggregation and weight transformation. In Figure 6.1, the Feature Aggregation module consists of a 1D accumulator array to sum the node vectors. The Weight Transformation module consists of a 2D systolic array to compute the dense matrix product between $X$ and $W$. The two modules are reused for the computation of the $L$ GNN layers, and the Weight Transformation module is also reused for the MLP layer. The various BRAM buffers store the data listed above.
During the forward pass, when computing layer $\ell$, the Feature Aggregation and Weight Transformation modules read from the buffer of $X_s^{(\ell-1)}$ and write into $X_s^{(\ell)}$. In the backward pass, for layer $\ell$, the two computation modules read the buffer of $X_s^{(\ell-1)}$, and read / write the two gradient buffers in a ping-pong fashion. The $X_s^{(\ell)}$ and $\frac{\partial L}{\partial X_s^{(\ell)}}$ buffers use a feature-major data layout, so each cycle a buffer can provide the full feature vector of one node. For a BRAM block (36 bits × 1K) of the Xilinx Alveo board we use, we store feature values (32-bit floating point) of 1K different nodes.

Feature Aggregation module  This module performs feature aggregation in three steps. The first, pre-computation, step calculates the vector sums of the node pairs in $M_a$, and stores the results in the buffer for $X_M$. The second step computes $A_s^\# \cdot X_s^{(\ell)}$ by reading $X_M$ and $X_s^{(\ell)}$. The final step applies the scaling coefficients based on $D_s$. The aggregated features are written into a temporary buffer to be consumed by the Weight Transformation module. For the discussion below, we ignore step 3 since its complexity is low (i.e., $O(|\mathcal{V}_s|)$) compared with the other steps (i.e., $O(|\mathcal{E}_s|)$). Regarding steps 1 and 2, since the feature length is large, we explore parallelism along the feature dimension. In this case, a simple 1D accumulator array of size $P_{agg} = \max_{0 \le \ell \le L-1} f^{(\ell)}$ is sufficient. During pre-computation, pairs of node indices are read sequentially from $M_a$. The vector sum of each pair consumes two cycles. Similarly, during propagation of $A_s^\#$, indices in the neighbor lists are read sequentially. Note that even though $M_a$ contains pairs of multiple rounds (see Section 4.3.3), as long as the pre-computation on the earlier rounds finishes before that of later rounds, the memory controller for the accumulator array is straightforward.
For example, assume the original subgraph nodes are indexed continuously from 1 to $|\mathcal{V}_s|$, and the new nodes generated in the first round are indexed from $|\mathcal{V}_s|+1$ to $|\mathcal{V}_s|+|M_a|$. The matching in the second round, $M_a'$, can only contain indices from 1 to $|\mathcal{V}_s|+|M_a|$. If the second round starts after the first round has finished, all features required by $M_a'$ are ready. So the accumulator array can directly read $X_s^{(\ell)}$ and $X_M$, and continue filling $X_M$ with new features indexed from $|\mathcal{V}_s|+|M_a|+1$ to $|\mathcal{V}_s|+|M_a|+|M_a'|$.

Weight Transformation module  This module performs the weight transformation of either a GNN layer (i.e., UPDATE(·)) or an MLP layer. The main operation is multiplication of dense matrices. We use a 2D systolic array to execute the blocked matrix multiplication algorithm. Let the dimension of the systolic array be $P_{sys}$ (where $P_{sys} \ll f^{(\ell)}$). So the computation parallelism of this module is $P_{sys}^2$, and each cycle $2P_{sys}$ data are streamed into the array. Next we specify the data access pattern. Suppose we compute $X \cdot W$, where $X \in \mathbb{R}^{|\mathcal{V}_s| \times f}$ and $W \in \mathbb{R}^{f \times f'}$. The $X$ buffer is depth-wise partitioned into tiles of layout $P_{sys} \times f$, and the $W$ buffer is width-wise partitioned into tiles of layout $f \times P_{sys}$. There are $\frac{|\mathcal{V}_s|}{P_{sys}} \times \frac{f'}{P_{sys}}$ pairs of tiles. For each pair, the systolic array reads a diagonal (or anti-diagonal, if the matrix is transposed) of the two tiles per cycle. Thus, computing a pair of tiles consumes $f + P_{sys} - 1$ cycles, and computing the full matrix product takes $\frac{|\mathcal{V}_s|}{P_{sys}} \cdot \frac{f'}{P_{sys}} \cdot (f + P_{sys} - 1) \approx \frac{1}{P_{sys}^2} \cdot |\mathcal{V}_s| \cdot f \cdot f'$ cycles. To perform ReLU in the forward pass, we only need to add a comparator in each PE of the systolic array. When the results for a pair of tiles are ready, the PEs clip negative values to zero and append a status bit to indicate whether or not the clipping occurred. The mask function in the backward pass can thus be implemented in a simple way based on the True or False value of the status bit.
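The tile-level cycle count of the Weight Transformation module can be checked numerically. A sketch with illustrative values ($P_{sys} = 16$ is an assumption, not the dissertation's configuration):

```python
def systolic_cycles(n_rows, f, f_out, P_sys):
    """Exact tile-based cycle count for X (n_rows x f) @ W (f x f_out)
    on a P_sys x P_sys systolic array: each of the
    (n_rows/P_sys) * (f_out/P_sys) tile pairs takes f + P_sys - 1
    cycles. Assumes n_rows and f_out are multiples of P_sys, as
    arranged by node clipping."""
    n_tile_pairs = (n_rows // P_sys) * (f_out // P_sys)
    return n_tile_pairs * (f + P_sys - 1)

n_rows, f, f_out, P_sys = 8000, 256, 128, 16
exact = systolic_cycles(n_rows, f, f_out, P_sys)
approx = n_rows * f * f_out // P_sys**2   # the 1/P_sys^2 approximation
assert exact >= approx                    # exact count includes drain cycles
assert (exact - approx) / approx < 0.1    # approximation error ~ P_sys/f
```

The relative gap between the exact and approximate counts is roughly $P_{sys}/f$, which is why the $\frac{1}{P_{sys}^2}|\mathcal{V}_s| f f'$ approximation is accurate when $P_{sys} \ll f$.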
Interaction with the Feature Aggregation module  When operating on the neighbor weight $W_\star$, the Weight Transformation module reads the features from the buffer filled by the Feature Aggregation module. When operating on the self weight $W_\circ$, conflicts may occur, since both modules read from the $X$ buffer. We add a small tile buffer in the Weight Transformation module to completely avoid read conflicts. The tile buffer of size $P_{sys} \times f$ stores one tile of $X$, and is filled in $f$ cycles. Data in the tile buffer stay for $(f + P_{sys} - 1) \cdot \frac{f'}{P_{sys}}$ cycles to enumerate all tiles of $W$. Read conflicts can only happen during the filling of the tile buffer, and we simply stall the Feature Aggregation module in this short period of time. Since $f' \gg P_{sys}$, the pipeline stall has negligible impact on performance. The above analysis is based on the forward pass operation. In the backward pass, we swap the dimensions $|\mathcal{V}_s|$ and $f$, and the same conclusion holds.

Remark  To further increase parallelism, we can aggregate vectors of multiple neighbors in the same cycle. Then the accumulator array would be replaced by an accumulator tree, and the buffer would be further partitioned to reduce bank conflicts in BRAM. In this case, challenges such as load-balance would emerge. Fortunately, for most target FPGA devices, a parallelism of just $\max f^{(\ell)}$ is sufficient. We discuss this in detail in Section 6.4.4. Also, we evaluate the storage overhead due to $X_M$ in Section 4.4.6.

6.4.3 CPU-FPGA Co-processing

Scheduling between CPU and FPGA  The CPU samples the subgraph $\mathcal{G}_s$, transforms it to $\mathcal{G}_s^\#$, and calculates the softmax, loss and the corresponding gradients. These are shown in the light blue blocks of Figure 6.2. The FPGA handles the majority of the computation workload in the forward and backward pass, as shown in light green. Communication between CPU and FPGA, as specified in Section 6.4.2, is in dark blue. Notice that the subgraphs are independently sampled for each minibatch.
Thus, the CPU can prepare $\mathcal{G}_s$ and $\mathcal{G}_s^\#$ for the next minibatch while the FPGA is simultaneously training on the current one. This explains the overlap between steps 7 and 2. Parallelism can be exploited across multiple cores of the CPU: $C$ cores can process subgraphs for the next $C$ minibatches in parallel without any dependency.

Figure 6.2: Per-minibatch scheduling between CPU and FPGA (top), and between the two computation modules on FPGA (bottom). The sampling on CPU and the layer computation on FPGA are pipelined, and the execution of the two computation modules on FPGA is also pipelined. (Legend: steps 1-7 are transfer initial data; sample and pre-process (for next batch); propagate forward; transfer inputs to softmax; calculate softmax and loss; transfer softmax gradient; propagate backward. Module operations a-c are aggregate features; transform w.r.t. self-weight; transform w.r.t. neighbor-weight.)

Scheduling of FPGA modules  In Figure 6.2, we only show the scheduling of the GNN forward pass. Scheduling of the MLP and the backward pass can be designed analogously. Computation on the neighbor weights depends on the aggregated features, while computation on the self weights has no dependency. Thus, we overlap operations a and b. The avoidance of read conflicts between a and b is discussed in Section 6.4.2. During operation c, the Feature Aggregation module is idle. We analyze its impact on DSP utilization in Section 6.4.4.

6.4.4 Theoretical Performance Analysis

To simplify notation, we assume $f^{(\ell)} = f$ for all $0 \le \ell \le L$. So the weight matrices $W_\circ^{(\ell)} \in \mathbb{R}^{f \times \frac{1}{2}f}$ and $W_\star^{(\ell)} \in \mathbb{R}^{f \times \frac{1}{2}f}$, where the $\frac{1}{2}$ factor is due to the concatenation in the forward pass. For the classifier, we assume a single-layer MLP with $W_{MLP} \in \mathbb{R}^{f \times f}$. Regarding FPGA resources, we assume accumulators and multipliers are implemented by DSPs and have the same hardware cost.
We assume the target FPGA can implement $R_{DSP}$ accumulators / multipliers, store $R_{BRAM}$ words, and communicate $R_{BW}$ words with external memory per cycle. Here a word represents an element of the feature vectors or the weight matrices. Training performance depends on the parameters related to:
• Minibatch and GNN: $|\mathcal{V}_s|$, $d$ and $f$
• Redundancy reduction: $\gamma_{add}$, $\gamma_{read}$ and $\sum_{M \in M_a} |M|$
• FPGA architecture: $P_{agg}$, $P_{sys}$

We also use the fact that $d \ll f \ll |\mathcal{V}_s|$ to simplify the analysis.

Computation  Each GNN layer performs feature aggregation 3 times and products with the weights 6 times. Two of the feature aggregations operate on length-$f$ vectors, and the other one on length-$\frac{1}{2}f$ vectors. Since feature aggregation is parallelized along the feature dimension only, it takes exactly $\gamma_{read} \cdot |\mathcal{V}_s| \cdot d \cdot f / P_{agg}$ cycles to aggregate length-$f$ features. All six products with the weights have the same complexity, and each takes $\frac{1}{2} \cdot |\mathcal{V}_s| \cdot f^2 / P_{sys}^2$ cycles. By the schedule in Figure 6.2, to hide the feature aggregation time, we have:

$\gamma_{read} \cdot |\mathcal{V}_s| \cdot d \cdot f \cdot \frac{1}{P_{agg}} = \left(1 - \frac{2P_{sys}}{f}\right) \cdot \frac{1}{2} \cdot |\mathcal{V}_s| \cdot f^2 \cdot \frac{1}{P_{sys}^2}$   (6.12a)
$P_{agg} + 2P_{sys}^2 = R_{DSP}$   (6.12b)
$P_{agg} \le f$   (6.12c)

where the factor $1 - \frac{2P_{sys}}{f}$ is due to the pipeline stall analyzed in Section 6.4.2. Solving Equations 6.12a and 6.12b under the constraint 6.12c gives the architectural parameters $P_{agg}^*$, $P_{sys}^*$. Under reasonable values of $f$, $|\mathcal{V}_s|$, $d$, $\gamma_{read}$ and $R_{DSP}$, the solutions $P_{agg}^*$ and $P_{sys}^*$ always exist (see also the following analysis on "Load-balance"). The total number of FPGA cycles⁴ to complete one minibatch is:

$T_{batch} = (3L+3) \cdot |\mathcal{V}_s| \cdot f^2 \cdot \frac{1}{(P_{sys}^*)^2}$   (6.13)

⁴ We ignore the number of cycles to apply gradients to weights (an elementwise operation), since it is negligible compared to the number of cycles to calculate gradients.

Communication (on-chip storage)  All data listed in Section 6.4.2 have to fit on chip. The index data $A_s^\#$ and $M_a$ and the coefficients $D$ are negligible compared with the feature data (since $d \ll f$). The size of the buffer for $X_M$ is $f \cdot \sum_{M \in M_a} |M|$. The size of the buffer between the Feature Aggregation and Weight Transformation modules is $f \cdot |\mathcal{V}_s|$.
The size of the tile buffer in the Weight Transformation module is $P_{sys} \cdot |\mathcal{V}_s|$. The weights of each layer take $f^2$. Under gradient descent with momentum, the optimizer requires an additional $f^2$ storage per layer for the auxiliary data. Thus,

$(L+5) f |\mathcal{V}_s| + f \sum_{M \in M_a} |M| + P_{sys} |\mathcal{V}_s| + 2(L+1) f^2 \le R_{BRAM}$   (6.14)

By adjusting $\theta$ and the number of rounds (see Algorithm 4 and Section 4.3), we control $\sum |M|$ to meet a pre-defined budget (say, $|\mathcal{V}_s|$). Then, we tune the graph sampler so that $|\mathcal{V}_s|$ satisfies Inequality 6.14.

Communication (off-chip accesses)  Ignoring $D_s$, $A_s^\#$ and $M_a$, the CPU-FPGA data transfers include the initial features and the MLP output features and gradients. Thus $\beta$, the ratio between on-chip computation and off-chip communication, is lower bounded⁵ by $f$, indicating high reuse of on-chip data and low overhead in external memory access.

⁵ To simplify the expression, we only consider the computation of the systolic array on graph convolutional layers. Therefore, $f$ is just a lower bound on the ratio.

Load-balance  If feature aggregation is parallelized along the feature dimension only, the full FPGA pipeline is perfectly load-balanced, regardless of the graph connections. However, if we had to aggregate features of multiple nodes in parallel, BRAM access conflicts would cause load-imbalance of the module and degrade training performance. Load-imbalance is more likely to happen on FPGAs with more DSP resources. The threshold $\hat{R}_{DSP}$ at which a design becomes load-imbalanced can be derived by plugging $P_{agg} = f$ into Equations 6.12a and 6.12b. After simplifying the expression, we have:

$\hat{R}_{DSP} > \left(\frac{f}{d\,\gamma_{read}}\right)^2 \cdot \left(1 + d\,\gamma_{read} - \sqrt{1 + 2\,\gamma_{read}\,d}\right)$   (6.15)

The feature dimension is often set to 256, 512 or 1024 in the literature. The subgraph degree is often less than 20, so we set $d = 15$. The ratio $\gamma_{read}$ depends on $\mathcal{G}_s$, and we set it to 0.7.
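The threshold in Equation 6.15 can be evaluated directly. A sketch using $f = 256$ (the hidden dimension also used later in the experiments):

```python
import math

def r_dsp_threshold(f, d, gamma_read):
    # Equation (6.15): the DSP count above which feature aggregation
    # would need multi-node parallelism, risking BRAM-conflict
    # load-imbalance.
    return (f / (d * gamma_read)) ** 2 * (
        1 + d * gamma_read - math.sqrt(1 + 2 * gamma_read * d)
    )

# d = 15, gamma_read = 0.7 as in the text; f = 256
assert abs(r_dsp_threshold(256, 15, 0.7) - 4048) < 1
# without redundancy reduction (gamma_read = 1) the threshold drops
assert abs(r_dsp_threshold(256, 15, 1.0) - 3038) < 1
```

Note that the threshold grows quadratically in $f / (d\,\gamma_{read})$, so larger feature dimensions or stronger redundancy reduction (smaller $\gamma_{read}$) both push the load-imbalance regime further out of reach.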
With such parameter values, $\hat{R}_{DSP} > 4048$, meaning that there would have to be at least 4048 multipliers / accumulators on chip for our FPGA pipeline to become load-imbalanced. Even the largest FPGAs on the market (e.g., Xilinx UltraScale+, Intel Stratix-10) do not have such high DSP capacity (considering that single-precision floating point data are necessary to achieve high training accuracy). Note that $\gamma_{read}$ helps keep the threshold $\hat{R}_{DSP}$ high. If $\gamma_{read} = 1$, $\hat{R}_{DSP}$ is only 3038.

DSP utilization  Since load-balance is achieved and BRAM conflicts are eliminated, we can analytically derive the DSP utilization for any $\mathcal{G}_s$. From the timeline of Figure 6.2, the systolic array is idle when the CPU is communicating with the FPGA (ignoring the CPU computation time on softmax and loss). The accumulator array of the Feature Aggregation module is idle during the stall and during computation on the neighbor weights. Using the fact that the computation-communication ratio $\beta > f$, we derive the overall DSP utilization $\mu$ as follows:

$\mu > \mu' \cdot \frac{1}{1 + \mu' \cdot \frac{R_{DSP}}{f \cdot R_{BW}}}$   (6.16)

where $\mu' = \frac{1}{R_{DSP}} \left( 2 (P_{sys}^*)^2 + \frac{5}{12} P_{agg}^* \left(1 - \frac{2P_{sys}^*}{f}\right) \right)$, and $P_{agg}^*$, $P_{sys}^*$ satisfy Equations 6.12a and 6.12b. The $\frac{5}{12}$ factor is due to the aggregation of length-$\frac{1}{2}f$ features in the backward pass. If we can reduce the redundancy up to $\gamma_{read} > 1 - \frac{2P_{sys}^*}{f} > 1 - \frac{\sqrt{2R_{DSP}}}{f}$, we can then lower-bound $\mu'$ by $\frac{1}{1 + d/f}$. With the typical parameter values specified in the "Load-balance" paragraph above, $\mu' > 95\%$. For most FPGA platforms, $\frac{R_{DSP}}{f \cdot R_{BW}} \ll 1$, so the overall DSP utilization $\mu$ is close to 1. For example, on a Xilinx Alveo U200 board connected to the CPU via PCIe 3.0 x16, we have $\mu > 93\%$.

6.4.5 Discussion on GPU-Based Acceleration

This work developed several hardware-software optimizations for GNN training on CPU-FPGA heterogeneous platforms. We discuss these optimizations and their applicability to various platforms.
Design challenges  The key to accelerating GNN training is to address the challenges of memory access and load-balance. The solutions for these differ across platforms. Regarding memory access, the solution on FPGA must optimize both on-chip and off-chip accesses. For data in BRAMs, we need to increase their reuse so as to reduce off-chip communication. We also need to reduce bank access conflicts to reduce pipeline stalls. In GraphACT, we reduce off-chip communication by setting the subgraph size based on the BRAM capacity. We eliminate on-chip access conflicts by appropriately parallelizing the feature aggregation operation and buffering data tiles between computation modules. On the other hand, for CPU and GPU, the critical issue is to enhance data access locality, since the memory hierarchy is not explicitly exposed to the programmer. Considering that the full graph fits in the CPU main memory or GPU global memory (sizes of which range from $10^1$ to $10^3$ GB), the memory access challenge on CPU and GPU may be addressed by software node-reordering or by hyper-threading. Regarding load-balance, CPU and GPU can decouple the operations of feature propagation and weight transformation, and then adopt separate strategies to balance them. However, on FPGA, since the two operations are processed by a single pipeline, an integrated strategy needs to simultaneously balance the two very different operations (one on dense tensors and the other on sparse tensors). While memory access and load-balance are challenging to optimize, a carefully designed FPGA pipeline can achieve much higher efficiency than the CPU or GPU solutions. This is due to the ability of FPGA to customize computation modules and to control memory accesses in an explicit and fine-grained manner.

Remark on redundancy reduction  Although redundancy reduction is an algorithm-level optimization independent of the platform, it may not be directly applicable to CPU to improve performance.
Recall that redundancy reduction (Section 4.3) constructs a subgraph $\mathcal{G}_s^\#$ with fewer edges yet more nodes than $\mathcal{G}_s$. Thus, feature aggregation on $\mathcal{G}_s^\#$ may have worse data locality than on $\mathcal{G}_s$. On a CPU, even though the L3 cache can hold the node features of $\mathcal{G}_s^\#$, the overall feature aggregation performance may not benefit from the reduced computation load, due to the potential increase in L1- or L2-cache misses. On the other hand, data access locality of GraphACT does not affect FPGA performance, as long as the data of one minibatch completely fits in BRAM. Under our architecture and data layout (Section 6.4.2), BRAM will always provide the requested node feature of $\mathcal{G}_s^\#$ to the Feature Aggregation module within one cycle.

6.5 Experiments

6.5.1 Datasets, Platforms and Implementations

Platform for CPU-based parallel training algorithm  We run experiments on a dual 20-core Intel® Xeon E5-2698 v4 @ 2.2GHz machine with 512GB of DDR4 RAM. For the Python implementation, we use Python 3.6.5 with TensorFlow 1.10.0. For the C++ implementation, compilation is via Intel® ICC (-O3 flag). ICC (version 19.0.5.281), MKL (version 2019 Update 5) and OMP are included in Intel Parallel Studio XE 2018 Update 3.

Platform for FPGA-based accelerator  We evaluate on a Xilinx Alveo U200 board hosted by a 40-core Xeon server (E5-2698 v4 @ 2.2GHz, hyper-threaded). CPU-FPGA communication is via PCIe 3.0 x16 (peak bandwidth: 15.8 GB/s). The UltraScale+ XCU200 FPGA on the Alveo board has 5867 DSP slices, 1766 36Kb block RAMs and 800 288Kb Ultra RAMs⁶. For Float32, on-chip RAM can store 8166K words, and the DSPs can implement 1173 accumulators or 1955 multipliers. We use Vivado 2018.3 for synthesis.

Datasets for CPU-based parallel training algorithm  We conduct experiments on 4 large-scale real-world graphs as well as on synthetic graphs. Details of the datasets are as follows:

• PPI [76]: A protein-protein interaction graph. A node represents a protein and edges represent protein interactions.
• Reddit [76]: A post-post graph. A node represents a post. An edge exists between two posts if the same user has commented on both posts.
• Yelp [90, 97]: A social network graph. A node is a user. An edge represents friendship. Node attributes are user comments converted from text using Word2Vec [67].
• Amazon [97]: An item-item graph. A node is a product sold by Amazon. An edge is present if two items are bought by the same customer. Node attributes are converted from bag-of-words of text item descriptions using singular value decomposition (SVD).
• Synthetic graphs: Graphs generated by the Kronecker generator [55]. We follow the setup in [55] and set the initiator matrices to be proportional to the 2-by-2 matrix [[0.9, 0.5], [0.5, 0.1]]. We generate two sets of Kronecker graphs. The first set consists of graphs with a fixed average degree of 16 and the number of nodes equal to 2^20, 2^21, 2^22, 2^23, 2^24 and 2^25. The second set consists of graphs with 2^20 nodes and average degree equal to 8, 16, 32, 64, 128, 256 and 512.

^6 We refer to block RAM and Ultra RAM by the general term "BRAM".

The PPI and Reddit datasets are standard benchmarks used in [50, 37, 19, 44, 18]. The larger scale graphs, Yelp and Amazon, are processed and evaluated in [96, 97]. We use the set of four real-world graphs for a thorough evaluation on accuracy, efficiency and scalability. Table 6.3 shows the specification of the graphs. We use a "fixed-partition" split, and the "Train/Val/Test" column shows the percentage of nodes in the training, validation and test sets. "Classes" shows the total number of node classes (i.e., the number of columns of Y and Ŷ in Equation 2.7). For synthetic graphs, we can only generate the graph topology. The node attributes and the class memberships are filled with random numbers.

Table 6.3: Dataset statistics for the graphs evaluated by our CPU parallel algorithm and CPU-FPGA heterogeneous accelerator.
We generate synthetic graphs of various sizes to better understand the scalability.

Dataset     Nodes      Edges        Attributes  Classes   Train/Val/Test
PPI         14,755     225,270      50          121 (M)   0.66/0.12/0.22
Reddit      232,965    11,606,919   602         41 (S)    0.66/0.10/0.24
Yelp        716,847    6,977,410    300         100 (M)   0.75/0.15/0.10
Amazon      1,598,960  132,169,734  200         107 (M)   0.80/0.05/0.15
Synthetic   2^20-2^25  2^23-2^30    50          2 (S)     0.50/0.25/0.25
* The (M) mark stands for multi-class classification, while (S) stands for single-class.

Datasets for FPGA-based accelerator We use PPI, Reddit and Yelp as specified in Table 6.3 to evaluate the FPGA training speed and accuracy.

Implementation of CPU-based parallel training algorithm We open-source our implementation in C++ (with OpenMP)^7. We use ICC (version 19.0.5.281) for compilation. The GEMM operation for applying the weight matrices of each layer is implemented by a direct routine call to the Intel MKL library (version 2019 Update 5).

Implementation of CPU-FPGA heterogeneous accelerator For all the datasets, following [37, 19, 44, 18, 97, 96, 98], the GNN has two graph convolutional layers (L = 2) and one MLP layer in the classifier. The hidden dimension is f(ℓ) = 256, where ℓ = 1, 2. We compare with the baseline [96], since GraphACT uses the same minibatch algorithm as [96]. We run [96] on both the CPU and GPU platforms. Table 6.4 shows the specification of the hardware. Parallelization of [96] on CPU is via the Intel MKL library. Thus, we add the --mkl flag to run [96] on CPU. Parallelization of [96] on GPU is directly via TensorFlow. The implementation of GraphACT is open-sourced^8.

6.5.2 Scalability of Parallel Graph Sampling

We evaluate the effect of inter-sampler parallelism for the frontier, random walk and edge sampling algorithms, and intra-sampler parallelism for the frontier sampling algorithm. For the frontier sampling algorithm, the AVX2 instructions supported by our target platform translate to a maximum intra-subgraph parallelism of 8 (p_intra = 8).
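To make the two levels of parallelism concrete, the sketch below launches p_inter independent sampler instances over a shared, read-only graph. Everything here is a simplified stand-in: a uniform node sampler replaces our actual frontier sampler, and Python threads replace the OpenMP threads of the C++ implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sample_subgraph(adj, budget, seed):
    """Toy stand-in for one independent sampler instance.

    Uniform node sampling replaces the actual frontier sampler; each
    instance owns its own RNG, so samplers are fully independent.
    """
    rng = random.Random(seed)
    nodes = rng.sample(range(len(adj)), budget)
    keep = set(nodes)
    # Induced edges among the sampled nodes.
    edges = [(u, v) for u in nodes for v in adj[u] if v in keep]
    return nodes, edges

# A small random graph, shared read-only by all sampler instances.
random.seed(0)
N = 1000
adj = [[random.randrange(N) for _ in range(16)] for _ in range(N)]

p_inter = 8  # number of independent sampler instances
with ThreadPoolExecutor(max_workers=p_inter) as pool:
    futures = [pool.submit(sample_subgraph, adj, 100, s) for s in range(p_inter)]
    subgraphs = [f.result() for f in futures]

print(len(subgraphs))                                # one subgraph per sampler
print(all(len(n) == 100 for n, _ in subgraphs))
```

Because each instance touches only its own RNG and writes only its own output, the instances share nothing but the read-only graph; this independence is what makes inter-subgraph parallelism scale almost linearly.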
A total of 40 Xeon cores makes 1 ≤ p_inter ≤ 40. Figure 6.3.A shows the effect of p_inter when p_intra = 8 (i.e., we launch 1 ≤ p_inter ≤ 40 independent samplers, with AVX enabled within each sampler). Sampling is highly scalable with inter-subgraph parallelism. We observe that scaling performance degrades when going from 20 to 40 cores, due to the mixed effect of lower boost frequency and limited memory bandwidth.

^7 Code available at: https://github.com/GraphSAINT/GraphSAINT
^8 GraphACT code: https://github.com/GraphSAINT/GraphACT

With all 20 cores in one chip executing AVX2 instructions, the Xeon CPU can only boost to 2.2GHz, in contrast with 3.4GHz for executing AVX instructions on a single core. Figure 6.3.B shows the effect of p_intra under various p_inter. The bars show the speedup of using AVX instructions compared with not using them. We achieve around 4× speedup on average. The scaling on p_intra is data dependent. Depending on the training graph degree distribution, there may be a significant portion of nodes with fewer than 8 neighbors, resulting in under-utilization of the AVX2 instructions. We can understand such under-utilization of instruction-level parallelism as a result of load-imbalance due to node degree variation. Such load-imbalance explains the discrepancy from the theoretical modeling of the sampling scalability (Theorem 8). Figures 6.3.C and 6.3.D show the effect of p_inter for the random walk and edge sampling algorithms. Both sampling algorithms scale by more than 20× when p_inter = 40. As we do not use AVX instructions for these two samplers (i.e., p_intra = 1, and the CPU frequency is unaffected), the scalability from 20 cores to 40 cores is better than that of the frontier sampler.

6.5.3 Scalability of Parallel Training

Scalability of feature aggregation Figure 6.4 shows the scalability of the feature aggregation step using our partitioning strategy.
We achieve good scalability (around 20× speedup on 40 cores) for all datasets under various feature sizes, thanks to our caching strategy and the optimal load-balance discussed in Section 6.3.6. According to that analysis, the scalability of feature aggregation should not be significantly affected by the subgraph topological characteristics. Accordingly, we observe from plots B and E of Figure 6.4 that the curves for the four datasets look similar to each other.

Figure 6.3: Execution time speedup for the parallel subgraph sampling algorithm under both inter- & intra-subgraph parallelism. (Panels: A. frontier sampling speedup with p_intra = 8; B. frontier sampling AVX speedup; C. random walk sampling speedup; D. edge sampling speedup; datasets: PPI, Reddit, Yelp, Amazon.)

Scalability of weight transformation The weight transformation operation is implemented by the cblas_dgemm routine of the Intel® MKL [1] library. All optimizations on the dense matrix multiplication are internally implemented in the library. Plots C and F of Figure 6.4 show the scalability result. On 40 cores, an average speedup of 13× is achieved. We speculate that the overhead of MKL's internal thread and buffer management is the bottleneck to further scaling.

Scalability of the overall training For the developed GNN training algorithm, Figure 6.4 shows the parallel training speedup relative to sequential execution. The execution time includes every training step specified by lines 2 to 13 of Algorithm 11: 1. frontier graph sampling (with AVX enabled) and subgraph transpose; 2. feature aggregation in the forward propagation and its corresponding operation in the backward propagation; 3. weight transformation in the forward propagation and its corresponding operation in the backward propagation; and 4. all the other operations (e.g., ReLU activation, sigmoid function, etc.) in the forward and backward propagation.

Figure 6.4: Evaluation on scalability with respect to the number of CPU cores (hidden feature dimensions: 512 and 1024). (Panels A-C: overall training, feature aggregation and weight transformation at hidden dimension 512; panels D-F: the same at hidden dimension 1024; panels G-H: execution time breakdown into sampling, weight transformation, feature aggregation and other operations.)

Figure 6.5: Execution time per training iteration for synthetic graphs of various sizes and average degrees. Left: the total graph size (|V|) grows from 1 million to 33 million with a fixed average degree of 16; Right: the average degree grows from 8 to 512 with a fixed total graph size (|V|) of 2^20.

As before, we evaluate scaling on a 2-layer model executing the GraphSAGE layer operation, with small and large hidden dimensions, f(1) = f(2) = 512 and 1024, respectively. As shown by plots A and D of Figure 6.4, the overall training is highly scalable, consistently achieving around 15× speedup on 40 cores for all datasets. The performance breakdown in plots G and H of Figure 6.4 suggests that sampling time corresponds to only a small portion of the total training time. This is due to 1.
the low serial complexity of our Dashboard-based implementation, and 2. the highly scalable implementation using intra- and inter-sampler parallelism. In addition, feature aggregation for Amazon corresponds to a significantly higher portion of the total time compared with the other datasets. This is due to the higher degree of the subgraphs sampled from Amazon. The main bottleneck in scaling is the weight transformation step performing dense matrix multiplication. The overall performance scaling is also data dependent. For denser graphs such as Amazon, the scaling of the feature aggregation step dominates the overall scalability. For the other, sparser graphs, the weight transformation step has a higher impact on the training. Lastly, our parallel algorithm scales well under a wide range of configurations: whether the hidden dimension is small or large; whether the training graph is small or large, sparse or dense.

6.5.4 Evaluation on Synthetic Graphs

We generate synthetic graphs of much larger sizes to perform a more thorough scalability evaluation. In the left plot of Figure 6.5, the sizes of the synthetic graphs grow from 1 million nodes to around 33 million nodes. All synthetic graphs have an average degree of 16. We run a 2-layer GNN with a hidden dimension of 512 on the subgraphs of the synthetic graphs. The vertical axis denotes the time to compute one iteration (i.e., the time to perform forward and backward propagation on one minibatch subgraph). The subgraphs are all sampled by the frontier sampling algorithm with the same sampling parameters of n = 8000 and m = 1000. With the increase of the training graph size, the iteration time converges to a constant value of around 100 ms. This indicates that our parallel training is highly scalable with respect to the graph size. When increasing the number of graph nodes, we keep the average degree unchanged. Therefore, the degree of the sampled subgraphs also stays unchanged (due to the property of frontier sampling).
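The constant iteration time can be illustrated with a toy version of this argument (uniform node sampling stands in for the frontier sampler; the graph sizes and budget below are arbitrary): the sampled subgraph always contains exactly the budgeted number of nodes, no matter how large the full graph grows.

```python
import random

def sample_fixed_budget(adj, n_budget, rng):
    """Sample an induced subgraph with a fixed node budget n_budget.

    Uniform sampling is a simplified stand-in for the frontier sampler;
    the property illustrated is that the subgraph size is set by the
    budget, not by the size of the full graph.
    """
    nodes = rng.sample(range(len(adj)), n_budget)
    keep = set(nodes)
    edges = [(u, v) for u in nodes for v in adj[u] if v in keep]
    return nodes, edges

rng = random.Random(0)
budget = 500
for num_nodes in (2**10, 2**12, 2**14):  # growing full graphs, degree fixed at 16
    adj = [[rng.randrange(num_nodes) for _ in range(16)] for _ in range(num_nodes)]
    nodes, edges = sample_fixed_budget(adj, budget, rng)
    # Work per minibatch is O(|nodes| + |edges|), both bounded by the budget.
    print(num_nodes, len(nodes))
```

Since the per-iteration work is bounded by the budget, the cost of one gradient update is (approximately) independent of the full graph size, matching the flat curve in the left plot of Figure 6.5.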
Since we set the node budget n to be fixed, the subgraph size (in terms of the number of nodes and edges) in each iteration is approximately independent of the total number of nodes in the training graph. So the cost to perform one step of gradient update does not depend on the training graph size (for a given training graph degree). In the right plot of Figure 6.5, we fix the graph size at |V| = 2^20 and increase the average degree. Under the same sampling algorithm, if the original graph becomes denser, the sampled subgraphs are more likely to be denser as well. The computation complexity of feature aggregation is proportional to the subgraph degree. We observe that the iteration time grows approximately linearly with the average degree of the original training graph. This indicates that our parallel training algorithm can handle both sparse and dense graphs very well.

6.5.5 Speedup of Hardware Accelerator

Table 6.4: Comparison with state-of-the-art with respect to hardware resources and training time (until convergence)

                                      | Xilinx Alveo U200                 | Xeon E5-2698 v4 server    | NVIDIA Tesla P100
                                      | PPI        Reddit     Yelp        | PPI     Reddit  Yelp      | PPI     Reddit  Yelp
Data type                             | Float32    Float32    Float32     | Float32 Float32 Float32   | Float32 Float32 Float32
Frequency (GHz)                       | 0.2        0.2        0.2         | 2.2     2.2     2.2       | 1.2     1.2     1.2
DSP slices / CPU cores / CUDA cores   | 5632 (96%) 5632 (96%) 5632 (96%)  | 40      40      40        | 3584    3584    3584
BRAM / L3 / HBM2 (MB)                 | 34.6 (96%) 32.8 (91%) 27.3 (75%)  | 102     102     102       | 16280   16280   16280
Off-chip bandwidth (GB/sec)           | 15.8       15.8       15.8        | 136.6   136.6   136.6     | 15.8    15.8    15.8
Total convergence time (sec)          | 9.6        7.6        23.4        | 151.4   95.5    359.4     | 10.6    11.4    30.4

Table 6.4 shows the training speed comparison with the state-of-the-art [96]. All implementations use single precision floating point for weights and features. In our design, the DSP and BRAM resources are heavily utilized. We set P_sys = 24 and P_arr = 128 for all datasets.
For the multi-core implementation, "L3" shows the aggregated L3-cache size and "Off-chip bandwidth" shows the peak CPU-main memory data transfer speed. For the GPU implementation, "Off-chip bandwidth" is the same as in our implementation, since the GPU is connected via PCIe 3.0 x16. "Total convergence time" includes the time for graph sampling, pre-processing (for redundancy reduction), GNN forward/backward propagation and data transfer from/to main memory. It excludes the time for initial data loading from disk and for TensorFlow initialization (allocating device resources and building computation graphs).

Compared with the CPU baseline, we achieve 12× to 15× speedup. Apart from the overhead of the Python language, the inefficiency of the CPU baseline is mainly due to the subgraph feature aggregation operation. We observe that although feature aggregation requires less than 10% of the multiplications/additions of weight transformation, the CPU spends about an equal amount of time on the two operations (see also Figure 3.D of [96]). Compared with the GPU baseline, our design converges slightly faster (1.1× to 1.5×). The theoretical peak performance of the GPU (9.3 TFLOPS for Float32 [3]) is much higher than that of the FPGA (1.8 TFLOPS for Float32 [4]^9). The inefficiency of the GPU baseline is mainly due to the sub-optimal TensorFlow implementation of sparse matrix multiplication. Note that the CPU in our design only executes lightweight operations (Sections 4.3 and 6.4.3). Thus, the training time improvement mainly comes from the highly optimized FPGA architecture. Although the baseline training time may be further improved by re-implementation in C++/CUDA, the inefficiency of CPUs and GPUs on sparse feature aggregation may not be easily eliminated.

^9 Each Float32 multiplier consumes 3 DSPs. The max DSP frequency is 775MHz.
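The speedups quoted above follow directly from the "Total convergence time" row of Table 6.4:

```python
# Total convergence times (seconds) from Table 6.4.
fpga = {"PPI": 9.6, "Reddit": 7.6, "Yelp": 23.4}
cpu  = {"PPI": 151.4, "Reddit": 95.5, "Yelp": 359.4}
gpu  = {"PPI": 10.6, "Reddit": 11.4, "Yelp": 30.4}

# Speedup of the FPGA design over each baseline, per dataset.
cpu_speedup = {d: cpu[d] / fpga[d] for d in fpga}
gpu_speedup = {d: gpu[d] / fpga[d] for d in fpga}

print({d: round(s, 1) for d, s in cpu_speedup.items()})  # CPU-baseline speedups
print({d: round(s, 1) for d, s in gpu_speedup.items()})  # GPU-baseline speedups
```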
Chapter 7

Conclusion and Future Work

7.1 Summary of Contributions

In this dissertation, we improved the scalability of Graph Neural Networks with respect to model depth and graph size, achieving significantly better accuracy, resource consumption and execution speed compared with the state-of-the-art. We identified two fundamental challenges of GNNs in terms of computation redundancy and information flow, and followed an algorithm-model-architecture co-optimization approach to address both of them. All completed works in this dissertation followed a novel subgraph-centric perspective on GNN computation, where a deep GNN is built on top of small subgraphs of the large full graph. In contrast to the existing graph-centric, node-centric and layerwise node-centric perspectives, our subgraph-centric execution provides a clean way to address the computation redundancy due to "neighborhood explosion," while considering the constraints of limited hardware memory. In addition, by moving the sampling process from within the GNN layers to within the graph, the subgraph construction can be seen as an additional data processing step, providing new opportunities to improve model expressivity.

Following the subgraph-centric computation, we further categorized the subgraph construction process into graph-centric and node-centric, where the former led to our minibatch training framework, GraphSAINT, and the latter led to our new class of GNNs, SHADOW-GNN. Parallelization and acceleration techniques were also developed for the CPU and CPU-FPGA platforms, where the small sizes of the subgraphs enabled practical designs achieving near-optimal load-balance, communication cost and hardware utilization.

Our contributions on efficient training algorithms include:
Our contributions on efficient training algorithms include: 134 • a general minibatch training framework, GraphSAINT, reducing the per-iteration computa- tion complexity from exponential to linear with respect to model depth; • theoretical analysis of minibatch bias and variance, leading to light-weight normalization techniques and subgraph sampling algorithms; • various theoretically inspired graph sampling algorithms, showing the applicability of extendingGraphSAINT to various problem domains; • integration ofGraphSAINT into standard benchmarks (e.g., OGB [41]) and deep graph learn- ing libraries (e.g., PyG [29] and DGL [81]), demonstrating our impact on the graph learning community; • comprehensive experimental evaluation of GraphSAINT on several GNN layer archictures, showing significant improvements in accuracy, convergence speed and scalability compared with state-of-the-art; • a graph-theoretic algorithm to address redundancy in subgraph-level neighbor propagation, achieving significant reduction in arithmetic operations and memory accesses. 
Our contributions on expressive models include:

• a novel design principle to decouple the depth and scope of GNNs, leading to an expanded design space for most existing GNN architectures;
• theoretical analysis showing that "decoupled GNNs" achieve higher expressive power by resolving oversmoothing, approximating target functions better and identifying more neighborhood structures;
• efficient algorithms to customize small neighborhood subgraphs for each target node, which utilize insights from traditional graph analytics as useful inductive bias for GNNs;
• extension of decoupled GNNs to various layer architectures (e.g., pooling layers, neighbor aggregation functions, etc.), downstream tasks (e.g., node classification and link prediction) and the largest public graph (111 million nodes), all achieving state-of-the-art performance in accuracy and computation complexity;
• optimization on minibatch execution to significantly reduce hardware requirements (e.g., GPU memory, CPU RAM) and to enable flexible accuracy-latency tradeoffs on pre-trained models.
Our contributions on fast acceleration include:

• parallelization of the key computation kernels involved in subgraph-centric GNN execution, achieving efficient cache accesses on multi-core CPUs;
• a novel data structure supporting fast, incremental and parallel updates to the dynamic probability distribution used by subgraph samplers;
• a parallel subgraph sampling algorithm theoretically achieving near-optimal scalability with respect to the number of processing units;
• a data partitioning scheme for the neighbor propagation operation which achieves minimum computation, optimal load-balance and near-optimal communication cost;
• a hardware architecture that accelerates the execution of the various sparse and dense computation operations of subgraph-centric GNNs;
• on-chip scheduling and data layout designs achieving optimal load-balance, high hardware utilization and low communication cost;
• comprehensive performance evaluation showing that our CPU parallelization strategies and FPGA accelerator design achieve significant execution speedup compared with the state-of-the-art.

7.2 Future Work

7.2.1 Heterogeneous and Temporal Graphs

This dissertation mostly focuses on undirected, unweighted graphs. Realistic applications often require more sophisticated and elaborate data representations, such as heterogeneous graphs and temporal graphs. For example, in social recommendation, the graph may consist of multiple node types, such as users and content. Over time, the interests of a user may evolve, and new content may be created. Thus, the graph is dynamic in terms of both its structure and features, and temporal information needs to be considered in graph learning. Many of the techniques we developed can be extended to such graphs containing richer information. For example, with appropriate heterogeneity-aware subgraph samplers, GraphSAINT can be applied to minibatch training of heterogeneous graphs.
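One concrete building block for such samplers is Personalized PageRank (PPR), which our SHADOW-GNN samplers use on homogeneous graphs. As a reference point for the heterogeneous extensions discussed in this section, here is a textbook power-iteration sketch of homogeneous PPR on a toy graph (not our production sampler):

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=100):
    """Power iteration for PPR: p <- alpha * e_s + (1 - alpha) * P^T p,
    where P is the random-walk transition matrix of the graph and e_s is
    the indicator vector of the source node."""
    n = len(adj)
    p = [0.0] * n
    p[source] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for u, neighbors in enumerate(adj):
            share = (1 - alpha) * p[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        nxt[source] += alpha  # restart mass goes back to the source
        p = nxt
    return p

# Toy undirected graph as adjacency lists: edges 0-1, 0-2, 1-2, 2-3.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
p = personalized_pagerank(adj, source=0)
print(round(sum(p), 6))  # PPR scores form a probability distribution
```

A heterogeneous variant would replace the uniform neighbor transition with type-aware (e.g., metapath-guided) transitions, which is exactly where the open problems discussed here arise.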
It is an interesting and challenging research problem to design efficient and accurate sampling algorithms for heterogeneous and temporal graphs. Our samplers in SHADOW-GNN are based on classic graph analytics algorithms such as Personalized PageRank (PPR), which were originally designed for homogeneous graphs. Extending PPR to multiple node/edge types relies on a heterogeneous variant of random walk, which can potentially be implemented by ideas such as metapaths [26]. Existing works [31, 27] have combined metapaths with GNNs on heterogeneous graphs. Opportunities still exist to further improve the scalability of such models via subgraph-centric computation. Multiple node and edge types result in higher cost for both sampling and GNN layer execution, making it necessary to develop efficient algorithms and parallelization strategies.

To encode the temporal information in entities (i.e., nodes and edges), spatio-temporal sampling algorithms need to be developed to capture the time dependencies as well as the structural dependencies due to edges. In addition, we can borrow from the rich literature in time-series analysis to enhance GNN designs. For example, an integrated model combining GNNs with sequence modeling techniques (e.g., self-attention encoders and LSTMs) can be developed for forecasting future events in the graph [111, 77].

7.2.2 Large Scale Deployment on Distributed Systems

Realistic deployment of large scale graph learning workloads is often based on a highly distributed environment where multiple compute nodes (e.g., GPUs) are connected via high-speed interconnection. In such a scenario, the full graph for training is partitioned across distributed storage, and accessing inter-partition data requires expensive remote accesses. To achieve high scalability, both the algorithm and the system implementation need to be optimized.
From the algorithm level, we can develop a distributed version of GraphSAINT by utilizing the data hierarchy due to clustering. We propose a "sample-update-shuffle" execution paradigm: in the sampling phase, each compute node independently samples subgraphs from its local graph partition; in the updating phase, the compute nodes compute gradients locally and exchange the gradient information globally to update a common GNN model; in the shuffle phase, the compute nodes perform batch data transfers across the current partitions to achieve efficient data re-partitioning. The shuffle phase can be executed periodically with coarse granularity to reduce the communication cost. In addition, shuffling ensures that the full graph information is visible to all the compute nodes (given a sufficient number of shuffle rounds). The distributed GraphSAINT learns the information of a hierarchically stored graph without excessive remote communication on inter-partition neighbors; this can lead to high scalability.

From the system optimization perspective, an advanced scheduler needs to be developed to achieve overall load-balance. Due to the irregular edge connections (especially for web-scale social graphs following a power-law of degrees [7]), graph partitions on different compute nodes may exhibit very different connectivity, resulting in widely varying workloads for sampling and GNN execution. The variation in computation workload is highly dynamic due to the stochastic sampling process and the inter-partition data shuffling. Therefore, a fast and high-quality scheduler for runtime load-balancing is critical to achieve scalability in distributed execution. In addition, dynamic data structures need to be developed to support fast updates of the graph structure, whether due to data shuffling during training or temporal evolution of the graph entities.
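The three phases of the proposed paradigm can be sketched schematically. The partition layout, the scalar stand-in for the GNN model, and the toy "gradients" below are all hypothetical, meant only to show the phase structure and the coarse-grained periodic shuffle:

```python
import random

def sample_update_shuffle(partitions, model, rounds, shuffle_every, rng):
    """Schematic "sample-update-shuffle" loop over local graph partitions.

    Each round: (1) every compute node samples a subgraph (here: a node
    subset) from its local partition; (2) toy local "gradients" are
    averaged into a shared model; (3) periodically, nodes are re-shuffled
    across partitions to expose the full graph to all compute nodes.
    """
    for r in range(1, rounds + 1):
        # Sample phase: one local subgraph per partition.
        subgraphs = [rng.sample(part, min(4, len(part))) for part in partitions]
        # Update phase: local gradients, exchanged and averaged globally.
        grads = [sum(sub) / len(sub) for sub in subgraphs]
        model -= 0.01 * sum(grads) / len(grads)
        # Shuffle phase: coarse-grained periodic re-partitioning.
        if r % shuffle_every == 0:
            flat = [v for part in partitions for v in part]
            rng.shuffle(flat)
            k = len(partitions)
            partitions = [flat[i::k] for i in range(k)]
    return model, partitions

rng = random.Random(0)
parts = [list(range(i * 25, (i + 1) * 25)) for i in range(4)]  # 4 partitions, 100 nodes
model, parts = sample_update_shuffle(parts, model=0.0, rounds=20, shuffle_every=5, rng=rng)
print(sorted(v for part in parts for v in part) == list(range(100)))  # shuffle preserves the node set
```

Note how shuffling only moves data in bulk every few rounds; this is the coarse granularity that keeps the communication cost amortized.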
7.2.3 Federated Learning on Large Graphs

To further extend the idea of distributed GNN execution, we can explore subgraph sampling techniques in federated learning. Our subgraph-based computation can be applied on top of existing GNN federated learning algorithms [38, 100, 62] to further improve their accuracy and efficiency. For example, we can explore hierarchical sampling, where the initial large graph is first partitioned across the computation devices by clustering or partitioning algorithms based on minimum-cut [46, 70]. Then we derive customized second-level samplers to construct subgraphs within each partition, so that each computation device can efficiently train its own GNN model only from its local data. The stochasticity of the sampling process prevents the learned local GNN models from improving the training performance through memorizing the full neighborhood information. Thus, the subgraph sampling step benefits the privacy-preserving requirements of federated learning scenarios. We can further develop privacy-aware sampling algorithms so that each subgraph only preserves the key characteristics (e.g., degree distribution) of the full graph without leaking the true structure of the raw graph.

In addition, since our SHADOW-GNN models construct small neighborhood subgraphs that are independent of the model configuration, federated learning systems can more flexibly explore various GNN models without the additional overhead of re-partitioning the neighborhood data. Construction of small neighborhood subgraphs creates a boundary in neighborhood propagation for each target node. Such boundaries at the node level help the partitioning algorithm define boundaries at the full-graph level (reflected by the inter-partition nodes and edges). Thus, our subgraph-based execution paradigm opens up new opportunities for large scale federated GNN systems.

References

[1] Intel MKL. https://software.intel.com/en-us/mkl. Accessed: 2018-10-12.
[2] Matrix chain multiplication. http://faculty.cs.tamu.edu/klappi/csce629-f17/csce411-set6c.pdf. Accessed: 2020-07-10.
[3] NVIDIA Tesla P100 peak performance. https://images.nvidia.com/content/tesla/pdf/nvidia-tesla-p100-PCIe-datasheet.pdf. Accessed: 2019-11-30.
[4] Xilinx Alveo U200 peak performance. https://www.xilinx.com/products/boards-and-kits/alveo/u200.html#specifications. Accessed: 2019-11-30.
[5] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. CoRR, abs/1603.04467, 2016.
[6] S. Abu-El-Haija, B. Perozzi, A. Kapoor, H. Harutyunyan, N. Alipourfard, K. Lerman, G. V. Steeg, and A. Galstyan. MixHop: Higher-order graph convolution architectures via sparsified neighborhood mixing. arXiv preprint arXiv:1905.00067, 2019.
[7] W. Aiello, F. Chung, and L. Lu. A random graph model for power law graphs. Experimental Mathematics, 10(1):53–66, 2001.
[8] U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021.
[9] U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations, 2021.
[10] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 475–486. IEEE, 2006.
[11] L. Babai. Graph isomorphism in quasipolynomial time. CoRR, abs/1512.03547, 2015.
[12] A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani. The architecture of complex weighted networks. Proceedings of the National Academy of Sciences, 101(11):3747–3752, Mar 2004.
[13] S. Beamer, K. Asanovic, and D. Patterson. Reducing PageRank communication via propagation blocking. In 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017.
[14] F. M. Bianchi, D.
Grattarola, and C. Alippi. Spectral clustering with graph neural networks for graph pooling. In International Conference on Machine Learning, pages 874–883. PMLR, 2020.
[15] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, Oct 2008.
[16] A. Bojchevski, J. Klicpera, B. Perozzi, A. Kapoor, M. Blais, B. Rózemberczki, M. Lukasik, and S. Günnemann. Scaling graph neural networks with approximate PageRank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pages 2464–2473, New York, NY, USA, 2020. Association for Computing Machinery.
[17] H. Cai, V. W. Zheng, and K. C. Chang. A comprehensive survey of graph embedding: Problems, techniques and applications. CoRR, abs/1709.07604, 2017.
[18] J. Chen, T. Ma, and C. Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations (ICLR), 2018.
[19] J. Chen, J. Zhu, and L. Song. Stochastic training of graph convolutional networks with variance reduction. In ICML, pages 941–949, 2018.
[20] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li. Simple and deep graph convolutional networks. In H. D. III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1725–1735. PMLR, 13–18 Jul 2020.
[21] W. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C. Hsieh. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks. CoRR, abs/1905.07953, 2019.
[22] G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Velickovic. Principal neighbourhood aggregation for graph nets. CoRR, abs/2004.05718, 2020.
[23] G. Dai, Y. Chi, Y. Wang, and H. Yang. FPGP: Graph processing framework on FPGA: a case study of breadth-first search.
In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 105–110. ACM, 2016.
[24] G. Dai, T. Huang, Y. Chi, N. Xu, Y. Wang, and H. Yang. ForeGraph: Exploring large-scale graph processing on multi-FPGA architecture. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 217–226. ACM, 2017.
[25] F. De Dinechin, P. Echeverria, M. López-Vallejo, and B. Pasca. Floating-point exponentiation units for reconfigurable computing. ACM Transactions on Reconfigurable Technology and Systems (TRETS), 6(1):4, 2013.
[26] Y. Dong, N. V. Chawla, and A. Swami. Metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '17, pages 135–144, New York, NY, USA, 2017. Association for Computing Machinery.
[27] S. Fan, J. Zhu, X. Han, C. Shi, L. Hu, B. Ma, and Y. Li. Metapath-guided heterogeneous graph neural network for intent recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '19, pages 2478–2486, New York, NY, USA, 2019. Association for Computing Machinery.
[28] M. Fey. Just jump: Dynamic neighborhood aggregation in graph neural networks. CoRR, abs/1904.04849, 2019.
[29] M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. CoRR, abs/1903.02428, 2019.
[30] F. Frasca, E. Rossi, D. Eynard, B. Chamberlain, M. Bronstein, and F. Monti. SIGN: Scalable inception graph neural networks, 2020.
[31] X. Fu, J. Zhang, Z. Meng, and I. King. MAGNN: Metapath aggregated graph neural network for heterogeneous graph embedding. CoRR, abs/2002.01680, 2020.
[32] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman Co., USA, 1979.
[33] M. R. Garey and D. S. Johnson.
Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman Co., USA, 1990. [34] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Xu, R. Patel, and M. Herbordt. Fpdeep: Acceleration and load balancing of cnn training on fpga clusters. In 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 81–84. IEEE, 2018. [35] Y . Guan, Z. Yuan, G. Sun, and J. Cong. Fpga-based accelerator for long short-term mem- ory recurrent neural networks. In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pages 629–634, 2017. [36] K. Guo, S. Liang, J. Yu, X. Ning, W. Li, Y . Wang, and H. Yang. Compressed cnn training with fpga-based accelerator. In Proceedings of the 2019 ACM/SIGDA International Sym- posium on Field-Programmable Gate Arrays, FPGA ’19, pages 189–189, New York, NY , USA, 2019. ACM. 142 [37] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems 30, pages 1024–1034. 2017. [38] C. He, K. Balasubramanian, E. Ceyani, C. Yang, H. Xie, L. Sun, L. He, L. Yang, P. S. Yu, Y . Rong, P. Zhao, J. Huang, M. Annavaram, and S. Avestimehr. Fedgraphnn: A federated learning system and benchmark for graph neural networks, 2021. [39] K. Hornik, M. Stinchcombe, H. White, et al. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989. [40] P. Hu and W. C. Lau. A survey and taxonomy of graph sampling. arXiv preprint arXiv:1308.5865, 2013. [41] W. Hu, M. Fey, M. Zitnik, Y . Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec. Open graph benchmark: Datasets for machine learning on graphs, 2020. [42] Q. Huang, H. He, A. Singh, S.-N. Lim, and A. Benson. Combining label propagation and simple models out-performs graph neural networks. In International Conference on Learning Representations, 2021. [43] W. Huang, Y . Rong, T. Xu, F. Sun, and J. Huang. 
Tackling over-smoothing for general Graph Convolutional Networks, 2020. [44] W. Huang, T. Zhang, Y . Rong, and J. Huang. Adaptive sampling towards fast graph represen- tation learning. In Advances in Neural Information Processing Systems, pages 4558–4567, 2018. [45] G. Jeh and J. Widom. Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538–543, 2002. [46] G. Karypis and V . Kumar. A fast and high quality multilevel scheme for partitioning irreg- ular graphs. SIAM Journal on Scientific Computing , 20(1):359–392, 1998. [47] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39– 43, 1953. [48] S. Khoram, J. Zhang, M. Strange, and J. Li. Accelerating graph analytics by co-optimizing storage and access on an fpga-hmc platform. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’18, pages 239–248, New York, NY , USA, 2018. ACM. [49] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [50] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional net- works. In International Conference on Learning Representations (ICLR), 2017. 143 [51] J. Klicpera, A. Bojchevski, and S. Günnemann. Personalized embedding propagation: Com- bining neural networks on graphs with personalized pagerank. CoRR, abs/1810.05997, 2018. [52] J. Klicpera, S. Weißenberger, and S. Günnemann. Diffusion improves graph learning, 2019. [53] J. Lee, I. Lee, and J. Kang. Self-attention graph pooling. In International Conference on Machine Learning, pages 3734–3743. PMLR, 2019. [54] J. B. Lee, R. A. Rossi, X. Kong, S. Kim, E. Koh, and A. Rao. Higher-order graph convolu- tional networks. CoRR, abs/1809.07697, 2018. [55] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani. 
Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985–1042, 2010. [56] J. Leskovec and C. Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 631–636. ACM, 2006. [57] J. Leskovec and C. Faloutsos. Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, page 631–636, New York, NY , USA, 2006. Association for Computing Machinery. [58] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Trans. Knowl. Discov. Data, 1(1):2–es, Mar. 2007. [59] D. A. Levin and Y . Peres. Markov chains and mixing times, volume 107. American Mathe- matical Soc., 2017. [60] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into Graph Convolutional Networks for semi-supervised learning. arXiv preprint arXiv:1801.07606, 2018. [61] R. Li, J. X. Yu, L. Qin, R. Mao, and T. Jin. On random walk based graph sampling. In 2015 IEEE 31st International Conference on Data Engineering, pages 927–938, April 2015. [62] R. Liu and H. Yu. Federated graph neural networks: Overview, techniques and challenges, 2022. [63] S. Luan, M. Zhao, X.-W. Chang, and D. Precup. Break the ceiling: Stronger multi-scale deep graph convolutional networks. In Advances in Neural Information Processing Systems 32, pages 10945–10955. Curran Associates, Inc., 2019. [64] Y . Ma and et al. Optimizing loop operation and dataflow in fpga acceleration of deep con- volutional neural networks. In Proceedings of the 2017 ACM/SIGDA Intl. Symposium on Field-Programmable Gate Arrays, FPGA ’17, 2017. 144 [65] A. Mansour, A. El-Sawy, M. Aziz, and A. Sayed. A new hardware implementation of base 2 logarithm for fpga. 2015. [66] S. Micali and V . V . Vazirani. An o(v|v| c |e|) algoithm for finding maximum matching in general graphs. 
In 21st Annual Symposium on Foundations of Computer Science (sfcs 1980), pages 17–27, Oct 1980. [67] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 2013. [68] H. Nakahara, A. Jinguji, M. Shimoda, and S. Sato. An fpga-based fine tuning accelerator for a sparse cnn. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 186–186. ACM, 2019. [69] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Infor- mation Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. [70] F. Petroni, L. Querzoni, K. Daudjee, S. Kamali, and G. Iacoboni. Hdrf: Stream-based parti- tioning for power-law graphs. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, CIKM ’15, page 243–252, New York, NY , USA, 2015. Association for Computing Machinery. [71] R. Rajat, H. Zeng, and V . Prasanna. A flexible design automation tool for accelerating quantized spectral CNNs. In 2019 29th International Conference on Field Programmable Logic and Applications (FPL), September 2019. [72] B. Ribeiro and D. Towsley. Estimating and sampling graphs with multidimensional random walks. In Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pages 390–403. ACM, 2010. [73] Y . Rong, W. Huang, T. Xu, and J. Huang. DropEdge: Towards deep Graph Convolutional Networks on node classification. In International Conference on Learning Representations, 2020. [74] A. Roy, I. 
Mihailovic, and W. Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. [75] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al. Learning representations by back- propagating errors. Cognitive modeling, 5(3):1, 1986. 145 [76] Ppi/reddit datasets. http://snap.stanford.edu/graphsage/#datasets, 2019. [77] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin. Attention is all you need. In I. Guyon, U. V . Luxburg, S. Bengio, H. Wal- lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. [78] P. Veliˇ ckovi´ c, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y . Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017. [79] S. K. Venkataramanaiah, Y . Ma, S. Yin, E. Nurvithadhi, A. Dasu, Y . Cao, and J.- s. Seo. Automatic compiler based fpga accelerator for cnn training. arXiv preprint arXiv:1908.06724, 2019. [80] A. J. Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS), 3(3):253–256, 1977. [81] M. Wang, D. Zheng, Z. Ye, Q. Gan, M. Li, X. Song, J. Zhou, C. Ma, L. Yu, Y . Gai, T. Xiao, T. He, G. Karypis, J. Li, and Z. Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks, 2020. [82] S. Wang, Z. Li, C. Ding, B. Yuan, Y . Wang, Q. Qiu, and Y . Liang. C-LSTM: enabling efficient LSTM using structured compression techniques on fpgas. CoRR, abs/1803.06305, 2018. [83] X. Wei, C. H. Yu, P. Zhang, Y . Chen, Y . Wang, H. Hu, Y . Liang, and J. Cong. Automated systolic array architecture synthesis for high throughput cnn inference on fpgas. In Proceed- ings of the 54th Annual Design Automation Conference 2017, DAC ’17, pages 29:1–29:6, New York, NY , USA, 2017. ACM. [84] F. 
Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger. Simplifying graph convolu- tional networks. In Proceedings of the 36th International Conference on Machine Learning, pages 6861–6871. PMLR, 2019. [85] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu. A comprehensive survey on graph neural networks. CoRR, abs/1901.00596, 2019. [86] T. Xie, R. Kannan, and C. C. J. Kuo. Graphhop++: New insights into graphhop and its enhancement, 2022. [87] T. Xie, B. Wang, and C.-C. J. Kuo. Graphhop: An enhanced label propagation method for node classification. IEEE Transactions on Neural Networks and Learning Systems, pages 1–15, 2022. [88] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? arXiv preprint arXiv:1810.00826, 2018. 146 [89] K. Xu, C. Li, Y . Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka. Representation learning on graphs with jumping knowledge networks. arXiv preprint arXiv:1806.03536, 2018. [90] Yelp 2018 challenge. https://www.yelp.com/dataset, 2019. [91] R. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec. GNNExplainer: Generating explanations for graph neural networks, 2019. [92] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec. Graph con- volutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, 2018. [93] R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pages 4805–4815, USA, 2018. Curran Associates Inc. [94] H. Zeng, R. Chen, C. Zhang, and V . Prasanna. A framework for generating high throughput CNN implementations on FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’18, pages 117–126, New York, NY , USA, 2018. ACM. 
[95] H. Zeng, C. Zhang, and V . Prasanna. Fast generation of high throughput customized deep learning accelerators on FPGAs. In 2017 International Conference on ReConFigurable Computing and FPGAs (ReConFig), Dec 2017. [96] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V . Prasanna. Accurate, efficient and scalable graph embedding. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2019. [97] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V . Prasanna. GraphSAINT: Graph sam- pling based inductive learning method. In International Conference on Learning Represen- tations, 2020. [98] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V . Prasanna. Accurate, efficient and scalable training of graph neural networks. Journal of Parallel and Distributed Computing, 147:166–183, 2021. [99] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2018. [100] H. Zhang, T. Shen, F. Wu, M. Yin, H. Yang, and C. Wu. Federated graph learning - A position paper. CoRR, abs/2105.11099, 2021. 147 [101] J. Zhang, S. Khoram, and J. Li. Boosting the performance of fpga-based graph processor using hybrid memory cube: A case for breadth first search. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, pages 207–216, New York, NY , USA, 2017. ACM. [102] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y . Yeung. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294, 2018. [103] J. Zhang, W. Zhang, G. Luo, X. Wei, Y . Liang, and J. Cong. Frequency improvement of systolic array-based cnns on fpgas. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4, May 2019. [104] M. Zhang and Y . Chen. Link prediction based on graph neural networks. 
In Advances in Neural Information Processing Systems, pages 5165–5175, 2018. [105] M. Zhang, Z. Cui, M. Neumann, and Y . Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 32, 2018. [106] M. Zhang, Y . Wu, K. Chen, X. Qian, X. Li, and W. Zheng. Exploring the hidden dimen- sion in graph processing. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). [107] X. Zhang, J. Wang, C. Zhu, Y . Lin, J. Xiong, W.-m. Hwu, and D. Chen. Dnnbuilder: an automated tool for building high-performance dnn hardware accelerators for fpgas. In Proceedings of the International Conference on Computer-Aided Design, page 56. ACM, 2018. [108] L. Zhao and L. Akoglu. PairNorm: Tackling oversmoothing in gnns. In International Conference on Learning Representations, 2020. [109] W. Zhao, H. Fu, W. Luk, T. Yu, S. Wang, B. Feng, Y . Ma, and G. Yang. F-cnn: An fpga- based framework for training convolutional neural networks. In 2016 IEEE 27th Interna- tional Conference on Application-specific Systems, Architectures and Processors (ASAP) , pages 107–114. IEEE, 2016. [110] D. Zheng, C. Ma, M. Wang, J. Zhou, Q. Su, X. Song, Q. Gan, Z. Zhang, and G. Karypis. Distdgl: Distributed graph neural network training for billion-scale graphs. CoRR, abs/2010.05337, 2020. [111] H. Zhou, J. Orme-Rogers, R. Kannan, and V . K. Prasanna. Sedyt: A general framework for multi-step event forecasting via sequence modeling on dynamic entity embeddings. CoRR, abs/2109.04550, 2021. [112] S. Zhou, C. Chelmis, and V . K. Prasanna. High-throughput and energy-efficient graph processing on fpga. In 2016 IEEE 24th Annual International Symposium on Field- Programmable Custom Computing Machines (FCCM), pages 103–110, May 2016. 148 [113] Z. Zhou. Graph convolutional networks for molecules. CoRR, abs/1706.09916, 2017. [114] D. Zou, Z. Hu, Y . Wang, S. Jiang, Y . Sun, and Q. Gu. 
Few-shot representation learning for out-of-vocabulary words. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, NeurIPS, 2019. 149
Asset Metadata
Creator: Zeng, Hanqing (author)
Core Title: Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Engineering
Degree Conferral Date: 2022-12
Publication Date: 09/21/2022
Defense Date: 08/30/2022
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tags: deep learning, graph analytics, graph representation learning, high performance computing, OAI-PMH Harvest, parallel computing, reconfigurable computing
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Prasanna, Viktor (committee chair), Jia, Robin (committee member), Kuo, C.-C. Jay (committee member)
Creator Email: zengh@usc.edu, zhqhku@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC112013907
Unique Identifier: UC112013907
Legacy Identifier: etd-ZengHanqin-11238
Document Type: Dissertation
Rights: Zeng, Hanqing
Type: texts
Source: 20220921-usctheses-batch-983 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu