SCALING UP TEMPORAL GRAPH LEARNING: POWERFUL MODELS,
EFFICIENT ALGORITHMS, AND OPTIMIZED SYSTEMS
by
Hongkuan Zhou
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
May 2024
Copyright 2024 Hongkuan Zhou
Dedication
To family and friends.
Acknowledgements
I would like to express my deepest gratitude to my PhD advisor, Prof. Viktor K. Prasanna, who opened the
door to scientific research for me and provided me with guidance and support on this long journey. I still
remember our first conversation in November 2017 when I was a first-year Master’s student who felt lost
about the future. It helped me make my firm decision to start my career in parallel computing and pursue a
PhD degree. I would like to thank Prof. Rajgopal Kannan for his guidance, mentorship, and dedication to
many projects. I will always remember the motto to do research that makes a real-world impact. I would
also like to thank Hanqing Zeng and Ajitesh Srivastava for their mentorship in my initial years. They not
only inspired me with groundbreaking research directions but also taught me many important research
skills step-by-step.
I would like to send my biggest and warmest hugs to my family. My parents and wife have always been
there for me, offering unwavering support and confidence, both emotionally and financially. My parents
Sheng Zhou (father) and Xinjun Liu (mother) in China flew more than ten hours one-way multiple times a
year just to visit me after my visa expired. My wife Yixiu Liu married me in 2021 when I was a fledgling
PhD student (another significant milestone in my life!) and worked remotely in Los Angeles to stay with
me during my PhD. Without their company, I would not have been able to survive this journey, especially during
the COVID quarantine time.
I would also like to thank my mentors and managers, Da Zheng, Xiang Song, Nisa Israt, and George
Karypis, at AWS, where I did summer internships twice. Through them, I gained valuable industry
knowledge and skills, as well as exposure to the exciting possibilities within the field. Through our
collaboration, I was able to develop two key components of my dissertation.
I feel incredibly fortunate to be surrounded by so many talented and curious students in our p-group. I
really enjoyed working with Bingyi Zhang, Naifeng Zhang, Gangda Deng, Yuxin Yang, and Yuhong Liu on several
research projects, and I wish them all the best in their own PhD journeys.
I would also like to thank my defense committee, Prof. Keith Michael Chugg, Prof. Rajgopal Kannan,
Prof. Viktor K. Prasanna, and Prof. Mukund Raghothaman, and my qualifying exam committee, Prof. Keith
Michael Chugg, Prof. Rajgopal Kannan, Prof. Viktor K. Prasanna, Prof. Xuehai Qian, and Prof. Mukund
Raghothaman. They offered much helpful advice and many valuable suggestions on my thesis.
Lastly, I would like to thank once more each and every person who supported me along the way in my PhD
journey. This accomplishment would not have been possible without you all!
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Temporal Graph Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.1 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4.2 Computation Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.6 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 2: Background and Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Static Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 GNN Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Mini-Batch GNN Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Temporal Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Unified Representation of Dynamic Graphs . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Time-Aware Message Passing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2.1 Time Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2.2 Time-Aware Node Features . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.2.3 Time-Aware Aggregator . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2.4 Time-Aware Updater . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.3 Example TGNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 TGNN Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Target Hardware Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.2 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Chapter 3: TASER: Temporal Adaptive Sampling for Noise-Resistant TGNNs . . . . . . . . . . . . . 31
3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Noise in Dynamic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.1 Dynamic Graph Denoising . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Adaptive Mini-Batch Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.3 Adaptive Neighbor Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.4 Neighbor Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.5 Graph Feature Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.1 Temporal Adaptive Mini-batch Selection . . . . . . . . . . . . . . . . . . . . . . . . 40
3.4.2 Temporal Adaptive Neighbor Sampling . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4.3 GPU Temporal Neighbor Finding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.4 GPU Feature Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.3 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5.4 GPU Neighbor Finder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.5 GPU Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.5.6 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 4: TGL and DistTGL: Multi-GPU Scalable TGNN Training . . . . . . . . . . . . . . . . . . 58
4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 TGL: A General Framework for TGNN Training . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2.2 Parallel Temporal Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.3 Parallel Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.4.2 Parallel Temporal Sampler . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2.4.3 Single-GPU Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.4.4 Random Chunk Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.4.5 Multi-GPU Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 DistTGL: Distributed Memory-Based TGNN Training . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 Batched M-TGNN Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.3.2 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.3.3 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.3.3.1 M-TGNN Model with Static Node Memory . . . . . . . . . . . . . . . . . 83
4.3.3.2 Parallel Training Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3.3.3 Distributed Training System . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.4.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.4.2 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3.4.3 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.4.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.3.4.5 Training Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 5: ViTeGNN: Versatile TGNN Inferencing on FPGAs . . . . . . . . . . . . . . . . . . . . . 101
5.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 TGNN Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Conventional M-TGNN Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.4 Inference Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5 Case Study: TGN Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.6 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6.1 Inference Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6.1.1 ViTeGNN-lat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.6.1.2 ViTeGNN-thpt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.6.1.3 Latency and Throughput Analysis . . . . . . . . . . . . . . . . . . . . . . 110
5.6.2 Model Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.6.2.1 Simplified Temporal Attention . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6.2.2 Temporal Neighbor Pruning . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.6.2.3 Time Encoding Look-Up-Table . . . . . . . . . . . . . . . . . . . . . . . . 115
5.6.3 Hardware Mapping and Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.6.3.1 Overview of Hardware Architecture . . . . . . . . . . . . . . . . . . . . . 116
5.6.3.2 Neighbor Update Unit (NUU) . . . . . . . . . . . . . . . . . . . . . . . . 117
5.6.3.3 Compute Unit (CU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.6.3.4 Vertex Memory Updater . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6.4 Performance Model-Guided Dynamic Mode Selection . . . . . . . . . . . . . . . . . 121
5.6.4.1 ViTeGNN-lat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.6.4.2 ViTeGNN-bal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.6.4.3 ViTeGNN-thpt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.6.4.4 ViTeGNN-auto . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.7.1 FPGA Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.7.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.7.3 Latency and Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.7.4 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.7.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Chapter 6: Conclusion and Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2.1 In-Depth TGNN Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2.2 TGNNs on Heterogeneous Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.2.3 Multimodal TGNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
List of Tables
2.1 Common notation used in this dissertation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1 TASER Dataset Statistics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Accuracy of TASER and baselines in MRR. . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3 Total Runtime Breakdown per Epoch of TASER. . . . . . . . . . . . . . . . . . . . . . . . . 53
4.1 Dataset statistics of TGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2 Execution time and improvement of the TGL sampler. . . . . . . . . . . . . . . . . . . . . . 70
4.3 Link prediction accuracy of TGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.4 Dynamic node classification result of TGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5 Link prediction of TGL on GDELT and MAG. . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.6 Summary of the three parallel training strategies of DistTGL. . . . . . . . . . . . . . . . . . 89
4.7 Dataset statistics of DistTGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.1 Case study on TGN Inference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.2 Latency and throughput of the three inference modes in ViTeGNN. . . . . . . . . . . . . . 111
5.3 Dataset statistics of ViTeGNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4 Specifications of the hardware platforms in ViTeGNN. . . . . . . . . . . . . . . . . . . . . . 127
5.5 FPGA resource utilization of ViTeGNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.6 Accuracy of ViTeGNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.7 Ablation study of ViTeGNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.1 Summary of completed work in this dissertation. . . . . . . . . . . . . . . . . . . . . . . . . 137
List of Figures
2.1 Architecture of the NVIDIA H100 GPU [68]. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Architecture of the AMD Alveo U280 FPGA [2]. . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1 Runtime breakdown for TGAT with different numbers of neighbors. . . . . . . . . . . . . . 34
3.2 Overview of TASER. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 Sampling time and cache hit rate of TASER. . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Test MRR with different sampling budgets of TASER. . . . . . . . . . . . . . . . . . . . . . 56
4.1 Accuracy and training time of TGL on the Wikipedia dataset. . . . . . . . . . . . . . . . . . 60
4.2 Overview of TGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3 The T-CSR data structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Scalability and runtime breakdown of the TGL sampler. . . . . . . . . . . . . . . . . . . . . 70
4.5 Validation AP and runtime breakdown of TGL. . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.6 Validation loss with different chunk sizes of TGL. . . . . . . . . . . . . . . . . . . . . . . . 74
4.7 Normalized training time on the GDELT dataset of TGL. . . . . . . . . . . . . . . . . . . . 75
4.8 Test accuracy and communication time on the GDELT dataset. . . . . . . . . . . . . . . . . 77
4.9 Overview of the inaccuracy in node memory caused by batched training. . . . . . . . . . . 80
4.10 Overview of DistTGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.11 Accuracy difference of static and dynamic node memory. . . . . . . . . . . . . . . . . . . . 83
4.12 Accuracy with and without pre-trained static node memory. . . . . . . . . . . . . . . . . . 84
4.13 Overview of the parallel training algorithms in DistTGL. . . . . . . . . . . . . . . . . . . . 86
4.14 Number of captured events in the node memory. . . . . . . . . . . . . . . . . . . . . . . . . 89
4.15 Convergence curve of DistTGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.16 Performance of DistTGL under different parallelism choices. . . . . . . . . . . . . . . . . . 97
4.17 Convergence of DistTGL on the GDELT datasets. . . . . . . . . . . . . . . . . . . . . . . . 98
4.18 Training throughput of DistTGL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.1 Frequency of input to the time encoder. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.2 Overview of the hardware architecture of ViTeGNN. . . . . . . . . . . . . . . . . . . . . . 116
5.3 Neighbor Update Unit in ViTeGNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.4 Vertex Memory Updater in ViTeGNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.5 Device map of ViTeGNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.6 Latency and throughput results of ViTeGNN. . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.7 ViTeGNN performance model accuracy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Abstract
Graph Representation Learning (GRL) aims at generating low-dimensional node embedding vectors from
complex graph-structured relational data that are prevalent in the real world. Recently, Temporal Graph
Neural Networks (TGNNs) have extended the scope of GRL to dynamic graphs. TGNNs generate high-quality and versatile dynamic node embeddings by simultaneously encoding the graph structure, node
and edge contexts, and their temporal dependencies. TGNNs have been shown to demonstrably outperform
traditional dynamic graph analytic algorithms in many impactful applications that address critical real-world challenges, such as social network analysis, healthcare, weather prediction, and traffic management.
However, building accurate, robust, and scalable TGNNs is challenging due to the prevalent noise in real-world data,
irregular memory accesses, complex temporal dependencies, and high computation complexity. Specifically,
current TGNNs do not scale well with respect to the graph size due to the following issues: 1. Weak models:
current TGNN models cannot capture high-frequency information, recall long-term repetitive patterns, or
handle the diverse noise that is commonly found in real-world dynamic graphs. 2. Inefficient algorithms:
current TGNN training algorithms cannot leverage the massive parallel processing architecture of modern
hardware, while current TGNN inference algorithms cannot meet the requirements of different scenarios.
3. Unoptimized systems: current TGNN systems suffer from inefficient designs that could be optimized on
modern hardware, hindering overall performance.
In this dissertation, we address the above issues via model-algorithm-system co-design. For model
improvements, we propose a static node-memory enhancement that improves the capability of TGNN
models to capture high-frequency information and long-term repetitive patterns. We also propose a
two-fold temporal adaptive sampling technique to handle the noise in real-world dynamic graphs. For
algorithm improvements, we propose two fast temporal neighbor samplers on CPU and GPU, a scalable
distributed training algorithm with heuristic guidelines to achieve the optimal configuration based on
hardware characteristics and task specifications, and a versatile inference algorithm with various inference
modes to serve the requirements of different applications. For system improvements, we propose techniques
including dynamic feature caching, simplified temporal attention, prefetching, and pipelining to compose
optimized training and inference systems on single-node GPUs, distributed GPU clusters, and FPGAs. We demonstrate
significant improvements in accuracy, training time, inference latency, and throughput compared with
state-of-the-art TGNN solutions.
Chapter 1
Introduction
1.1 Temporal Graph Learning
Dynamic graphs provide powerful abstractions for the widely existing temporal relational data in various
science and engineering domains [110]. Dynamic graphs are characterized by their massive amounts [29],
the inherent complexity of information representation [24], and the evolving nature with respect to time [39].
Examples include the use of dynamic graphs to represent interactions between people in social networks [35], links in
wireless networks [129], and landmarks and roads in transportation networks [77]. Temporal graph learning,
which leverages structural, contextual, and temporal information to understand, codify, and derive hidden
information in dynamic graphs, is the key computational technology applicable in diverse domains [55].
To capture the evolving nature of dynamic graphs, the general form of dynamic graphs is characterized
by a series of time-stamped graph events G({Event(i, t_i)}) covering new nodes and edges appearing, node and
edge features changing, and existing nodes and edges disappearing. Given a node v and a timestamp t in
a dynamic graph G, the objective of temporal graph learning is to generate a low-dimensional vector y_v^t (also referred to as the dynamic node embedding), which contains rich contextual, structural, and temporal information in G with respect to v at time t. These meaningful low-dimensional vectors allow downstream applications to easily apply well-developed traditional or machine learning algorithms to complete their tasks. Examples include applying Multi-Layer Perceptrons (MLPs) for supervised dynamic node classification, Support Vector Machines (SVMs) for unsupervised dynamic node classification, K-Nearest Neighbors (KNNs) for dynamic graph clustering, and matrix factorization for time-aware content recommendation. Recently, along with the dominance of Graph Neural Networks (GNNs) in static graph learning, Temporal Graph Neural Networks (TGNNs) have also reigned supreme in temporal graph learning. Formally, the goal of TGNNs is to learn a function f parameterized by Θ that outputs the dynamic node embeddings

y_v^t = f_Θ(v, t, G({Event(i, t_i)})).    (1.1)
In state-of-the-art TGNNs, Θ usually consists of a few aggregation layers with learnable weight matrices
and node memory that summarize the current node status. This allows TGNNs to iteratively gather and
aggregate information from neighboring nodes to reveal hidden patterns in the relationships and properties
of nodes, edges, and their temporal dependencies. Depending on the types of dynamic graphs and the use
cases, TGNNs can be classified into interpolation or extrapolation TGNNs and discrete-time or continuous-time TGNNs. Specifically, if the embedding vector y_v^t is only computed using the information from graph events that happen before time t, the corresponding TGNN is an extrapolation TGNN. If y_v^t is computed from the whole dynamic graph G, the corresponding TGNN is an interpolation TGNN. Discrete-time or continuous-time TGNNs are distinguished by whether the timestamps {t_i} of the graph events are discrete or continuous. In this dissertation, we focus on the more general and challenging extrapolation continuous-time TGNNs that compute the embedding vectors by

y_v^t = f_Θ(v, t, G({Event(i, t_i) | t_i ≤ t})), t_i ∈ R.    (1.2)
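To make the extrapolation setting concrete, the following minimal sketch filters the event stream to events no later than the query time before invoking an embedding model. The event tuple layout and the embed placeholder are illustrative assumptions, not part of any specific TGNN discussed later.

```python
from typing import Callable, List, Tuple

# An edge event is assumed to be a (src, dst, timestamp) tuple for illustration.
Event = Tuple[int, int, float]

def extrapolation_embedding(v: int, t: float, events: List[Event],
                            embed: Callable[[int, float, List[Event]], list]) -> list:
    """Compute y_v^t using only events with t_i <= t (the extrapolation constraint)."""
    visible = [e for e in events if e[2] <= t]  # hide future events from the model
    return embed(v, t, visible)

# Toy usage: a placeholder "model" that just counts a node's visible interactions.
events = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0)]
count_embed = lambda v, t, ev: [sum(1 for s, d, _ in ev if v in (s, d))]
print(extrapolation_embedding(0, 3.0, events, count_embed))  # [1]: only (0, 1, 1.0) is visible
```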
Example TGNN applications include:
• Recommendation Systems: TGNNs can analyze user interactions and network dynamics to recommend relevant products, content, or connections on social media platforms.
• Fraud Detection: Analyzing user behavior patterns over time allows TGNNs to identify suspicious
activities and flag potential fraudulent accounts.
• Disease Outbreak Prediction: By analyzing disease spread patterns across geographical networks,
TGNNs can predict the trajectory of outbreaks and inform public health interventions.
• Traffic Flow Prediction: Analyzing historical traffic data and real-time sensor readings, TGNNs can
predict traffic congestion and suggest optimal routes for navigation systems.
• Stock Market Prediction: Analyzing historical stock prices and company relationships, TGNNs can
identify patterns and predict future market trends (although with inherent limitations due to market
complexity).
• Credit Risk Assessment: Analyzing a borrower’s financial history and social network connections
can help lenders assess creditworthiness and make informed loan decisions.
• Supply Chain Management: TGNNs can track the movement of goods through a supply chain
network, optimizing logistics and identifying potential disruptions.
• Climate Modeling: Analyzing weather patterns and interactions between various climate components
over time allows TGNNs to improve climate models and understand long-term trends.
1.2 Motivation
Real-world dynamic graphs are massive. Social media interactions, financial transactions, and scientific
collaborations all generate massive networks with dynamic structures and contents. For example, the online
retail platform Amazon has 310 million active users and billions of products in 2023 [90]. The social network
Facebook has more than 3 billion monthly active users in 2023 [25]. A weather model with 1 km² resolution
leads to 0.51 billion nodes on the surface of the entire Earth. To capture these vast and ever-changing
relational data, real-world dynamic graphs can easily have billions of graph events.
Scaling up temporal graph learning to these large-scale dynamic graphs is crucial to solving many
impactful real-world problems, including social network analysis, healthcare, weather prediction, and
traffic management. Without the ability to handle these massive dynamic graphs, TGNNs could not be
practical for real-world applications. Large-scale implementations are crucial to bridge the gap between
theoretical capabilities and real-world impact. In addition, larger dynamic graphs encompass more historical
information and capture a wider range of interactions. This allows TGNNs to learn more intricate patterns
and make more accurate predictions. Real-world applications impose stringent requirements on both
training and inference speed. Recommender systems in social networks, for instance, need models that
update daily or even hourly. Fraud detection systems demand decisions within milliseconds. However,
achieving these speeds must not come at the cost of accuracy. This dissertation aims to develop accurate,
robust, and scalable TGNN training and inferencing solutions for real-world applications on CPUs, GPUs,
and FPGAs.
1.3 Challenges
There are four major challenges in scaling up temporal graph learning:
C1. Prevalent noise in real-world data. Dynamic graphs, much like their static counterparts, are
not immune to the presence of noise, which adds false and irrelevant information to the graph
signals. Specifically, dynamic graphs introduce two distinctive types of noise: (1) Deprecated links.
Dynamic graphs accumulate an increasing number of interactions over time. Some old interactions
could be irrelevant or even convey information that contradicts the current node status. (2) Skewed
neighborhood distribution. The distribution of interactions among different nodes in a dynamic graph
often exhibits significant disparities or sparsity. Unlike static graphs, dynamic graphs have many
repeated edges between the same two nodes at different timestamps. A long-standing node may
exhibit a skewed distribution of neighbors, while an emerging node may have very few neighbors.
For example, deprecated links can be observed in a social network graph when a person relocates to
another country, rendering the previous connections gradually less informative or even incorrect.
Skewness becomes evident when an individual engages in daily conversations with their best friend
while sending only a single message to a car dealer. The noise in dynamic graphs causes two critical
issues that significantly impair the accuracy of TGNNs. Firstly, when performing self-supervised
training with the link prediction task, inferior interactions are used as positive links. Secondly, it is
amplified in the iterative message-passing process, leading to high variance in the output embeddings.
The larger the dynamic graph, the more noise there exists, making it more challenging to learn the
obscured patterns.
C2. Irregular memory accesses. Graph algorithms are typically memory-bound problems [34], where
the random memory accesses are the bottleneck of the whole system. In temporal graph learning,
irregular memory accesses pose three challenges: (1) On large dynamic graphs, TGNNs only aggregate
from sampled temporal neighbors instead of looking at all of them to keep the complexity manageable.
In this sampling process, the sampler needs to consider the timestamps of the interaction between
the target nodes and their temporal neighbors, which incurs a large volume of random accesses to
the dynamic graph structure. (2) After the temporal neighbors are sampled, TGNNs need to fetch
their node features and corresponding edge features. Compared with the graph structure, the feature
information is several orders of magnitude larger and cannot be stored in the fastest memory closest to the
processor for large-scale dynamic graphs. Existing approaches [26, 64, 91] to accelerate
feature fetching on graphs do not consider the irregular memory access pattern on dynamic graphs,
resulting in poor performance. (3) State-of-the-art TGNNs [79, 56, 95, 105] maintain node-level history
to enhance the capability of TGNN aggregators in capturing long-term information. The reads and
writes on each node history need to strictly follow the chronological order, creating synchronization
overheads on modern processors.
C3. Complex temporal dependencies. The node-level history creates temporal dependencies and
requires training mini-batches to be small and scheduled in chronological order. Simply increasing the
batch size to achieve more data parallelism discards the temporal dependencies between events
and leads to information loss. On large dynamic graphs, the maximum amount of data parallelism to
exploit before significant accuracy loss is less than the parallelism provided by computing platforms
with multiple processors. This greatly restricts TGNNs from exploiting the advancements of the
massive parallel architecture from modern processors. Note that some of the irregular memory
accesses (Challenge C2) are also results of maintaining complex temporal dependencies.
C4. High computation complexity. TGNNs are significantly more compute-intensive compared with
static GNNs. In order to accurately capture the evolving nature of temporal neighborhoods, TGNNs
maintain node-level history [79] and adopt time encoding [116], which are both concatenated to
the original node and edge features, making the hidden dimension of TGNNs 2-3× larger than
static GNNs. In addition, most TGNNs [116, 79, 105, 106] rely on the temporal attention mechanism
(adopted from Transformer [97]) to aggregate features from temporal neighbors along with additional
sequence models like RNNs and GRUs. An artifact of this mechanism is that it requires computing
additional “keys” and “queries” for each temporal neighbor (more than 2× the number of operations
of a mean or max pooling aggregator). The combined effect makes TGNNs an order of magnitude
more expensive in computation complexity than static GNNs.
1.4 Scope
This dissertation focuses on scaling up temporal graph learning by performing model-algorithm-system
co-design to achieve accurate, robust, and scalable extrapolation and continuous-time TGNNs. Specifically,
the outcome of this dissertation is TGNN systems that (1) are robust to the widely existing noise in large-scale dynamic graphs, (2) achieve high accuracy and convergence speed in training, using single-node or
distributed computing platforms, and (3) achieve high accuracy, low latency, high throughput, and low
power consumption at inference.
1.4.1 Metrics
The major metrics used in this dissertation are:
• Training and inference accuracy. Accuracy is an important metric both in training and inferencing.
It reflects the expressivity of the TGNN models and the efficiency of the training and inferencing
algorithms. High accuracy demonstrates joint efforts from powerful models and efficient training
and inferencing algorithms. In this dissertation, the metrics for accuracy include Average Precision
(AP) and Mean Reciprocal Rank (MRR) for the temporal link prediction task and AP and F1-Micro for
the dynamic node classification task (a minimal sketch of the MRR computation is given at the end of this subsection).
• Training time. Training time measures how long the training system needs to reach a certain
accuracy. Fast training time demonstrates joint efforts from efficient training algorithms and optimized
systems. In this dissertation, training time is measured in seconds.
• Inference latency and throughput. Real-world TGNN applications have different latency and
throughput requirements. For example, fraud detection systems may focus on latency, while recommender systems may focus on throughput. It is important for TGNN inference systems to
have different variants optimized for different inference scenarios. In this dissertation, we measure
inference latency in milliseconds and inference throughput in thousands of interactions per second.
Depending on the specific research problem, other metrics used in this dissertation include convergence
speed, number of memory accesses, and number of multiply-accumulate operations.
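As a concrete reference for the ranking-based accuracy metric above, the sketch below computes MRR for temporal link prediction under a common evaluation setup in which each positive edge is scored against K negative candidates; this setup is an illustrative assumption rather than the exact protocol used in later chapters.

```python
import numpy as np

def mean_reciprocal_rank(pos_scores: np.ndarray, neg_scores: np.ndarray) -> float:
    """MRR for link prediction.

    pos_scores: shape (N,), model score of each true (positive) edge.
    neg_scores: shape (N, K), scores of K negative candidates per positive edge.
    The rank of a positive edge is 1 plus the number of negatives scored higher.
    """
    ranks = 1 + (neg_scores > pos_scores[:, None]).sum(axis=1)
    return float((1.0 / ranks).mean())

# Toy usage: two positives, three negatives each.
pos = np.array([0.9, 0.2])
neg = np.array([[0.1, 0.5, 0.3],   # positive outranks all negatives -> rank 1
                [0.4, 0.1, 0.6]])  # two negatives outrank it -> rank 3
print(mean_reciprocal_rank(pos, neg))  # (1/1 + 1/3) / 2 = 0.666...
```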
1.4.2 Computation Platforms
Modern computing platforms offer great opportunities for parallel computing by providing different types
of processors. The most popular processor types are CPU, GPU, and FPGA. CPUs excel at general-purpose
computation and task management. With hundreds of cores in state-of-the-art CPUs [3] paired with fast
Double Data Rate (DDR) RAM, CPUs can execute programs with complex branches in parallel. CPUs have
complex memory hierarchies consisting of disks, main memories, and multiple levels of on-chip cache,
boosting the performance of programs with frequent random memory accesses. In addition, the paired DDR
RAM could be as large as multiple terabytes, allowing large datasets to be fully loaded. State-of-the-art
GPUs [69] have tens of thousands of cores and excel at parallel processing tasks. These cores are divided
into bundles, where each bundle usually consists of 32 cores and must execute the same instruction at the
same time. State-of-the-art GPUs use Graphics DDR (GDDR) RAM or High Bandwidth Memory (HBM) as
their Video RAM (VRAM), which provides much higher bandwidth at the cost of a much higher latency.
As a result, GPUs perform extremely well at highly parallel tasks such as matrix multiplication but
fall behind in sequential tasks or tasks with many branches. Different from CPUs and GPUs, FPGAs
have highly flexible integrated circuits that can be configured to perform specific tasks. When properly
programmed, these extremely customizable processors can achieve high throughput and low latency for
real-time applications using significantly less power than CPUs and GPUs.
In this dissertation, we focus on using single-node CPU-GPU systems and distributed CPU-GPU systems
for TGNN training and FPGAs for TGNN inference. For training, we follow the industry standard and
assume each computing node is equipped with a single CPU or dual CPUs and a single GPU or multiple
GPUs connected to the CPUs via PCIe. The GPUs may have high-speed interconnects,
such as NVLink [111], to exchange model weights and other information directly among the GPUs. We
choose GPUs as the primary computing resources due to their prominent advantages in performing the
basic building blocks of machine learning models: sparse and dense linear algebra operations. Hosting
the GPUs as accelerators, CPUs can store large dynamic graphs in their main memory when they do
not fit into the VRAM of GPUs and also allow GPUs to offload some hard-to-parallelize operations, such
as the temporal neighbor sampling operation. For inference, we choose to use FPGAs since they have
extraordinary energy efficiency and also allow the programmers to implement flexible data paths and
customized memory organizations, which fit the use cases of TGNN inference.
1.5 Contributions
In this dissertation, we propose model-algorithm-system co-design to address the aforementioned four
challenges: (C1) prevalent noise in real-world data, (C2) irregular memory accesses, (C3) complex temporal
dependencies, and (C4) high computation complexity. This dissertation contributes to TGNN in three folds:
model, algorithm, and system.
• Model
M1. To mitigate the iteratively amplified noise in the message-passing process of TGNNs, we
propose a temporal adaptive neighbor sampling technique to select personalized high-quality
supporting neighbors for each target node individually.
M2. We create a unified TGNN abstraction that supports most major TGNN architectures by studying
the characteristics of a diverse set of TGNN variants, including snapshot-based TGNNs, time
encoding-based TGNNs, and memory-based TGNNs.
M3. We enhance the node memory in TGNNs by adding additional static node memory, which
improves both the accuracy and convergence rate.
M4. We propose a lightweight temporal attention mechanism and a related neighbor pruning
strategy, significantly reducing the computation and memory accesses at inference. We design
a knowledge distillation setup to train our simplified models to ensure comparable accuracy.
• Algorithm
A1. To avoid noisy samples being used in the self-supervised TGNN training, we propose a temporal
adaptive mini-batch selection technique that dynamically selects high-quality training samples
during the training process.
A2. We design and implement the first GPU neighbor finder for dynamic graphs, which is optimized
for the massive Single Instruction Multiple Data (SIMD) GPU architecture.
A3. We design a CSR-based data structure for rapid access to temporal neighbors and a parallel
sampler that supports different temporal neighbor sampling algorithms. Our parallel sampler
can quickly locate the temporal edges to sample from by maintaining auxiliary pointer arrays.
A4. We propose a novel random chunk scheduling technique that overcomes the deprivation of
intra-dependency when training with a large batch size for the methods using node memory,
which enables multi-GPU training on large-scale dynamic graphs.
A5. Based on the unique characteristics of TGNN training, we propose two novel parallel training
strategies — epoch parallelism and memory parallelism, which allow TGNNs to capture the
same number of dependent graph events on multiple GPUs as on a single GPU. We provide
heuristic guidelines to determine the optimal training configurations based on the dataset and
hardware characteristics.
A6. Besides the conventional TGNN inference mode (we name it ViTeGNN-bal), we propose
ViTeGNN-lat and ViTeGNN-thpt, two inference modes specialized for latency- and throughput-critical TGNN applications.
• System
S1. We design a dynamic GPU cache to speed up the feature-slicing process for large-scale datasets
that cannot be fully stored on the GPU VRAM. Our cache replacing policy achieves near-optimal
performance and requires minimal maintenance overhead.
S2. We design and implement the first single-node multi-GPU TGNN training system, allowing
users to compose different TGNN variants by writing simple configuration files.
S3. To overlap mini-batch generation and GPU training, we serialize the memory operations on
the node memory and efficiently execute them by an independent daemon process, avoiding
complex and expensive synchronizations across distributed trainers.
S4. To efficiently execute different TGNN inference modes, we design a unified hardware accelerator
on FPGA with a flexible data path and customized memory organization. It leads to zero
overhead of switching between inference modes at runtime. In the proposed hardware design,
we develop a novel hardware module to efficiently execute the complex neighbor list updating
process. We also propose a hardware mechanism to ensure chronological vertex updates
without sacrificing the computation parallelism. To dynamically select the best inference
mode at runtime, we further propose ViTeGNN-auto, an automatic inference mode guided by
a predictive performance model based on algorithm parameters, design configurations, and
memory characteristics.
These contributions lead to three major works on adaptive sampling for noise reduction, scalable
training, and versatile inferencing:
• TASER. Our contributions M1, A1, A2, and S2 lead to TASER, the first adaptive sampling method for
TGNNs optimized for accuracy, efficiency, and scalability on real-world noisy dynamic graphs. On
five popular datasets, TASER outperforms the corresponding baselines by an average of 2.3% in MRR
while achieving an average speedup of 5.1× in the total training time.
• TGL and DistTGL. Our contributions M2, A3, A4, and S3 lead to TGL, the first unified framework for
large-scale TGNN training on single-node multiple-GPU machines. TGL achieves similar or higher
accuracy for all baseline methods with an average speedup of 13×. Using 4 GPUs on a single machine,
TGL achieves an average of 2.3× speedup compared to a single GPU. Our contributions M3, A5, and
S3 lead to DistTGL, an efficient and scalable solution to train memory-based TGNNs on distributed
GPU clusters. DistTGL achieves near-linear speedup when scaling up to 32 GPUs in four machines,
outperforming the state-of-the-art single-machine method by more than 10× in the total training time.
• ViTeGNN. Our contributions M4, A6, and S4 lead to ViTeGNN, a versatile TGNN inference solution
for memory-based TGNNs on FPGAs. Compared with the state-of-the-art implementations on CPU
and GPU, ViTeGNN on FPGA achieves 53.9/26.0/16.1× speedup and 8.2/4.0/2.5× speedup in the
ViTeGNN-lat/-bal/-thpt modes with less than one-fifth of the energy used, respectively.
It is worth noting that we release the open-source code of all these works on GitHub, together with two
large-scale dynamic graph datasets with billions of edges, the GDELT and MAG datasets, which represent
dynamic graphs with long time duration and dynamic graphs with a large number of nodes. These open-source code releases and datasets accelerate advancements in TGNN research. For example, GraphMixer [22]
uses our TGL framework to develop new TGNN models. Zheng et al. [128] use the GDELT and MAG
datasets to evaluate the performance of a decoupled TGNN model.
1.6 Organization
The rest of this dissertation is organized as follows. In Chapter 2, we review the fundamentals of GNNs and
TGNNs. In Chapter 3, we propose TASER, a fast and accurate solution for noise in dynamic graphs. To
handle the diverse noise, TASER adapts its mini-batch selection based on training dynamics and temporal
neighbor selection based on the contextual, structural, and temporal properties of past interactions. To
alleviate the bottleneck in mini-batch generation, TASER implements a pure GPU-based temporal neighbor
finder and a dedicated GPU feature cache. In Chapter 4, we propose TGL and DistTGL, efficient and
scalable solutions to train TGNNs on single-node multiple-GPU machines and distributed GPU clusters.
TGL proposes a unified framework for various TGNN variants with optimizations in neighbor sampling and
multi-GPU training. DistTGL improves memory-based TGNN training with a more powerful static node
memory-enhanced model, a novel parallel training algorithm on multiple trainers, and an optimized system
with zero mini-batch generation overhead. In Chapter 5, we propose ViTeGNN, a versatile TGNN inference
solution on FPGAs. ViTeGNN adopts a better inference algorithm tailored for different applications, a
suite of algorithmic model optimizations to solve the computation and memory bottlenecks, and a unified
hardware architecture design with various optimizations. Finally, in Chapter 6, we conclude the dissertation
and propose several future directions.
Chapter 2
Background and Related Works
In this chapter, we first discuss the fundamentals of GNNs and TGNNs, including how TGNNs evolve from
GNNs, and then introduce the target hardware platforms used in this dissertation.
2.1 Notation
In this dissertation, unless otherwise stated, we use calligraphic letters to represent sets; bold, capital, and
italic letters to represent matrices; bold and italic letters to represent vectors; and bold letters to represent
scalars or other single elements. Table 2.1 shows the common notations used in this dissertation.
2.2 Static Graph Neural Networks
Static GNNs are multi-layer Neural Networks (NNs) that generate embeddings for nodes∗
in static graphs
by gathering and aggregating from neighboring nodes iteratively, which is also referred to as the “message
passing” process. The generated node embeddings have been proven to significantly outperform traditional
graph signal processing [82, 86, 75, 83], random walk [36, 115, 114, 30], and matrix factorization [40, 15, 118,
6] methods.
∗ Edge- and graph-level embeddings could be obtained by further processing the node embeddings, e.g., concatenating two
node embeddings for an edge embedding and averaging the node embeddings for a graph embedding. We focus on generating
node embeddings here.
Table 2.1: Common notation used in this dissertation.

Symbol        Description
G(V, E)       A dynamic graph with node set V and edge set E
u, v, i, j    Nodes in dynamic graphs
e             An edge in dynamic graphs
v_v           The node feature vector of node v
e_ij          The edge feature vector of edge ij
N_v(t)        The set of temporal neighbors of node v at timestamp t
N_v^s(t)      The set of sampled temporal neighbors of node v at timestamp t
s_v           The node memory of node v
t_v^-         The time when the node memory s_v was last updated
Φ(·)          The time encoder
(·)^(l)       Variables at the l-th layer
h_v^(l)       The hidden feature vector of node v in the l-th layer
We first formally define the general form of a GNN operating on a static graph G(V, E) with node feature matrix X ∈ R^{|V|×d^(0)} (d^(0) is the feature vector length of each node). The l-th GNN layer takes input from the hidden feature matrix H^(l−1) ∈ R^{|V|×d^(l−1)} and performs an aggregation step and a transformation step to output H^(l) ∈ R^{|V|×d^(l)}. The input node feature matrix X serves as the hidden feature matrix H^(0), while the hidden feature matrix in the last layer, H^(L), contains the output node embeddings. From the perspective of a single node v, its hidden feature vector h_v^(l) is computed by

h_v^(l) = UPDT(h_v^(l−1), AGGR({h_u^(l−1) | u ∈ N_v})),    (2.1)

where N_v is the set of (selected) neighboring nodes of node v. If we compute the node embeddings of all the nodes in the graph together,

H^(l) = UPDT(H^(l−1), AGGR(H^(l−1), E)).    (2.2)
Using different UPDT and AGGR functions, GNNs develop many different variants. For example, the vanilla
GNN Graph Convolutional Network (GCN) [54] uses a simple mean operation as the AGGR function.
15
The UPDT function in GCN only takes input from the aggregated hidden features. GraphSAGE [38] also uses the mean operation as the AGGR function but improves the UPDT function by concatenating the aggregated features with the node's own hidden features. There are also more complex GNNs that incorporate attention mechanisms [98, 8], residual connections [60, 117], and mechanisms to handle edge features [31]. We do not review their details since they are outside the scope of this dissertation.
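To make the AGGR/UPDT abstraction of Equation 2.1 concrete, the following minimal NumPy sketch implements one layer with a mean AGGR and a GraphSAGE-style concatenation UPDT followed by a ReLU; the adjacency layout and weight shapes are illustrative assumptions, not the exact layers used later in this dissertation.

```python
import numpy as np

def gnn_layer(h: np.ndarray, neighbors: dict, w: np.ndarray) -> np.ndarray:
    """One message-passing layer (Equation 2.1) with mean AGGR and concat UPDT.

    h: (num_nodes, d_in) hidden features from the previous layer.
    neighbors: {node: list of neighbor ids}, i.e., the selected neighborhood N_v.
    w: (2 * d_in, d_out) learnable weight of the UPDT step.
    """
    out = np.zeros((h.shape[0], w.shape[1]))
    for v in range(h.shape[0]):
        nbrs = neighbors.get(v, [])
        agg = h[nbrs].mean(axis=0) if nbrs else np.zeros(h.shape[1])  # AGGR: mean over N_v
        out[v] = np.maximum(np.concatenate([h[v], agg]) @ w, 0.0)     # UPDT: concat + ReLU
    return out

# Toy usage: 3 nodes with 2-d features mapped to 4-d embeddings by one layer.
rng = np.random.default_rng(0)
h0 = rng.normal(size=(3, 2))
adj = {0: [1, 2], 1: [0], 2: [0]}
print(gnn_layer(h0, adj, rng.normal(size=(4, 4))).shape)  # (3, 4)
```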
2.2.1 GNN Training
The learnable weights in the functions UPDT and AGGR could be obtained in many ways. If the
downstream tasks that take the node embeddings H^(L) as input use a supervised machine learning model (e.g., a supervised node classification task using a Multi-Layer Perceptron (MLP) as a classifier), the GNN weights could be directly updated using the gradients back-propagated from the downstream task. If ground
truth information is hard to obtain, GNN weights could also be learned by performing semi-supervised
learning [54], self-supervised learning [78], or even unsupervised learning [119]. We also omit the details of
these methods since they are outside the scope of this dissertation.
2.2.2 Mini-Batch GNN Computation
On large graphs, it is not possible to compute the node embeddings for all nodes in a single pass, following
Equation 2.2. Hence, GNNs split the node embeddings into smaller mini-batches and follow Equation 2.1 to
compute batch by batch. The input, intermediate variables, and output during the computation of a single
mini-batch usually fit in the closest memory of the processor (i.e., VRAM for GPUs), avoiding expensive
memory swapping operations. During training, this Stochastic Gradient Descent (SGD) approach not only
reduces the required memory size but also improves the generalizability of the trained GNN models [4].
However, the mini-batch computation for a batch of nodes B in the l-th layer requires not only the hidden features of these nodes in the previous layer {h_v^(l−1) | v ∈ B} but also the hidden features of their (selected) neighbors {h_u^(l−1) | u ∈ ∪_{v∈B} N_v}. When computing the (l−1)-th layer, the hidden features of both {h_v^(l−1) | v ∈ B} and {h_u^(l−1) | u ∈ ∪_{v∈B} N_v} need to be computed. This leads to the number of involved nodes growing exponentially with respect to the number of layers, which is referred to as the “neighbor explosion” problem.
To solve the neighbor explosion problem, researchers have developed methods to limit the number of
supporting neighbors per layer [38, 14, 13] and sample subgraphs instead of neighbors [122, 18]. These
methods work well in both training and inferencing, leading to satisfactory performance of static GNNs on
large-scale graphs.
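A minimal sketch of the per-layer neighbor-limiting idea is shown below: starting from the output mini-batch, each layer keeps at most a fixed fanout of sampled neighbors per node, so the supporting set grows at most geometrically with the number of layers instead of exploding with full neighborhoods. The data layout and helper names are illustrative assumptions.

```python
import random

def sample_supporting_nodes(batch, neighbors, fanouts, seed=0):
    """Cap the supporting neighborhood at a fixed fanout per node in each layer.

    batch: list of output nodes; neighbors: {node: list of neighbor ids};
    fanouts: max sampled neighbors per layer, e.g. [10, 10] for a 2-layer GNN.
    Each returned frontier accumulates the nodes needed one hop further out.
    """
    rng = random.Random(seed)
    frontiers = [set(batch)]
    for fanout in fanouts:
        picked_nodes = set()
        for v in frontiers[-1]:
            nbrs = neighbors.get(v, [])
            picked = nbrs if len(nbrs) <= fanout else rng.sample(nbrs, fanout)
            picked_nodes.update(picked)
        frontiers.append(picked_nodes | frontiers[-1])  # keep the nodes themselves as well
    return frontiers

# Toy usage: cap each layer at 2 sampled neighbors per node.
adj = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0], 3: [0], 4: [0], 5: [1]}
print([len(f) for f in sample_supporting_nodes([0], adj, fanouts=[2, 2])])
```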
2.3 Temporal Graph Neural Networks
Before we introduce how TGNNs evolve from static GNNs, we first formally define dynamic graphs.
2.3.1 Unified Representation of Dynamic Graphs
To capture the evolving nature of the nodes, edges, and their features, dynamic graphs are represented by a
series of time-stamped graph events G({Event(i, ti)}). Specifically, the graph events {Event(i, ti)} could
be further classified into six types: {N+}, {E+}, {NF}, {EF}, {N-}, {E-}.
• N+(v, v_v, t): New node v appears with node features v_v at time t.
• E+(e, e_ij, t): New edge e appears with edge features e_ij at time t.
• NF(v, v_v, t): Node v changes features to v_v at time t.
• EF(e, e_ij, t): Edge e changes features to e_ij at time t.
• N-(v, t): Node v disappears at time t.
• E-(e, t): Edge e disappears at time t.
Note that different from static graphs, two nodes in a dynamic graph can have multiple edges at different
time-stamps with multiple edge features. Since handling these six types of events independently creates
unnecessary overheads, without loss of generality, we can transform all event types into new edge-appearing
events E+.
• N+(v, v_v, t) → ∅: since a newly appearing node does not have any connections with the existing dynamic
graph, it would not affect the results of the message passing process of other nodes. Hence, we can
simply record the node features v_v and ignore the event of new nodes appearing. Note that some
new edge-appearing events naturally indicate new nodes appearing. We also need to record the node
features of these nodes.
• NF(v, v_v, t) → {E+(e, e_uv, t) | u ∈ N_v}: we can transform the graph event of the feature vector of
node v changing to v_v into a set of graph events of new edges appearing, where the new edges are
from nodes u ∈ N_v with edge feature vectors {e_uv} representing the node feature change. This is to
“notify” the neighbors of v that the node feature vector of v changes. An alternative way is to simply
change the recorded node feature vector without notifying the neighboring nodes.
• EF(e, e_ij, t) → E+(e, e_ij, t) or {E+(e, e_ui, t) | u ∈ N_i} ∪ {E+(e, e_vj, t) | v ∈ N_j}: there are three ways
to transform the graph event of the feature vector of edge e changing to e_ij. The most complex one is
to transform it into two sets of graph events of new edges appearing. One set is to notify the neighbors
of i, and the other set is to notify the neighbors of j, similar to the case of node feature changing. A
simpler alternative is to add an edge between nodes i and j with edge features representing a change
of the previous edge feature. The simplest alternative is only to change the recorded edge features.
• N-(v, t) → {E+(e, e_uv, t) | u ∈ N_v}: similar to node features changing, we can notify the neighboring
nodes of v for a node deletion event.
• E-(e, t) → E+(e, e_ij, t) or {E+(e, e_ui, t) | u ∈ N_i} ∪ {E+(e, e_vj, t) | v ∈ N_j}: similar to edge features
changing, we can either notify the neighboring nodes of i and j for an edge deletion event or just create
an edge addition event with edge features representing the deletion of a past edge.
Now, we can focus on handling dynamic graphs G({E+(e, e_ij, t)}) with new edges appearing as the graph
events. For better readability, we use E+(i, j, e_ij, t) to denote a graph event of an edge appearing at time
t from node i to node j with edge feature vector e_ij.
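Under this unified representation, a dynamic graph reduces to a chronologically ordered stream of E+ events. The sketch below shows one plausible in-memory layout together with a helper that retrieves a node's temporal neighbors before a query time; the field names are illustrative assumptions rather than the data structures used in later chapters.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class EdgeEvent:
    """One E+(i, j, e_ij, t) event: an edge from src to dst with features feat at time t."""
    src: int
    dst: int
    t: float
    feat: List[float] = field(default_factory=list)

def temporal_neighbors(events: List[EdgeEvent], v: int, t: float) -> List[EdgeEvent]:
    """All events incident to v that happened strictly before time t, i.e., N_v(t)."""
    return [e for e in events if e.t < t and v in (e.src, e.dst)]

# Toy usage: three events sorted by timestamp.
stream = [EdgeEvent(0, 1, 1.0), EdgeEvent(1, 2, 2.5), EdgeEvent(0, 2, 4.0)]
print(len(temporal_neighbors(stream, 1, 3.0)))  # 2: the events at t=1.0 and t=2.5 touch node 1
```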
2.3.2 Time-Aware Message Passing
Similar to static GNNs, TGNNs perform time-aware message passing to jointly learn the structural, contextual, and temporal information in dynamic graphs. The general form of time-aware message passing could
be written as
h_v^(l) = UPDT({h_v^(l−1)(t), AGGR({h_u^(l−1)(t), e_uv, t | u ∈ N_v}) | t ∈ T_SNAP}),    (2.3)

where T_SNAP is the set of timestamps of previous snapshots. Compared with Equation 2.1 for (static) message passing, two obvious differences are that the UPDT and AGGR functions each take more inputs. In fact, we identify three improvements to make the message passing process time-aware: (1) enhancing the input node features h_v^(0), v ∈ V to make them time-aware, (2) enhancing the AGGR function to make the aggregation
process time-aware, and (3) enhancing the UPDT function to make the update process time-aware. Note
that the unified TGNN training framework TGL, which we will introduce in Chapter 4, is based on this
abstraction of the major building blocks of TGNNs.
2.3.2.1 Time Encoding
We first introduce the time encoding technique, which is widely used in these TGNN building blocks. The
time encoding maps delta timestamps to fixed-length vectors that usually serve as auxiliary node and edge
features.
For any two timestamps t1 and t2, we want to obtain a continuous function Φ : T → R^{d_T} that maps the
time difference to a d_T-dimensional vector space. Inspired by the positional encoding in the Transformer
architecture [96], Xu et al. propose the first time encoding technique [116]. Guided by Bochner's
Theorem, they design a positive-semidefinite kernel K

K(t1, t2) := ⟨Φ(t1), Φ(t2)⟩ = ∫_R e^{iω(t1−t2)} p(ω) dω,   (2.4)
where p(ω) is the probability measure. The real part of Equation 2.4 is
K(t1, t2) = Eω [cos(ω(t1 − t2))] = Eω [cos(ωt1) cos(ωt2) + sin(ωt1) sin(ωt2)] , (2.5)
which could be approximated using the Monte Carlo integral [76]

K(t1, t2) ≈ (1/d) Σ_{i=1}^{d} [cos(ω_i t1) cos(ω_i t2) + sin(ω_i t1) sin(ω_i t2)].   (2.6)
Hence, we can derive the time encoding function

Φ(t) = cos(ωt + ϕ),   (2.7)

where ω and ϕ are two learnable vectors with d_T dimensions.
A later work [22] points out that the aforementioned time encoding function leads to rough loss
landscapes, thus suffering from training instability and generalizing poorly. Specifically, the gradient
of the learnable time encoding

∂Φ(t)/∂ω = −t sin(ωt + ϕ)   (2.8)

scales proportionally to the timestamp. To solve this issue, they propose a fixed time encoding function

Φ(t) = cos(ωt),  ω = {α^{−(i−1)/β}}_{i=1}^{d_T},   (2.9)
where α and β are two hyper-parameters depending on the scale of the maximum timestamp to encode.
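Both time encodings can be implemented in a few lines. The following PyTorch sketch shows a learnable encoder for Equation 2.7 and a fixed encoder for Equation 2.9; the default values of α and β are placeholders, since their appropriate scale depends on the maximum timestamp of the dataset.

import torch
import torch.nn as nn

class LearnableTimeEncoder(nn.Module):
    # Phi(t) = cos(w * t + phi) with learnable w and phi (Equation 2.7).
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn(dim))
        self.phi = nn.Parameter(torch.zeros(dim))

    def forward(self, dt):               # dt: [N] tensor of time deltas
        return torch.cos(dt.unsqueeze(-1) * self.w + self.phi)

class FixedTimeEncoder(nn.Module):
    # Phi(t) = cos(w * t) with w_i = alpha ** (-(i - 1) / beta) (Equation 2.9).
    def __init__(self, dim, alpha=1e5, beta=1e5):    # alpha, beta: placeholder hyper-parameters
        super().__init__()
        i = torch.arange(1, dim + 1, dtype=torch.float32)
        self.register_buffer("w", alpha ** (-(i - 1) / beta))

    def forward(self, dt):
        return torch.cos(dt.unsqueeze(-1) * self.w)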
2.3.2.2 Time-Aware Node Features
For each node v ∈ V, we maintain a node memory vector sv to track the node history, which is initialized
to be a zero vector. When an edge euv connecting node u and node v appears at timestamp t, two messages
are generated at node u and node v
m_u = s_u || s_v || Φ(t − t_u^−) || e_uv   (2.10)
m_v = s_v || s_u || Φ(t − t_v^−) || e_uv,   (2.11)

where Φ(·) is the time encoding, t_u^− is the timestamp when s_u was last updated, and e_uv is the edge feature.
Then, we use an update function MEM_UPDT to update the node memory of node u and node v,

s_u = MEM_UPDT(s_u, m_u)   (2.12)
s_v = MEM_UPDT(s_v, m_v).   (2.13)
The update function can be implemented using any sequence model. In TGN-attn [79], MEM_UPDT(·) is
implemented as GRU [19] cells:

r_u = σ(W_r{m_u || s_u} + b_r)   (2.14)
z_u = σ(W_z{m_u || s_u} + b_z)   (2.15)
n_u = tanh(W_n{m_u || r_u ⊙ s_u} + b_n)   (2.16)
s_u = (1 − z_u) ⊙ n_u + z_u ⊙ s_u,   (2.17)

where W and b are learnable weights and biases. Since MEM_UPDT is only called when a related
graph event occurs, the lengths of the hidden state sequences of different nodes in a dynamic graph are different.
After the node memory is computed, the input node feature vector is enhanced by

h_v^{(0)} = v_v + MLP(s_v).   (2.18)
Note that there are two problems in TGNN training caused by the node memory, the information leak
problem and the information loss problem, which we discuss in detail in Chapter 4.
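The node memory pipeline in Equations 2.10 to 2.18 can be sketched as follows in PyTorch. The class below is a simplified stand-in: the memory is stored detached between updates (systems such as TGN train the GRU through the loss of subsequent batches), the MLP in Equation 2.18 is reduced to a single linear layer, and all dimension and argument names are assumptions.

import torch
import torch.nn as nn

class NodeMemory(nn.Module):
    # Sketch of GRU-based node memory (Equations 2.10-2.18).
    def __init__(self, num_nodes, mem_dim, edge_dim, time_dim, time_enc):
        super().__init__()
        self.register_buffer("s", torch.zeros(num_nodes, mem_dim))     # node memory s_v
        self.register_buffer("last_update", torch.zeros(num_nodes))    # t_v^-
        self.time_enc = time_enc                                       # e.g., a FixedTimeEncoder
        msg_dim = 2 * mem_dim + time_dim + edge_dim
        self.cell = nn.GRUCell(msg_dim, mem_dim)                       # MEM_UPDT
        self.proj = nn.Linear(mem_dim, mem_dim)                        # stands in for the MLP in Eq. 2.18

    def update(self, u, v, e_uv, t):
        # Build the two messages (Equations 2.10-2.11) and update both endpoint memories.
        m_u = torch.cat([self.s[u], self.s[v], self.time_enc(t - self.last_update[u]), e_uv], dim=-1)
        m_v = torch.cat([self.s[v], self.s[u], self.time_enc(t - self.last_update[v]), e_uv], dim=-1)
        self.s[u] = self.cell(m_u, self.s[u]).detach()                 # Equations 2.12-2.13
        self.s[v] = self.cell(m_v, self.s[v]).detach()
        self.last_update[u] = t
        self.last_update[v] = t

    def enhance(self, v, v_feat):
        # h_v^(0) = v_v + MLP(s_v)  (Equation 2.18)
        return v_feat + self.proj(self.s[v])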
2.3.2.3 Time-Aware Aggregator
The aggregator function that gathers and aggregates information from neighboring nodes is the key
component in a TGNN layer. The first step in time-aware aggregation is to identify the set of neighboring
nodes to aggregate from, which is referred to as the temporal neighbor sampling process. Let Nv(t) be the
multiset of nodes that have past interactions with node v before timestamp t. Note that we use multiset
because, in dynamic graphs, a source node can have multiple interactions with the same destination node.
There are two types of TGNN sampling strategies: uniform and most recent temporal neighbor sampling.
In uniform neighbor sampling, the supporting nodes are uniformly selected from N_v(t). In most recent
neighbor sampling, the supporting nodes are selected from the most recent neighboring nodes in N_v(t).
Let the multiset of sampled/selected neighbor nodes of node v at time t be N_v^s(t); the AGGR function
takes input from the (enhanced) input node feature vectors {h_u^{(0)}}, the edge feature vectors {e_vu} from the
sampled/selected interactions, and their timestamps {t} for u ∈ N_v^s(t).
Inspired by GAT [98], GATv2 [7], and Transformer [97], the (temporal) attention aggregator applies the
attention mechanism to compute the output. For simplicity, we omit the subscript v when denoting the
input, output, and intermediate variables with respect to node v. The following equations in this section
are all with respect to node v. Let H_u^{(l−1)} be the matrix of the input hidden features of all the nodes in N^s(t)
stacked horizontally and E_vu be the matrix of the input edge features of all the edges in N^s(t) stacked horizontally,

q^{(l)} = W_q^{(l)}{h^{(l−1)} || Φ(0)} + b_q^{(l)}   (2.19)
K^{(l)} = W_k^{(l)}{H_u^{(l−1)} || E_vu || Φ(∆t)} + b_k^{(l)}   (2.20)
V^{(l)} = W_v^{(l)}{H_u^{(l−1)} || E_vu || Φ(∆t)} + b_v^{(l)}   (2.21)
h̄^{(l)} = Softmax( q^{(l)} K^{(l)T} / √|N_v^s(t)| ) V^{(l)}   (2.22)
h^{(l)} = ReLU(LayerNorm(h̄^{(l)})),   (2.23)

where || denotes the concatenation operation and ∆t is the vector of the time differences between the current
timestamp and the last updated time of the node memory of u ∈ N^s(t).
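A single-head, per-node version of the temporal attention aggregator in Equations 2.19 to 2.23 can be sketched as follows; the multi-head extension, masking for padded neighbors, and the exact bias terms are omitted, and the layer sizes are assumptions.

import math
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Single-head temporal attention aggregator for one target node (Equations 2.19-2.23).
    def __init__(self, node_dim, edge_dim, time_dim, out_dim, time_enc):
        super().__init__()
        self.time_enc = time_enc
        self.w_q = nn.Linear(node_dim + time_dim, out_dim)
        self.w_k = nn.Linear(node_dim + edge_dim + time_dim, out_dim)
        self.w_v = nn.Linear(node_dim + edge_dim + time_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, h_v, h_nbr, e_nbr, dt_nbr):
        # h_v: [node_dim]; h_nbr: [n, node_dim]; e_nbr: [n, edge_dim]; dt_nbr: [n] time deltas.
        q = self.w_q(torch.cat([h_v, self.time_enc(torch.zeros(1)).squeeze(0)], dim=-1))    # Eq. 2.19
        kv_in = torch.cat([h_nbr, e_nbr, self.time_enc(dt_nbr)], dim=-1)
        k = self.w_k(kv_in)                                                                 # Eq. 2.20
        v = self.w_v(kv_in)                                                                 # Eq. 2.21
        attn = torch.softmax(q @ k.T / math.sqrt(k.shape[0]), dim=-1)                       # Eq. 2.22
        return torch.relu(self.norm(attn @ v))                                              # Eq. 2.23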
The mixer aggregator [22], inspired by MLPMixer [93], combines MLPs with transpose operations to
aggregate from temporal neighbors,

H_input^{(l)} = H_u^{(l−1)} || E_vu || Φ(∆t)   (2.24)
H_token^{(l)} = H_input^{(l)} + W_token2^{(l)} GeLU(W_token1^{(l)} H_input^{(l)} + b_token1^{(l)}) + b_token2^{(l)}   (2.25)
H_channel^{(l)} = H_token^{(l)} + GeLU(LayerNorm(H_token^{(l)}) W_channel1^{(l)} + b_channel1^{(l)}) W_channel2^{(l)} + b_channel2^{(l)}   (2.26)
h^{(l)} = Mean(H_channel^{(l)}).   (2.27)
Since H_channel^{(l)} is computed by applying MLPs on the transposed H_token^{(l)}, it requires the input dimensions
of H_u^{(l−1)} to be the same for different nodes. To satisfy this requirement, the mixer aggregator applies
zero-padding for nodes with fewer temporal neighbors than the sampling budget.
2.3.2.4 Time-Aware Updater
Snapshot-based [84, 71] TGNNs apply time-aware updaters to encode the time information. In a snapshot-based TGNN, the graph events in dynamic graphs are divided into sub-sets with a certain time duration,
which are referred to as snapshots. Snapshot-based TGNNs apply sequence models on the learnable
weights in a fixed number of past snapshots or directly on the hidden features h_u to capture the temporal
relationships. The sequence models are usually RNN [81], LSTM [42], or GRU [19]. The weights and hidden
features of the target node are computed snapshot by snapshot for t ∈ T_SNAP according to time order.
Denote the output of the last iteration of the message passing process in the i-th snapshot with weights
W(i) as H(i) and the sequence model as SM; the weights and hidden features are computed by
W(i + 1) = SM(W(i), H(i)), or (2.28)
H(i + 1) = SM(H(i)). (2.29)
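A minimal sketch of a snapshot-based updater in the spirit of Equation 2.28 (EvolveGCN-O style) is shown below: a shared LSTM cell evolves a square GCN weight matrix across snapshots, treating each row of the matrix as one element of the batch. This row-as-batch trick is a common implementation choice in public EvolveGCN code, not necessarily the exact formulation of the original paper, and a single hidden dimension is assumed for brevity.

import torch
import torch.nn as nn

class EvolvingGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.cell = nn.LSTMCell(dim, dim)                       # sequence model SM over the weights
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)     # W(0)

    def forward(self, adjs, H):
        # adjs: list of normalized adjacency matrices, one per snapshot; H: [N, dim] node features.
        W, c = self.W, torch.zeros_like(self.W)
        for A in adjs:
            W, c = self.cell(W, (W, c))       # W(i+1) = SM(W(i)), i.e., Eq. 2.28 without the H(i) input
            H = torch.relu(A @ H @ W)         # one GCN propagation within the snapshot
        return H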
2.3.3 Example TGNNs
We show three widely used TGNN models as examples:
• EvolveGCN [71] (the EvolveGCN-O variant) is a 2-layer TGNN that applies a time-aware updater on
dynamic graph snapshots to capture the temporal dependencies. Within a snapshot, the architecture
of EvolveGCN is simply the static GCN [54] architecture, without any time encoding or temporal
aggregator. The temporal neighbors are sampled uniformly within each snapshot. For the updater
function, EvolveGCN uses an LSTM to regulate the weight matrices of the GCN aggregator of each
snapshot. Specifically, the weights of the GCN aggregators in one snapshot are computed by using a
shared-weight LSTM cell to update the weights of the GCN aggregators in the previous snapshot.
• TGAT [116] is a 2-layer TGNN that adopts the attention aggregator to capture the temporal dependencies. For the time encoder, TGAT uses the learnable time encoding function (Equation 2.7). For
the input node features, TGAT directly uses the provided node features without any node memory
mechanism. For the temporal neighbor sampling, TGAT samples two-hop temporal neighbors using
the uniform sampling strategy. For the aggregator, TGAT uses the attention aggregator. Since TGAT
does not rely on dynamic graph snapshots, the update function simply outputs the result of the
aggregator.
• TGN [79] is a 1-layer memory-based TGNN using the same learnable time encoding and attention
aggregator as in TGAT. However, TGN uses GRU-based node memory instead of directly using the
node features. The node memory in TGN enlarges the receptive field (how large the subgraph that
influences the output representation of a node is) by providing information about neighbors as input.
Paired with the most recent neighbor sampling strategy, TGN can recall long and far away history
with a single layer.
Other than the aforementioned time-aware message passing techniques, there also exist some TGNN
variants that adopt temporal random walk [106, 50], dictionary-type neighborhood representation [63, 73],
and collaborative filtering [27] techniques.
2.3.4 TGNN Training
For real-world dynamic graphs, labeled data for specific tasks are often scarce, expensive, and time-consuming
to obtain. In addition, even when labeled data exist, directly supervised TGNNs often achieve poor
generalizability and adaptability due to the noise and biases that widely exist in real-world data. As a result,
TGNNs are usually self-supervised by the temporal edges. Specifically, the weights of TGNNs are learned
by distinguishing positive edges from negative edges. Given an edge e representing an interaction from node
u to node v at timestamp t, let u be the source node and v be the positive destination node. We first
randomly sample a node w as the negative destination node. Then, we apply an MLP-based classifier that
maps the dynamic node embeddings to probabilities,
P_uv = σ(W_out ReLU(W_src h_u^{(L)} + b_src + W_dst h_v^{(L)} + b_dst) + b_out)   (2.30)
P_uw = σ(W_out ReLU(W_src h_u^{(L)} + b_src + W_dst h_w^{(L)} + b_dst) + b_out).   (2.31)

After that, we compute the binary cross entropy loss and perform backward propagation and weight updates

L_uv = −(log(P_uv) + log(1 − P_uw)).   (2.32)
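The decoder and loss above translate directly into code. The sketch below assumes one negative destination per positive edge and adds a small epsilon inside the logarithms for numerical stability; both choices are illustrative.

import torch
import torch.nn as nn

class EdgePredictor(nn.Module):
    # MLP edge decoder of Equations 2.30-2.31.
    def __init__(self, dim):
        super().__init__()
        self.src = nn.Linear(dim, dim)
        self.dst = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, h_src, h_dst):
        return torch.sigmoid(self.out(torch.relu(self.src(h_src) + self.dst(h_dst)))).squeeze(-1)

def link_prediction_loss(predictor, h_u, h_v, h_w, eps=1e-9):
    # h_u: source embeddings; h_v: positive destinations; h_w: random negative destinations.
    p_uv = predictor(h_u, h_v)
    p_uw = predictor(h_u, h_w)
    return -(torch.log(p_uv + eps) + torch.log(1 - p_uw + eps)).mean()   # Equation 2.32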
2.4 Target Hardware Platforms
In this dissertation, we use GPUs (single GPU, multiple GPUs in a single machine, and multiple GPUs in
distributed systems) to perform TGNN training and FPGAs to perform TGNN inference.
Figure 2.1: Architecture of the NVIDIA H100 GPU [68].
2.4.1 GPU
GPUs are designed to handle a vast number of simultaneous calculations efficiently with massive Single
Instruction Multiple Data (SIMD) parallelism. Figure 2.1 shows the architecture of the NVIDIA H100 GPU, featuring
multiple Streaming Multiprocessors (SMs), 80GB High Bandwidth Memory 3 (HBM3) VRAM for fast and
efficient data access, and 50MB L2 cache shared across all SMs. Each SM contains several key components:
Register Files (RF), shared memory, and L1 data cache. The RFs are critical for storing variables used by
threads. Each thread has access to its own 1 KB register file, which allows for rapid data retrieval and storage,
minimizing delays caused by memory access. The shared memory within each SM acts as a user-managed
cache, providing a fast storage space that all threads within the same block can use. This shared memory
is essential for enabling efficient communication and data sharing among threads, which is particularly
beneficial for parallel processing tasks that require frequent data exchanges. The L1 data cache, along with
the shared memory, is localized within each SM. The combined 256 KB of L1 cache and shared memory per
SM allows for efficient data access patterns, reducing the need for frequent access to slower global memory.
For program mapping, threads are the smallest unit of execution in the CUDA (Compute Unified Device
Architecture) programming model used by NVIDIA GPUs. Threads are grouped into thread blocks, which
Figure 2.2: Architecture of the AMD Alveo U280 FPGA [2].
are in turn organized into a grid. Each thread block is executed on a single SM, enabling localized data
sharing through the SM’s shared memory. The hierarchical memory structure in NVIDIA GPUs is designed
to maximize performance and minimize latency. At the thread level, each thread has its own register file,
providing the fastest data access. Moving up the hierarchy, threads within the same block share the SM’s
shared memory, which is slower than the register file but faster than the L1 and L2 caches. At the top of
the memory hierarchy, the global HBM3 memory is accessible by all threads but has the highest latency.
The L2 cache acts as an intermediary storage level between the global memory and the SMs, improving
data access speeds for frequently used data across different SMs.
Multiple GPUs, either within a single machine or across distributed machines, can be connected through the
NVLink/NVSwitch technology, which facilitates high-speed communication between them. This interconnect technology is essential for scaling computational tasks across several GPUs, enabling them to
work together on large-scale problems effectively. NVLink/NVSwitch provides a high-bandwidth data
pathway, ensuring that data can be shared quickly between GPUs without becoming a bottleneck.
2.4.2 FPGA
Field Programmable Gate Arrays (FPGAs) are reconfigurable integrated circuits that can be programmed
post-manufacturing to perform specific tasks. They offer high flexibility and parallelism, making them
ideal for diverse applications such as signal processing, data center acceleration, and custom hardware
implementations in areas like telecommunications and embedded systems. Figure 2.2 shows the architecture
of the AMD Alveo U280 FPGA, which integrates multiple high-bandwidth memory (HBM) modules,
extensive connectivity options, and dedicated controllers, ensuring high performance and flexibility. The
key components are:
• High Bandwidth Memory (HBM): The FPGA features two HBM modules, each with a capacity of
4 GB, summing up to 8 GB of high-speed memory. HBM technology significantly reduces latency
and increases bandwidth compared to traditional DDR memory, making it ideal for data-intensive
applications. The HBM modules are directly connected to the FPGA fabric, ensuring rapid data access
and processing.
• DDR Memory Interface: In addition to HBM, the U280 also supports traditional DDR memory,
providing additional storage capacity and flexibility. The DDR interface ensures compatibility with a
wide range of memory types, facilitating diverse application requirements.
• Quad Small Form-factor Pluggable (QSFP): The QSFP interface offers high-speed connectivity, typically used for network connections. It supports data transfer rates up to 100 Gbps, making it
suitable for applications that require high-bandwidth network communication, such as data center
interconnects and high-frequency trading systems.
• Flash Memory: The FPGA includes flash memory, which is utilized for storing configuration data
and firmware. Flash memory ensures that the FPGA can quickly load and initialize its configuration
upon power-up, reducing downtime and enhancing reliability.
• Clock Management: Efficient clock management is crucial for synchronizing various operations
within the FPGA. The clock interface in the U280 manages multiple clock sources, ensuring precise
timing and synchronization across different components and processes within the FPGA.
• PCIe Interface: The U280 FPGA connects to the host system via a PCIe x16 interface. PCIe (Peripheral
Component Interconnect Express) ensures high-speed data transfer between the FPGA and the host
system, supporting a wide range of applications that demand rapid data processing and low-latency
communication.
• Super Logic Region (SLR): The U280 FPGA contains three SLRs, which are high-density areas containing logic resources, memory blocks, and DSP slices. The logic resources are the fundamental
building blocks used to implement combinational and sequential logic. These include look-up tables
(LUTs), flip-flops, and multiplexers. The memory blocks provide on-chip storage, such as block RAM
(BRAM) and distributed RAM, enabling efficient data buffering and storage for various applications.
The DSP slices are specialized units designed for high-speed arithmetic operations, including multiplication, addition, and accumulation, essential for digital signal processing and complex mathematical
computations.
Chapter 3
TASER: Temporal Adaptive Sampling for Noise-Resistant TGNNs
3.1 Abstract
Recently, Temporal Graph Neural Networks (TGNNs) have demonstrated state-of-the-art performance
in various high-impact applications, including fraud detection and content recommendation. Despite the
success of TGNNs, they are prone to the prevalent noise found in real-world dynamic graphs like time-deprecated links and skewed interaction distribution. The noise causes two critical issues that significantly
compromise the accuracy of TGNNs: (1) models are supervised by inferior interactions, and (2) noisy input
induces high variance in the aggregated messages. However, current TGNN denoising techniques do not
consider the diverse and dynamic noise pattern of each node. In addition, they also suffer from the excessive
mini-batch generation overheads caused by traversing more neighbors. We believe the remedy for fast and
accurate TGNNs lies in temporal adaptive sampling. In this Chapter, we propose TASER, the first adaptive
sampling method for TGNNs optimized for accuracy, efficiency, and scalability. TASER adapts its mini-batch
selection based on training dynamics and temporal neighbor selection based on the contextual, structural,
and temporal properties of past interactions. To alleviate the bottleneck in mini-batch generation, TASER
implements a pure GPU-based temporal neighbor finder and a dedicated GPU feature cache. We evaluate
the performance of TASER using two state-of-the-art backbone TGNNs. On five popular datasets, TASER
outperforms the corresponding baselines by an average of 2.3% in Mean Reciprocal Rank (MRR) while
achieving an average of 5.1× speedup in training time.
3.2 Noise in Dynamic Graphs
Graph Neural Networks (GNNs), both static and temporal, recursively gather and aggregate information
from neighboring nodes to generate node embeddings. To reduce the high computation and memory
footprint, neighbor sampling approaches [38, 122, 14, 133] are widely used to alleviate the exponentially
growing neighbor size with respect to the number of GNN layers. However, most sampling methods
approximate full neighborhood training using a static distribution, which is agnostic to the node/edge
features, model architecture, and task performance. These sampling policies are vulnerable to noise since
they cannot distinguish between relevant and irrelevant neighbors, leading to a large sampling variance.
To address these issues, researchers have designed adaptive sampling methods [47, 62, 120, 123], where
the sampling distribution is node-dependent and guided by the task performance. On static graphs, these
methods, which come with theoretical guarantees for variance reduction and are adaptive to performance,
can generate high-quality and robust node embeddings.
Dynamic graphs, much like their static counterparts, are not immune to the presence of noise, which
adds false and irrelevant information to the graph signals. Specifically, dynamic graphs introduce two
distinctive types of noise: (1) Deprecated links. Dynamic graphs accumulate an increasing number of
interactions over time. Some old interactions could be irrelevant or even convey information that contradicts
the current node status. (2) Skewed neighborhood distribution. The distribution of interactions among
different nodes in a dynamic graph often exhibits significant disparities or sparsity. Unlike static graphs,
temporal graphs have many repeated edges between the same two nodes at different timestamps. A
long-standing node may exhibit a skewed distribution of neighbors, while an emerging node may have
very few neighbors. For example, deprecated links can be observed in a social network graph when a
person relocates to another country, rendering the previous connections gradually less informative or even
incorrect. Skewness becomes evident when an individual engages in daily conversations with their best
friend while sending only a single message to a car dealer. The noise in dynamic graphs causes two critical
issues that significantly impair the accuracy of TGNNs. Firstly, when performing self-supervised training
with the link prediction task, inferior interactions are used as positive links. Secondly, the noise is amplified in the
iterative message-passing process, leading to high variance in the output embeddings.
To improve the performance of TGNNs on dynamic graphs with temporal and structural noise, researchers have proposed denoising techniques based on edge dropping [92, 59] and implemented human-defined heuristics [116, 79] in non-adaptive TGNN samplers. However, these approaches require extensive
tuning and often achieve worse performance since they assume the whole graph follows the same noise
pattern and ignore the differences in different nodes at different timestamps. For example, TGAT [116]
employed the inverse timespan sampler to solve the deprecated links problem, which samples past neighbors with probabilities inversely proportional to their time deltas, but found that it performs worse than
the original uniform sampler. Adaptive sampling, on the other hand, could learn customized sampling
probabilities, encompassing any human-defined heuristics that may exist since the learnable sampler
considers not only dynamic graph information but also training dynamics and task performance. Therefore,
we believe that adaptive sampling is integral to any approach addressing the noise problem in TGNNs,
given its comprehensive consideration of all available information sources when estimating personalized
neighborhood sampling probability distributions.
Despite the urgent need for adaptive sampling in TGNNs, it is notably challenging to construct an
efficient and reliable solution. We identify the three main challenges regarding noise as follows: (1) To
capture the dynamics in Temporal Graphs, the adaptive sampler should project not only structural and
contextual information into sampling probabilities but also the time and frequency of the interactions. (2)
Existing adaptive sampling methods only support co-training with simple and static GNN aggregators
[Figure 3.1 contains two panels, one for Wikipedia and one for Reddit, plotting per-epoch training time against the number of neighbors per layer.]
Figure 3.1: Runtime (per epoch) breakdown for TGAT with different numbers of neighbors per layer. Prep.
refers to the mini-batch generation time (neighbor finding, feature slicing, and CPU-GPU data transferring),
while Prop. refers to the propagation time (forward and backward propagation).
and cannot be generalized to particularly complex temporal aggregators. (3) Adaptive samplers require
traversing a large and time-restricted neighborhood, resulting in enormous training time, especially when
scaling to large-scale datasets. Specifically, when the number of traversed neighbors increases, the mini-batch generation overheads (i.e., temporal neighbor finding and feature slicing with CPU-GPU data transfer)
lead to an order-of-magnitude increase in the training time. Figure 3.1 shows the runtime breakdown of
TGAT when the receptive field increases. On both datasets, the mini-batch generation time dominates the
training time. Besides, adaptive sampling requires encoding node/edge features with learnable weights.
Due to this compute-intensive nature, achieving fast adaptive sampling necessitates training on the GPU as
well as specific GPU optimizations to alleviate the mini-batch generation bottleneck.
To overcome the aforementioned challenges, we propose TASER, the first efficient and scalable adaptive
sampling method for TGNNs. TASER provides a general solution for adaptive sampling in TGNNs and
supports most TGNNs designed for Continuous Time Dynamic Graphs (CTDGs). To mitigate the mini-batch
generation bottleneck, TASER implements a pure GPU-based temporal neighbor finder and a dedicated
GPU feature cache. Our main contributions are:
• We propose a novel two-fold temporal adaptive sampling technique: temporal adaptive mini-batch selection and temporal adaptive neighbor sampling. Temporal adaptive mini-batch selection
selects high-quality training samples, while temporal adaptive neighbor sampling selects high-quality
supporting neighbors.
• We implement the first GPU neighbor finder for dynamic graphs, which is optimized for the massive
Single Instruction Multiple Data (SIMD) GPU architecture. Compared with a state-of-the-art CPU
neighbor finder, our GPU neighbor finder supports arbitrary training order while achieving an
average speedup of 46×.
• We design a dynamic GPU cache to speed up the feature-slicing process for large-scale datasets
that cannot be fully stored in GPU VRAM. Our cache replacement policy achieves near-optimal
performance and requires minimal maintenance overhead.
• In the experiments, we implement TASER on two state-of-the-art backbone TGNNs. On five popular
datasets, TASER outperforms the corresponding baselines by an average of 2.3% in MRR. Our GPU
neighbor finder and 20% GPU feature cache achieve an average of 5.1× speedup in the total training
time.
3.3 Related Works
The recent success of practical GNN applications is attributed to their ability to quickly and accurately
learn graph representations. Techniques such as graph denoising and GPU accelerations enable GNNs to
efficiently process noisy real-world data at large scales.
3.3.1 Dynamic Graph Denoising
Existing dynamic graph denoising techniques are mainly based on dynamic graph sparsification. TGAT [116]
proposes a heuristics sampling policy based on the probability of inversed timespan. TGN [79] further
improves the timespan inversed sampling by sampling the most recent neighbors. To avoid redundancy,
TNS [107] proposes to insert learnable spaces in the most recent neighbors. STEP [59] proposes an
unsupervised graph pruning method that drops the noisy interactions. TGAC [12] devises dynamic graph
augmentation techniques for contrastive TGNN learning, which measures the edge sample probability by
computing the PageRank or eigenvector centrality of the nodes at both ends. However, all these methods do not consider
the disparity of noise among different nodes at different times.
3.3.2 Adaptive Mini-Batch Selection
Adaptive mini-batch selection, commonly referred to as adaptive importance sampling, constantly re-evaluates the relative importance of each training sample during training. The main idea behind these
methods is to use gradient information to reduce variance in uniformly stochastic gradients in order to
improve convergence. GRAD [132] relies on both features and logits for solving least-squares problems.
[89] proposed a general algorithm with efficient computation to speed up coordinate-descent and SGD.
MVS [21] extends these methods to GNNs, considering both the variance introduced by mini-batch selection
and neighbor sampling. In contrast to reducing variance and speeding up optimization, TGNN training
requires avoiding selecting noisy interactions as positive samples.
3.3.3 Adaptive Neighbor Sampling
As one of the graph denoising techniques, adaptive neighbor sampling methods learn a sample probability
distribution for each neighboring node of a given target node. AS-GCN [47] minimizes the GCN sampling
variance by training a self-dependent function based on node features. Bandit Sampling [62] formulates
the variance reduction for adaptive sampling as an adversarial bandit problem, and Thanos [123] further
proposes a biased reward function to avoid instability. In contrast to variance reduction, PASS [120]
directly optimizes task performance by approximating gradient propagation through the non-differentiable
sampling operation of GCN. To scale to large graphs, PASS adopts a two-step sampling approach, which first
samples a fixed scope and then adaptively samples the neighbors within the scope. However, these adaptive
sampling methods cannot capture the temporal information of dynamic graphs and are not compatible
with temporal aggregators.
3.3.4 Neighbor Finding
Optimized GPU graph neighbor finders could achieve orders-of-magnitude higher throughput compared
with CPU neighbor finders by leveraging the massive SIMD architecture and avoiding the data transfer
overheads from CPU to GPU. DGL [103] provides an easy-to-use GPU neighbor finder with unified virtual
memory access support, demonstrating a speedup of 1.5× to 3.9× in total training time compared to the
pipeline, which samples on the CPU and trains on the GPU. Quiver [91] further proposes a workload-based
scheduler that dynamically assigns tasks to the CPU and GPU to solve the imbalanced workload problem
due to the unpredictable latency when working on sparse nodes. Biased (weighted) neighbor finding
based on inverse transformation sampling [70], rejection sampling [49], and alias method [104] are also
well studied on GPUs. However, they do not work on dynamic graphs and cannot be used for temporal
neighborhood sampling. TGL [131] proposes the T-CSR data structure and a parallelized neighbor finder
optimized for dynamic graphs on multi-core CPUs. Its key limitation is the reliance on pointer arrays
for rapidly locating candidate temporal neighbors, which requires scheduling the training mini-batches
chronologically. Besides, Tea [44] is a state-of-the-art general-purpose CPU random walk engine for biased
neighbor finding on dynamic graphs. However, it does not support high-dimensional feature transformation,
which adaptive sampling requires.
3.3.5 Graph Feature Caching
The neighbor explosion problem [38] causes an enormous number of memory operations to fetch the node
and edge features. On large graphs whose entire node and edge feature matrices cannot be stored in GPU
VRAM, the CPU-GPU feature slicing and loading process easily becomes the bottleneck during training.
GNS [26] addresses this issue by periodically selecting a global set of nodes for all mini-batches and caching
their features on GPU. Data Tiering [64] uses reverse PageRank to predict the access probability of each
node. Quiver [91] further proposes a connectivity-aware node feature caching strategy that considers the
probability of a node being sampled as a multi-hop neighbor. However, these approaches are designed for
the memory access pattern of static GNNs and do not consider temporal information.
3.4 Approach
In this section, we present TASER, a high-performance temporal adaptive sampling method for TGNNs. An
overall illustration of one mini-batch training for TASER on a 1-layer TGNN is shown in Figure 3.2. First,
we maintain an importance score for each training sample, enabling the adaptive selection of a batch of
high-quality samples in each step. Next, we adopt the bi-level neighbor sampling scheme used in PASS [120]
to improve the performance on large graphs. Initially, a GPU temporal neighbor finder samples a set of
candidate neighbors from the temporal neighborhood N (v, t) using a static policy. Then, we slice features
from both the GPU cache and CPU memory, where the GPU cache is updated at the end of every epoch.
Following this, a parameterized temporal adaptive neighbor sampler is applied to sample a fixed-size set of
informative supporting neighbors from the pre-sampled neighborhood. Finally, the TGNN model is trained
on the representative node samples with their denoised supporting neighborhoods. We further update the
sample importance score P(v) and the sampler’s parameter θ during forward and backward propagation,
respectively.
Figure 3.2: One training iteration of TASER on a one-layer TGNN. (a) Randomly select a set of mini-batch
samples based on the pre-computed importance score P proportional to the logits (temporal adaptive
mini-batch selection). (b) Sample a subset of neighbors from the temporal neighborhood using our GPU
temporal neighbor finder. (c) Slice the features of sampled neighbors from the VRAM cache and RAM. (d)
Apply temporal adaptive neighbor sampling (parameterized by θ) to sub-sample the supporting neighbors
for TGNN by encoding timestamps, frequencies, and identities along with features. (e) Perform forward
and backward propagation. Update the importance score P for adaptive mini-batch selection and back-propagate through the model loss and sample loss to train the TGNN model and temporal adaptive sampler.
The rest of the section is arranged as follows. We first propose our two-fold adaptive sampling
technique regarding the mini-batch sample selection in Section 3.4.1 and the supporting neighbor sampling
in Section 3.4.2. Then, we propose the pure-GPU neighbor finder in Section 3.4.3 and the dynamic GPU
cache in Section 3.4.4.
3.4.1 Temporal Adaptive Mini-batch Selection
In order to capture the pattern of node states changing over time, TGNNs are trained on interactions that
cover the entire training set. Unlike training GNNs on static graphs, where the models only recover the
final states of different nodes, TGNNs need to recover different states for the same node during training.
However, learning to recover deprecated or cold-start states may significantly impair the accuracy of
TGNNs. To reduce the noise present in the training samples, we propose a temporal adaptive mini-batch
selection method that utilizes the dynamic model predictions to sample high-quality training edges.
We first recall the original mini-batch SGD training process of TGNNs. Given a training set E_train, we
chronologically sample a subset of edges E_B ⊆ E_train for each batch. For each training edge e(v1, v2, t) ∈ E_B,
we set its label y_e = 1 and randomly sample a destination node v'_2 ∈ V to form a negative edge e'(v1, v'_2, t)
with label y_e' = 0. During the forward propagation, we derive the logits ŷ_e and ŷ_e' for each positive and
negative edge, respectively. After that, the model loss is computed as follows:

L_model = −(1/|E_B|) Σ_{e∈E_B} [ϕ(ŷ_e, y_e) + ϕ(ŷ_e', y_e')],   (3.1)
where ϕ stands for the cross entropy loss function.
Unlike the original method that chronologically picks mini-batch samples, TASER adaptively selects
mini-batch samples based on training dynamics. First, we maintain a list of importance scores P ∈ R^{|E_train|}
to evaluate the noisiness of each edge sample, which is initialized uniformly. Then, we randomly sample
a batch of training edges E_B with probability proportional to the corresponding importance scores. As
shown in Figure 3.2 (e), after the forward propagation, TASER updates the importance score P(e) for every
positive sample e(v1, v2, t) ∈ E_B with

P(e) = sigmoid(ŷ_e) + γ,   (3.2)
where γ is a hyperparameter representing the magnitude of a uniform distribution mixed with the adaptive
sample importance distribution.
Since dynamic graphs tend to be significantly noisier than static graphs, selecting positive samples with
high confidence yˆe can effectively improve accuracy. Given the cross entropy loss function, the gradient of
the loss with respect to logits is inversely proportional to the logits. When the gradient update of a sample
is large, it indicates that the sample is informative, but it is also more likely to be an outlier in the data
distribution. To balance the noisiness and diversity in training samples, we can adjust the value of γ. A
larger value of γ makes the selector prone to sample noisier samples to amplify training. In practice, we
found that γ = 0.1 works well on all the datasets.
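The selection and score-update loop can be sketched as follows; the class name and the decision to detach the logits before updating the scores are assumptions of this sketch.

import torch

class AdaptiveBatchSelector:
    # Temporal adaptive mini-batch selection: scores follow Equation 3.2.
    def __init__(self, num_train_edges, gamma=0.1):
        self.scores = torch.ones(num_train_edges)     # importance scores P, uniform at initialization
        self.gamma = gamma

    def sample_batch(self, batch_size):
        # Sample training edges with probability proportional to their importance scores.
        return torch.multinomial(self.scores, batch_size, replacement=False)

    def update(self, edge_ids, logits):
        # P(e) = sigmoid(y_hat_e) + gamma, using the logits of the positive samples.
        self.scores[edge_ids] = torch.sigmoid(logits.detach()) + self.gamma

In each iteration, sample_batch replaces the chronological batch order, and update is called right after the forward pass of the TGNN.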
3.4.2 Temporal Adaptive Neighbor Sampling
Existing adaptive neighbor samplers only support attributed static graphs, which do not consider time
restrictions and fail to distinguish recurring interactions in the temporal neighborhood. They are specifically
designed for a single type of aggregator and cannot achieve high accuracy when extended to other aggregators. To address these issues, we propose a general encoder-decoder scheme that is suitable for various graph
data and temporal aggregators. Fig. 3.2 (d) illustrates the forward propagation of our temporal adaptive
neighbor sampler. Formally, given a temporal neighborhood N_s(v_i, t_0), TASER adaptively computes the
sample policy q_θ(u_j, t_k | v_i, t_0) that estimates the probability of sampling neighbor (u_j, t_k) ∈ N_s(v_i, t_0)
given node v_i at time t_0.
Neighbor Encoder. Auxiliary information must be incorporated to discriminate the unique noise patterns
in dynamic graphs, including outdated or redundant interactions. To generate a time-aware sample
policy, we use the fixed time-encoding function TE(∆t) proposed by GraphMixer [22] to encode temporal
information for each neighbor. TE(∆t) maps the relative timespan from the continuous time domain to a
d_time-dimensional vector space. Besides being aware of time, the sampler also needs to distinguish
the reappearance of neighboring nodes to differentiate redundant neighbors. We propose the frequency
encoding by leveraging the sinusoidal encoding [97]:
FE(freq(u), 2i) = sin(freq(u)/10000^{2i/d_freq})
FE(freq(u), 2i−1) = cos(freq(u)/10000^{2i/d_freq}),   (3.3)
where freq(u) denotes the frequency of a neighbor node u appearing in the neighborhood N_s(v, t_0). Since
the frequency is discrete and takes a limited range of values, we choose the positional encoding (i.e., sinusoidal
encoding) instead of the time encoding to encode frequency. However, when two nodes exhibit the same
appearance frequency, the sampler remains unable to distinguish between them. To address this limitation,
we propose the identity encoding IE(u_j). Given a sorted neighbor list {(u_1, t_1), (u_2, t_2), ..., (u_{b_N}, t_{b_N})} of
N_s(v, t_0), where t_0 > t_1 > ... > t_{b_N}, we define the identity encoding for each neighbor as:

IE(u_j, i) = 1(u_j = u_i),  i = 1, 2, ..., b_N.   (3.4)
In addition to these three encodings, we incorporate the contextual information of nodes and edges. For
each neighbor (u, t) ∈ N_s(v, t_0), we align the dimensions of the node feature x_u and the edge feature x_vut to d_feat:

h(u) = GeLU(W_n x_u)   (3.5)
h(v,u,t) = GeLU(W_e x_vut).   (3.6)

Finally, we concatenate all the encodings as well as the features to derive the neighbor embedding as the input
of the decoder:

z(u,t) = {h(u) || h(v,u,t) || TE(∆t) || FE(freq(u)) || IE(u)}.   (3.7)

To ensure a balanced impact from various information sources, we set the dimensions to d_feat = d_time = d_freq
across all datasets. The dimension of the neighbor embedding z(u,t) is denoted as d_enc.
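A sketch of the neighbor encoder for a single neighborhood of size b_N is given below. The sinusoidal frequency encoding concatenates the sine and cosine halves instead of interleaving the indices of Equation 3.3, and the frequency and identity encodings are recomputed from the neighbor identifiers; both are implementation choices of this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeighborEncoder(nn.Module):
    # TASER neighbor encoder (Equations 3.3-3.7) for one neighborhood of size b_N.
    def __init__(self, node_dim, edge_dim, d_feat, d_freq, time_enc):
        super().__init__()
        self.time_enc = time_enc                     # fixed time encoding TE
        self.w_n = nn.Linear(node_dim, d_feat)
        self.w_e = nn.Linear(edge_dim, d_feat)
        self.d_freq = d_freq

    def freq_encode(self, freq):
        # Sinusoidal frequency encoding FE (Equation 3.3); freq: [b_N] integer appearance counts.
        i = torch.arange(self.d_freq // 2, dtype=torch.float32)
        div = 10000.0 ** (2 * i / self.d_freq)
        ang = freq.float().unsqueeze(-1) / div
        return torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)

    def forward(self, x_nbr, x_edge, dt, nbr_ids):
        # x_nbr: [b_N, node_dim]; x_edge: [b_N, edge_dim]; dt: [b_N]; nbr_ids: [b_N] node identifiers.
        h_u = F.gelu(self.w_n(x_nbr))                                    # Equation 3.5
        h_e = F.gelu(self.w_e(x_edge))                                   # Equation 3.6
        te = self.time_enc(dt)
        same = (nbr_ids.unsqueeze(0) == nbr_ids.unsqueeze(1))
        fe = self.freq_encode(same.sum(-1))                              # frequency encoding
        ie = same.float()                                                # identity encoding (Equation 3.4)
        return torch.cat([h_u, h_e, te, fe, ie], dim=-1)                 # Equation 3.7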
Neighbor Decoder. After we encode the unique characteristics of dynamic graphs into the neighbor
embedding z(u,t), the subsequent processes can be modeled as a general adaptive neighbor sampling
problem. For simplicity, we omit the timestamps and also do not differentiate the recurrence of interactions
in a neighborhood. The goal of the neighbor decoder is to generate a customized neighborhood importance
distribution q(u|v) for each neighborhood N_s(v). Given that dynamic graphs often lack node features,
rather than learning an exact pair-wise importance score for aggregation as GAT [98], we are more interested
in estimating the relative importance of a node within the neighborhood q(u | {u', u' ∈ N_s(v)}). Therefore,
we use a 1-layer MLP-Mixer [93] to first transform the hidden embedding dimension and then transform
the neighbor dimension for each neighborhood:

Z_{N_s(v)} = MLP-Mixer({z_{u_1}, z_{u_2}, ..., z_{u_{b_N}}}),   (3.8)

where Z_{N_s(v)} ∈ R^{b_N × d_enc}. In doing so, the neighbor embedding not only depends on the global transformation but also captures the neighborhood correlations. To coordinate with different temporal aggregators,
our neighbor decoder supports various predictors [98, 8, 97], including

q_linear(u|v) = σ_u(w_l Z_{N_s(v)})   (3.9)
q_gat(u|v) = σ_u(LeakyReLU(a^T · [W_g z_u ∥ W_g z_v]))   (3.10)
q_gatv2(u|v) = σ_u(a^T LeakyReLU(W_g2 · [z_u ∥ z_v]))   (3.11)
q_trans(u|v) = σ_u((W_t z_v)(W'_t Z_{N_s(v)})^T / √b_N),   (3.12)

where σ is the softmax function. For the target node embedding z_v, we concatenate the node feature (if it
exists) with the zero time encoding and the one frequency encoding:

z_v = {h(v) || TE(0) || FE(1)}.   (3.13)
Empirically, we observed that this target node embedding works well even without node features.
Co-Training with Temporal Aggregators. Figure 3.2 (e) demonstrates a temporal aggregator that
combines the messages from sampled neighbors during the forward pass and subsequently back-propagates
based on L_model. However, the parameters of the sampling policy q_θ(u|v) cannot be directly updated through
back-propagation since the sampling process is non-differentiable. We need to construct an auxiliary loss
L_sample to update θ. For an arbitrary temporal aggregator with neighbors sampled following q_θ(·|v), we
can rewrite the forward propagation in the form of:

h_{v_i}^{(l)} = g^{(l)}({E_{q_θ(u_j|v_i)}[f(v_i, u_j)], f ∈ H^{(l)}}),   (3.14)
Algorithm 1: TASER Training: One Iteration
Require: a mini-batch of labeled edges {e_k, y_k}_{k=1}^b, neighbor finding budget m, neighbor sampling budget n, L-layer TGNN model f, adaptive neighbor sampler q(j|i)
Ensure: updated TGNN model and adaptive neighbor sampler
1  V_act ← all nodes in {e_k}_{k=1}^b;
2  G ← empty supporting neighbor set;
3  for l ← L to 1 do
4      for (v_i, t) ∈ V_act do
5          N_s ← {u_j}_{j=1}^m sampled from N(v_i, t);
6          N'_s ← {u_j}_{j=1}^n sampled from N_s with q(j|i);
7          update V_act and G[s] with N'_s;
8      end for
9  end for
10 L_model ← loss({f(e_k, G[k]), y_k}_{k=1}^b);
11 update f by back-propagating L_model;
12 L_sample ← construct loss following Eq. (3.17) or Eq. (3.18);
13 update q(j|i) by back-propagating L_sample;
14 return f and q(j|i);
where g^{(l)} and H^{(l)} are functions defined by the temporal aggregator at the l-th layer. This implies that every
appearance of the expectation should be considered when calculating ∇_θ h_{v_i}^{(l)}. For simplicity, we concisely
denote q_θ(u_j|v_i) as q_θ(u_j) and f(v_i, u_j) as f(u_j) below. We can then approximate each ∇_θ E_{q_θ(u_j)}[f(u_j)]
using the log-derivative trick [112] with n Monte Carlo samples {u_j ∼ q_θ(u_j)}_{j=1}^n:

∇_θ E_{q_θ(u_j)}[f(u_j)] ≈ (1/n) Σ_{j=1}^n ∇_θ log q_θ(u_j) f(u_j).   (3.15)
Next, we show how to calculate ∇_θ L_model when co-training with temporal aggregators. For the TGAT
aggregator, denote a(v_i, u_j) as the unnormalized attention score and τ_{i,j} = e^{a(v_i, u_j)}. The TGAT aggregator
can be transformed into the form of Eq. (3.14) as:

h_{v_i}^{(l)} = E_{q_θ(u_j)}[f_1(u_j)] / E_{q_θ(u_j)}[f_2(u_j)],   (3.16)

where f_1(u_j) = τ_{i,j}[V^{(l)}]_j and f_2(u_j) = τ_{i,j}. According to the chain rule and Eq. (3.15),

∇_θ L_model ≈ (dL_model / dh_{v_i}^{(l)}) · (1/(λ³n)) Σ_{j=1}^n τ_{i,j}[V^{(l)}]_j ∇_θ log q_θ(u_j)
            + (dL_model / dh_{v_i}^{(l)}) · (µ/(λ⁴n)) Σ_{j=1}^n τ_{i,j} ∇_θ log q_θ(u_j),   (3.17)

where λ = Σ_{j=1}^n τ_{i,j} and µ = Σ_{j=1}^n τ_{i,j}[V^{(l)}]_j. Similarly, for the GraphMixer aggregator,
∇_θ L_model = (dL_model / dh_{v_i}^{(l)}) · (1/n) { Σ_{j=1}^n w'_{jk} µ_{jk} ∇_θ log q_θ(u_j) }_{k=1}^{d_enc},   (3.18)

where µ_{jk} = w_k^T h_{u_j}^{(l−1)} and u_j ∼ q_θ(u_j).
Based on Eq. (3.17) and Eq. (3.18), we can construct the sample loss L_sample by freezing the terms except
for the log probability log q_θ(u_j) and leveraging the autograd mechanism built into deep learning frameworks
to update θ. Algorithm 1 shows one iteration of TASER training on an L-layer TGNN.
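The freezing step amounts to detaching every coefficient and multiplying it with the differentiable log-probabilities, as in the following generic sketch; the coefficient tensor is whatever per-neighbor weighting Eq. (3.17) or Eq. (3.18) prescribes, assembled outside this function, and the names are assumptions.

import torch

def build_sample_loss(coeffs, log_q):
    # coeffs: per-neighbor weights assembled from dL_model/dh, tau_{i,j}, [V]_j, etc.
    # log_q: differentiable log-probabilities log q_theta(u_j) from the neighbor decoder.
    # Detaching coeffs ensures that back-propagation only updates the sampler parameters theta.
    return (coeffs.detach() * log_q).sum()

# Illustrative use (names are assumptions):
#   log_q = torch.log_softmax(decoder_scores, dim=-1)
#   loss_sample = build_sample_loss(coeffs, log_q)
#   loss_sample.backward()        # gradients flow only into the adaptive sampler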
Remark. (Adaptive sampling vs. Attention) Contrary to the bottom-up approaches of graph attention, our
temporal adaptive sampler computes in a top-down manner, which does not require any hidden features h_u^{(l')}
when generating sampling probabilities for nodes at layer l (l' < l). Under a fixed-size scope, it reduces the
computational complexity exponentially w.r.t. the number of layers.
3.4.3 GPU Temporal Neighbor Finding
Neighbor finding on dynamic graphs is complex as the neighborhood N(v, t) for each node varies over time.
To rapidly identify the candidate neighbor set, we store dynamic graphs in the T-CSR data structure [131],
which sorts the outgoing neighbors according to their timestamps. As shown in Algorithm 2, we employ
a block-centric parallel sampling design to leverage the hierarchical GPU architecture. Specifically, each
target node is allocated a thread block, and each thread inside the block is assigned to sample a neighbor for
Algorithm 2: GPU Temporal Neighbor Finding
Input: target nodes {(v_i, t'_i)}_{i=1}^b, neighbor budget m, T-CSR graph G
Output: sampled neighbors neigh
1  for block i ← 1 to b do in parallel
2      for thread j ← 1 to m do in parallel
3          if j = 1 then
4              {(u_k, t_k)}_{k=1}^{d_{v_i}} ← G[v_i];
5              p ← BinarySearch({t_1, ..., t_{d_{v_i}}}, t'_i);
6          end if
7          SyncThreads();
8          if most recent neighbor finding then
9              neigh[i][j] ← (u_{p−j}, t_{p−j});
10         else if uniform neighbor finding then
11             initialize bitmap M;
12             SyncThreads();
13             keep randomly selecting r ∈ [1, p) until CheckBitmap(r, M) = False;
14             neigh[i][j] ← (u_r, t_r);
15         end
16     end
the target node. We first identify the pivot pointer in each neighborhood using binary search with a single
thread and then use the shared memory bitmap [70] for collision detection in uniform sampling without
replacement. After each thread selects a neighbor, an atomic compare-and-update operation is performed
to detect whether this neighbor has been selected. This block-centric design has three major benefits.
Firstly, self-supervised TGNN training with TASER necessitates a large number of supporting neighbor
candidates for thousands of mini-batch samples in each training iteration. The block-wise design can
efficiently saturate GPU resources while avoiding intra-warp scheduling overhead. Secondly, the threads in
the same warp access the same neighbor information, which can be cached in the shared memory. Thirdly,
the complexity of the binary search is proportional to the neighbor size while the complexity of the bitmap
is inversely proportional to the neighbor size, leading to a balanced workload across different blocks. In
our experiments, we verify that our GPU neighbor finder achieves an order-of-magnitude speedup compared with
existing CPU neighbor finders [131].
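For readability, the following PyTorch sketch expresses the same logic as Algorithm 2 with tensor operations on a T-CSR-like layout (per-node neighbor lists sorted by timestamp); it is a stand-in for the block-centric CUDA kernel rather than the kernel itself, and only the most-recent branch is shown.

import torch

def most_recent_neighbors(indptr, nbr_ids, nbr_ts, target_nodes, target_ts, m):
    # indptr/nbr_ids/nbr_ts mimic a CSR graph whose per-node neighbors are sorted by timestamp.
    out_ids = torch.full((target_nodes.numel(), m), -1, dtype=torch.long)
    out_ts = torch.zeros(target_nodes.numel(), m)
    for i, (v, t) in enumerate(zip(target_nodes.tolist(), target_ts.tolist())):
        lo, hi = indptr[v].item(), indptr[v + 1].item()
        ts = nbr_ts[lo:hi]
        p = torch.searchsorted(ts, torch.tensor([t], dtype=ts.dtype)).item()   # pivot before time t
        take = min(m, p)
        if take > 0:                                       # the 'take' most recent valid neighbors
            out_ids[i, :take] = nbr_ids[lo + p - take : lo + p]
            out_ts[i, :take] = ts[p - take : p]
    return out_ids, out_ts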
Algorithm 3: GPU Edge Feature Caching
Require: edge features {x_e, ∀e ∈ E}, edge caching budget k, cache replacement threshold ϵ
1  Q ← {0}_{i=1}^{|E|};
2  randomly cache k edge features to VRAM;
3  for epoch 1 to T do
4      for each edge read request e do
5          serve x_e from the VRAM cache or RAM;
6          Q[e] ← Q[e] + 1;
7      end for
8      if |cached edges ∩ Q_topk| < ϵ then
9          update cache with Q_topk edge features
10     end if
11 end for
3.4.4 GPU Feature Caching
To address the issue of the dominant CPU-GPU feature slicing overhead in TGNN training, we propose a
GPU cache for the features of nodes and edges with high-access frequency. For dynamic graphs, since edge
features are usually tremendously larger than node features, here we demonstrate the more commonly used
case of edge feature caching. Due to the temporal adaptive mini-batch selection and neighbor sampling,
the access pattern of TASER changes during the training process, which requires a dynamic cache. One
naive approach is to maintain an O(|E| × |E|) matrix to store the access frequency of every supporting
neighbor for every training sample. However, this results in unacceptable storage overhead, and the cache
update time may even exceed the training time. Although increasing the cache line size can quadratically
reduce the memory overhead, the cache hit rate also drops drastically due to the more coarse-grained
policy. Empirically, we observe that increasing the cache line size from 1 to 512 leads to more than 20%
drop in cache hit rate. On the other hand, since TASER uses the Adam optimizer, the dynamic edge access
pattern will eventually stabilize. Therefore, we leverage the historical edge access pattern to update the
cache policy. After each epoch, if the overlap between the cached edges and the k most frequently accessed
edges Qtopk of the previous epoch is less than a predefined threshold ϵ, we swap the cached content with
Table 3.1: Dataset Statistics. |dv| and |de| show the dimensions of node and edge features, respectively.
|V| |E| |dv| |de| train/val/test
Wikipedia 9,227 157,474 - 172 110k/23k/23k
Reddit 10,984 672,447 - 172 470k/101k/101k
Flights 13,169 1,927,145 100 - 600k/200k/200k
MovieLens 371,715 48,990,832 - 266 600k/200k/200k
GDELT 16,682 191,290,882 413 130 600k/200k/200k
features of Qtopk. Note that this lightweight cache replacement policy only requires O(|E|) computation,
significantly less than the probability-based policy even with a large line size.
Algorithm 3 shows the GPU edge feature caching during training. For each iteration of mini-batch
training, TASER concurrently slices a batch of edge features layer by layer and updates the frequency of
accessed edges in parallel. For edge features that are not stored in the VRAM cache, we directly slice the
feature through the unified virtual memory with zero-copy access over PCI-e.
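The epoch-level replacement policy of Algorithm 3 can be sketched as follows; the class name, the interpretation of the threshold as a fraction of the cache budget, and the simulated VRAM/RAM split are assumptions of this sketch.

import torch

class EdgeFeatureCache:
    # Frequency-based GPU edge feature cache in the spirit of Algorithm 3.
    def __init__(self, edge_feats_cpu, budget, threshold=0.9):
        self.cpu_feats = edge_feats_cpu                                   # features kept in host memory
        self.budget, self.threshold = budget, threshold
        self.counts = torch.zeros(edge_feats_cpu.shape[0], dtype=torch.long)
        self.cached = torch.randperm(edge_feats_cpu.shape[0])[:budget]   # initial random cache
        self.gpu_feats = edge_feats_cpu[self.cached].cuda()
        self.pos = {int(e): i for i, e in enumerate(self.cached)}

    def fetch(self, edge_ids):
        # Serve each requested edge from the GPU cache when present, otherwise from host memory.
        self.counts.index_add_(0, edge_ids, torch.ones_like(edge_ids))
        rows = [self.gpu_feats[self.pos[e]] if e in self.pos
                else self.cpu_feats[e].cuda(non_blocking=True)
                for e in edge_ids.tolist()]
        return torch.stack(rows)

    def end_of_epoch(self):
        # Replace the cache when its overlap with the top-k most accessed edges becomes too small.
        topk = torch.topk(self.counts, self.budget).indices
        overlap = len(set(topk.tolist()) & set(self.cached.tolist()))
        if overlap < self.threshold * self.budget:
            self.cached = topk
            self.gpu_feats = self.cpu_feats[topk].cuda()
            self.pos = {int(e): i for i, e in enumerate(topk)}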
3.5 Experiments
3.5.1 Experimental Setup
Datasets. We evaluate the performance of TASER on five dynamic graph datasets, whose statistics are
shown in Table 3.1. Among them, Wikipedia [56], Reddit∗
[56], and MovieLens [99] are bipartite graphs
without node features. Flights [73] is a traffic graph without edge features, and GDELT [131] is a large-scale
knowledge graph including both node and edge features. The tasks are to predict user posts (Wikipedia,
Reddit, MovieLens), flight schedules (Flights), and news (GDELT). To simulate the use cases in real-world
applications, for large-scale datasets with more than one million temporal edges, we use the latest one million
edges with 60%, 20%, and 20% chronological splits as the training, validation, and test sets, respectively.
TGNN Models. We build TASER on two state-of-the-art TGNN models. TGAT [116] uses a 2-layer
attention-based temporal aggregator with supporting nodes uniformly sampled from the historical neighbors.
∗The Reddit dataset used in this paper is obtained exclusively from the work [56], and no data is directly scraped from the
Reddit website.
GraphMixer [22] uses a single-layer MLP-Mixer temporal aggregator with the most recent neighbors as
supporting nodes. To ensure a fair comparison, we keep the number of supporting neighbors to 10, the
default value in both baselines. Note that TASER does not provide any additional input to the TGNN models
other than selecting high-quality supporting neighbors of the same size.
Configurations. For the TGNN models, we follow the default parameters used in the TGL framework [131]
for a fair comparison. In particular, we use the 0.0001 learning rate, 600 batch size, 200 training epochs, and
n = 10 supporting neighbors per node for all the datasets and all the models. We set the dimension of all
the hidden embeddings and encodings to 100. For methods with adaptive neighbor sampling, we set m = 25
as the budget of the neighbor finder for all the datasets, except for the ablation study in Section 3.5.6. We
follow DistTGL to evaluate the performance of transductive temporal link prediction using Mean Reciprocal
Rank (MRR) with 49 randomly sampled negative destination nodes. Please refer to our open-sourced code†
for more details on the hyper-parameters.
Hardware and Software. We implement TASER using Python 3.11, PyTorch 2.0.1, DGL 1.1, and CUDA
12.2. All the experiments are conducted on a machine with dual 96-Core AMD EPYC 9654 CPUs paired
with 1.5TB ECC-DDR5 RAM and a single NVIDIA RTX 6000 Ada GPU with 48GB ECC-GDDR6 VRAM.
3.5.2 Accuracy
Table 3.2 shows the accuracy of TASER on the five datasets. We create two variants to better evaluate
the effectiveness of each of these two adaptive sampling methods in TASER, where w./ Ada. Mini-Batch
denotes baseline methods with adaptive mini-batch selection and w./ Ada. Neighbor is the one with adaptive
neighbor sampling. With both adaptive mini-batch selection and neighbor sampling, TASER achieves an
average of 2.3% MRR improvements over the baselines. TGAT gets an average of 3.1% improvements with
TASER, while GraphMixer only gets 1.4% improvements. Intuitively, this is because TGAT takes 2-hop
†
https://github.com/facebookresearch/taser-tgnn
Table 3.2: Accuracy of TASER and baselines in MRR (%). All results are an average of 5 runs.

                     Wikipedia              Reddit                 Flights                MovieLens              GDELT
                     TGAT       GraphMixer  TGAT       GraphMixer  TGAT       GraphMixer  TGAT       GraphMixer  TGAT       GraphMixer
Baseline             68.76±0.40 74.05±0.21  81.05±0.04 75.11±0.07  80.50±0.08 77.86±0.08  63.11±0.04 69.01±0.06  79.01±0.12 76.26±0.47
w./ Ada. Mini-Batch  72.22±0.41 75.34±0.19  82.57±0.08 76.25±0.07  82.65±0.06 78.89±0.09  63.97±0.08 69.11±0.04  80.34±0.04 76.74±1.09
w./ Ada. Neighbor    73.96±0.51 74.70±1.24  81.66±0.04 75.63±0.17  81.46±0.31 78.94±0.93  65.51±0.42 69.26±0.18  80.22±0.06 76.49±0.15
TASER                75.98±0.35 76.48±0.97  82.59±0.16 76.85±0.56  82.64±0.25 79.39±0.64  65.79±0.13 69.47±0.10  81.04±0.11 76.99±0.72
(Improvement)        (+7.22)    (+2.43)     (+1.54)    (+1.74)     (+2.14)    (+1.53)     (+2.68)    (+0.46)     (+2.03)    (+0.73)
51
neighbors as the input, which benefits more from the adaptive neighbor sampler compared to the 1-hop
neighbors of GraphMixer. On the one hand, each variant of TASER consistently outperforms the baseline
TGNNs by a large margin, revealing the effectiveness of TASER both in denoising training samples and
supporting neighbors. On the other hand, our results suggest that these two orthogonal adaptive sampling
techniques can be employed collectively to either enhance or, at least, maintain accuracy.
We notice that the same neighbor decoder, when paired with different temporal aggregators, leads
to remarkably different performance. This justifies the need for a general encoder-decoder scheme in
TASER. In addition, improving the architectural consistency of the whole model by choosing a neighbor
decoder with an architecture similar to that of the temporal aggregator can reduce training difficulty and
thus improve accuracy. We observe substantial accuracy gains (up to 6%) when pairing the MLP-Mixer
neighbor decoder with GraphMixer, yet only minimal improvements with TGAT, which instead prefers the
GATv2 neighbor decoder. For the neighbor encoder, our proposed frequency encoding and identity encoding
consistently work well with any neighbor decoder, reducing the variance of the test accuracy and improving
the MRR by 0.6% ∼ 1.8%.
3.5.3 Runtime
In this section, we evaluate the speedup of our proposed optimizations in TASER. The training time of
TASER can be broken down into the four dominant steps: neighbor finding, adaptive neighbor sampling,
feature slicing, and forward and backward propagation. We build TASER using the optimized temporal
aggregators as proposed in TGL [131]. For the baseline, we slice all the features from RAM to GPU in each
training iteration and use the original neighbor finder implementation in TGAT [116] and GraphMixer [22].
Note that although TGL provides a high-performance parallel CPU neighbor finder, it maintains a pointer
array for efficient temporal neighborhood searching that only supports training in chronological order,
which does not work in TASER since our mini-batches are randomly sampled from a dynamic
distribution.
Table 3.3: Total Runtime Breakdown per Epoch (sec). NF, AS, FS, and PP denote neighbor finding, adaptive
neighbor sampling, feature slicing, and propagation, respectively. The percentages (%) give the ratio of a
step's runtime to the total epoch time. An arrow (↑) indicates the same runtime as the cell above it.
TGAT:
             Variant       NF (%)         AS      FS (%)        PP      Total (Impr.)
Wikipedia    Baseline      40.27 (70%)    2.55    11.26 (19%)   3.73    57.81 (1.00×)
             +GPU NF       0.07 (0%)      ↑       ↑ (64%)       ↑       17.61 (3.28×)
             +10% Cache    ↑ (1%)         ↑       0.99 (13%)    ↑       7.35 (7.86×)
             +20% Cache    ↑ (1%)         ↑       0.71 (10%)    ↑       7.07 (8.17×)
             +30% Cache    ↑ (1%)         ↑       0.54 (8%)     ↑       6.90 (8.38×)
Reddit       Baseline      218.56 (77%)   10.18   41.56 (15%)   12.62   282.93 (1.00×)
             +GPU NF       0.37 (1%)      ↑       ↑ (64%)       ↑       64.73 (4.37×)
             +10% Cache    ↑ (1%)         ↑       4.41 (16%)    ↑       27.58 (10.25×)
             +20% Cache    ↑ (1%)         ↑       2.95 (11%)    ↑       26.12 (10.82×)
             +30% Cache    ↑ (2%)         ↑       2.36 (9%)     ↑       25.53 (11.08×)
MovieLens    Baseline      276.28 (69%)   17.62   79.15 (20%)   27.60   400.66 (1.00×)
             +GPU NF       0.54 (0%)      ↑       ↑ (63%)       ↑       124.92 (3.20×)
             +10% Cache    ↑ (1%)         ↑       12.05 (21%)   ↑       57.81 (6.93×)
             +20% Cache    ↑ (1%)         ↑       9.67 (17%)    ↑       55.43 (7.22×)
             +30% Cache    ↑ (1%)         ↑       7.99 (15%)    ↑       53.75 (7.45×)
GDELT        Baseline      322.40 (83%)   17.08   17.84 (5%)    29.52   386.84 (1.00×)
             +GPU NF       0.56 (1%)      ↑       ↑ (27%)       ↑       65.00 (5.95×)
             +10% Cache    ↑ (1%)         ↑       3.08 (6%)     ↑       50.25 (7.69×)
             +20% Cache    ↑ (1%)         ↑       2.17 (4%)     ↑       49.34 (7.83×)
             +30% Cache    ↑ (1%)         ↑       2.39 (5%)     ↑       49.56 (7.80×)

GraphMixer:
             Variant       NF (%)         AS      FS (%)        PP      Total (Impr.)
Wikipedia    Baseline      0.75 (23%)     0.46    0.61 (19%)    1.45    3.28 (1.00×)
             +GPU NF       0.04 (2%)      ↑       ↑ (24%)       ↑       2.56 (1.27×)
             +10% Cache    ↑ (2%)         ↑       0.18 (8%)     ↑       2.13 (1.53×)
             +20% Cache    ↑ (2%)         ↑       0.16 (8%)     ↑       2.11 (1.55×)
             +30% Cache    ↑ (2%)         ↑       0.13 (6%)     ↑       2.08 (1.57×)
Reddit       Baseline      3.23 (23%)     1.98    2.36 (17%)    6.23    13.79 (1.00×)
             +GPU NF       0.19 (2%)      ↑       ↑ (22%)       ↑       10.75 (1.28×)
             +10% Cache    ↑ (2%)         ↑       0.81 (9%)     ↑       9.19 (1.50×)
             +20% Cache    ↑ (2%)         ↑       0.71 (8%)     ↑       9.09 (1.51×)
             +30% Cache    ↑ (2%)         ↑       0.60 (7%)     ↑       8.98 (1.53×)
MovieLens    Baseline      5.57 (13%)     4.80    19.61 (52%)   12.71   42.68 (1.00×)
             +GPU NF       0.32 (1%)      ↑       ↑ (24%)       ↑       37.45 (1.14×)
             +10% Cache    ↑ (2%)         ↑       0.46 (2.5%)   ↑       18.30 (2.33×)
             +20% Cache    ↑ (2%)         ↑       0.45 (2.5%)   ↑       18.29 (2.33×)
             +30% Cache    ↑ (2%)         ↑       0.46 (2.5%)   ↑       18.30 (2.33×)
GDELT        Baseline      6.5 (12%)      6.19    15.37 (29%)   25.33   53.33 (1.00×)
             +GPU NF       0.36 (1%)      ↑       ↑ (32%)       ↑       47.24 (1.12×)
             +10% Cache    ↑ (1%)         ↑       0.52 (2%)     ↑       32.40 (1.64×)
             +20% Cache    ↑ (1%)         ↑       0.54 (2%)     ↑       32.42 (1.64×)
             +30% Cache    ↑ (1%)         ↑       0.54 (2%)     ↑       32.41 (1.64×)
As shown in Table 3.3, the bottlenecks of the baseline are neighbor finding and feature slicing. After
applying our GPU neighbor finder and GPU feature caching with 20% of total edge features, the ratio of
mini-batch generation time (i.e., neighbor finding time plus feature slicing time) to the total runtime drops
significantly from 40% ∼ 92% to 3% ∼ 18%. The rest of the runtime mainly lies in the neural network
computation, which is proportional to the computational complexity. Since the Flights dataset does not
contain edge features and the node features can be entirely stored on GPU, we do not demonstrate its
runtime. TASER achieves an average of 8.68× speedup on TGAT and 1.77× speedup on GraphMixer. TGAT
is a 2-layer TGNN and requires a squared number of supporting neighbors, suffering a greater impact from
the inefficiency of neighbor finding and feature slicing. On GDELT, since we use the latest one million
temporal edges for training and evaluation, caching 20% of edge features is already sufficient for the training
set.
3.5.4 GPU Neighbor Finder
Fig. 3.3(a) compares the runtime of different uniform neighbor finders, including the original Python-implemented neighbor finder [116], the high-performance CPU parallel neighbor finder from TGL [131],
and our TASER GPU neighbor finder. Since the TGL neighbor finder only supports chronological training
order, we use chronological order on all three neighbor finders for a fair comparison. To better reflect
the actual runtime of CPU neighbor finders, we also include the CPU-GPU data loading time for the
sampled neighbor indices. Note that the TGL neighbor finder is built on a pointer array that leverages
the chronological training order for fast memory access. Although we do not specifically optimize for
the chronological training order, our GPU neighbor sampler is still orders of magnitude faster than the
TGL neighbor finder. As shown in Fig. 3.3(a), when the number of neighbors per layer is set to 25, our
TASER neighbor finder achieves a speedup of more than three orders of magnitude compared to the original
neighbor finder, and a 37 ∼ 56× speedup compared to the TGL neighbor finder, across all five datasets.
Figure 3.3: (a) Total sampling time per epoch of a 2-layer TGAT with different neighbor finders (the original
CPU neighbor finder, the TGL CPU neighbor finder, which only supports chronological order, and the
TASER GPU neighbor finder) and different numbers of neighbors per layer on the Wikipedia, Reddit, Flights,
MovieLens, and GDELT datasets. (b) Cache hit rate of the TASER caching strategy and the Oracle caching
strategy (10%, 20%, and 30% cache ratios) across training epochs.
Figure 3.4: Test MRR of (a) TGAT and (b) GraphMixer with TASER on the Wikipedia dataset. m and
n denote the numbers of neighbors selected by the neighbor finder and the adaptive neighbor sampler,
respectively.
3.5.5 GPU Cache
We compare our GPU caching strategy with the Oracle caching strategy, which assumes the access frequency
of each edge is known in advance. Both our caching strategy and the Oracle caching strategy are updated
at the end of each epoch. Fig. 3.3 (b) shows that our GPU caching strategy achieves a near-optimal cache
hit rate, close to the Oracle cache with the same size. We choose the 10%, 20%, and 30% cache ratio as
they fit mainstream GPUs with 8GB, 16GB, and 40GB VRAM on the GDELT dataset, respectively. The
cache hit rates increase proportionally with the cache ratio until the Oracle cache is able to include all the
accessed features. We observe that, as the entire model’s weights progressively stabilize, our GPU cache
rarely necessitates an update after 20 epochs, further confirming the low maintenance overhead of our strategy. Note
that the cache hit rate of the Oracle caching strategy can also reflect the explore-and-exploit strategy of
the adaptive samplers. For instance, on the Wikipedia dataset, the cache hit rate of the 10% Oracle cache
first increased to 73% and then gradually decreased to 69%, illustrating that the adaptive samplers initially
exploit high-reward edges and subsequently explore other training samples and supporting neighbors to
improve the accuracy.
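A minimal sketch of an epoch-end, frequency-based cache refresh consistent with this description is given below; the class and method names, and the exact top-k eviction rule, are assumptions rather than the actual TASER implementation.

import torch

class EdgeFeatureCache:
    # Keep the edge features accessed most often in the last epoch on the GPU.
    def __init__(self, edge_feats_cpu, cache_ratio, device="cuda"):
        self.feats_cpu = edge_feats_cpu                     # (|E|, d) CPU tensor
        self.capacity = int(cache_ratio * edge_feats_cpu.shape[0])
        self.counts = torch.zeros(edge_feats_cpu.shape[0], dtype=torch.long)
        self.slot = torch.full((edge_feats_cpu.shape[0],), -1, dtype=torch.long)
        self.gpu_feats = None
        self.device = device

    def fetch(self, edge_ids):
        # Gather features for one mini-batch, reading cached rows from the GPU.
        self.counts += torch.bincount(edge_ids, minlength=self.counts.numel())
        out = torch.empty(len(edge_ids), self.feats_cpu.shape[1], device=self.device)
        hit = self.slot[edge_ids] >= 0
        if self.gpu_feats is not None and hit.any():
            out[hit.to(self.device)] = self.gpu_feats[self.slot[edge_ids[hit]].to(self.device)]
        miss = ~hit
        out[miss.to(self.device)] = self.feats_cpu[edge_ids[miss]].to(self.device)
        return out

    def refresh(self):
        # Called once at the end of each epoch: cache the hottest edge features.
        hot = torch.topk(self.counts, self.capacity).indices
        self.gpu_feats = self.feats_cpu[hot].to(self.device)
        self.slot.fill_(-1)
        self.slot[hot] = torch.arange(self.capacity)
        self.counts.zero_()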
3.5.6 Ablation Study
We evaluate the performance of TASER with different neighbor budgets. Fig. 3.4 demonstrates that TASER
is versatile across various numbers of neighbor candidates m and sampled supporting neighbors n. We note
that the accuracy does not improve when increasing m for GraphMixer with n = 5. Since GraphMixer is a
one-layer TGNN model, it has only 5 supporting nodes per root node when n = 5, while a 2-layer TGAT
has 5 + 5 × 5 = 30 supporting nodes. Selecting n = 5 as the hyper-parameter choice for GraphMixer is
suboptimal for real-world applications, leading to inaccurate supervision for the adaptive sampler from
the TGNN model. The results validate our hypothesis that, with a larger number of neighbor candidates
m, the adaptive neighbor sampler is capable of selecting supporting neighbors that provide more pivotal
information for task prediction. It also shows that TASER consistently performs well when TGNNs prefer a
larger number of supporting neighbors n.
Chapter 4
TGL and DistTGL: Multi-GPU Scalable TGNN Training
4.1 Abstract
Many real-world graphs contain time domain information. Temporal Graph Neural Networks capture
temporal information as well as structural and contextual information in the generated dynamic node
embeddings. Researchers have shown that these embeddings achieve state-of-the-art performance in many
different tasks. In the first half of this Chapter, we propose TGL, a unified framework for large-scale
offline Temporal Graph Neural Network training where users can compose various Temporal Graph Neural
Networks with simple configuration files. TGL comprises five main components: a temporal sampler,
a mailbox, a node memory module, a memory updater, and a message passing engine. We design a
Temporal-CSR data structure and a parallel sampler to efficiently sample temporal neighbors to form
training mini-batches. We propose a novel random chunk scheduling technique that mitigates the problem
of obsolete node memory when training with a large batch size. To address the limitations of current
TGNNs only being evaluated on small-scale datasets, we introduce two large-scale real-world datasets with
0.2 and 1.3 billion temporal edges. We evaluate the performance of TGL on four small-scale datasets with a
single GPU and the two large datasets with multiple GPUs for both link prediction and node classification
tasks. We compare TGL with the open-sourced code of five methods and show that TGL achieves similar or
better accuracy with an average of 13× speedup. Our temporal parallel sampler achieves an average of
173× speedup on a multi-core CPU compared with the baselines. On a 4-GPU machine, TGL can train one
epoch of more than one billion temporal edges within 1-10 hours.
Memory-based Temporal Graph Neural Networks are powerful tools in dynamic graph representation
learning and have demonstrated superior performance in many real-world applications. However, their
node memory favors smaller batch sizes to capture more dependencies in graph events and needs to be
maintained synchronously across all trainers. As a result, existing frameworks suffer from accuracy loss
when scaling to multiple GPUs. Even worse, the tremendous overhead of synchronizing the node memory
makes it impractical to deploy the solution in GPU clusters. In the second half of this Chapter, we propose
DistTGL — an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
DistTGL has three improvements over existing solutions: an enhanced TGNN model, a novel training
algorithm, and an optimized system. In experiments, DistTGL achieves near-linear convergence speedup,
outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17× in training
throughput.
4.2 TGL: A General Framework for TGNN Training
Existing graph deep learning frameworks like DGL and PyG do not provide efficient data structures, samplers,
and message passing primitives for dynamic graphs, which requires users to implement extra modules to
compose TGNN models. In addition, it is also challenging to design an efficient and versatile framework
that is capable of unifying the different schemes of different TGNN variants. Recently, PyTorch Geometric
Temporal (PyGT) [80] attempted to design a library for dynamic and temporal geometric deep learning on
top of PyG. However, PyGT only supports discrete time snapshot-based methods and full batch training on
small-scale spatial-temporal graphs.
To fill these gaps, we develop TGL, the first general framework for large-scale offline TGNNs training. In
this work, we focus on the widely used edge-timestamped dynamic graphs where each edge is associated with
Figure 4.1: Accuracy (AP) and per-epoch training time of TGL compared with the baselines (JODIE, TGAT,
TGN, APAN) on the Wikipedia dataset (600 batch size).
a timestamp. TGL supports all TGNN variants that aggregate and refine information from maintained states
or features of selected temporal neighbors. The survey [53] categorizes dynamic graphs into Continuous
Time Dynamic Graphs (CTDGs) and Discrete Time Dynamic Graphs (DTDGs) based on the continuous
or discrete quantity of the timestamps. However, we believe that DTDGs are essentially CTDGs with
granulated timestamps. Hence, we design TGL to support the more general CTDGs and evaluate TGL by
comparing the performance of TGNN variants targeting both CTDGs and DTDGs in the experiments. Our
main contributions are
• We design a unified framework that supports efficient training on most TGNN architectures by
studying the characteristics of a diverse set of TGNN variants, including snapshot-based TGNNs [84,
71], time encoding-based TGNNs [116, 79, 105], and memory-based TGNNs [56, 79, 105, 95].
• We design a CSR-based data structure for rapid access to temporal neighbors and a parallel sampler
that supports different temporal neighbor sampling algorithms. Our parallel sampler can quickly
locate the temporal edges to sample from by maintaining auxiliary pointer arrays.
Figure 4.2: Overview (forward path) of the proposed framework. ① Sample neighbors for the root nodes
with timestamps in the current mini-batch. ② Look up the memory and the mailbox for the supporting
nodes. ③ Transfer the inputs to GPU and update the memory. ④ Perform message passing using the
updated memory as input. ⑤ Compute loss with the generated temporal embeddings. ⑥ Update the
memory and the mailbox for the next mini-batch.
• We propose a novel random chunk scheduling technique that mitigates the loss of intra-batch dependencies when training with a large batch size for methods using node memory, which enables
multi-GPU training on large-scale dynamic graphs.
• To better compare the performance of various TGNN methods, we introduce two large-scale datasets
with up to billions of edges – the GDELT and MAG datasets – which represent dynamic graphs with a
long time duration and dynamic graphs with a larger number of nodes, respectively.
• We compare the performance of TGL with the baseline open-sourced codes on two small-scale
datasets. TGL achieves similar or higher accuracy for all baseline methods with an average speedup
of 13× as shown in Figure 4.1. On the large-scale datasets, TGL achieves an average of 2.3× speedup
when using 4 GPUs.
4.2.1 Approach
In this section, we present TGL – a general framework for efficient TGNNs training on large-scale dynamic
graphs. Figure 4.2 shows the overview of the training of TGL on a single GPU. We split the modules with
learnable and non-learnable parameters and store them on GPU and CPU, respectively. For datasets where the GPU
memory is sufficient to hold all information, the non-learnable modules can also be stored and computed on
the GPU to speed up the training. To be compatible with different TGNN variants, we design five general
components: the temporal sampler, the mailbox, the node memory, the memory updater, and the attention
aggregator. For snapshot-based TGNNs, the temporal sampler would sample in each snapshot individually.
Note that in TGL, we do not treat graph snapshots as static windows. Instead, the graph snapshots are
dynamically created according to the timestamp of the target nodes. This allows the snapshot-based TGNNs
to generate dynamic node embeddings at any timestamps instead of a constant embedding in a static
snapshot.
TGNNs are usually self-supervised by the temporal edges because it is hard to get dynamic graphs
with enough dynamic node labels to supervise the training. Training with temporal edges causes the
“information leak” problem where the edges to predict are already given to the models as input. The
information leak problem in the attention aggregator can be simply avoided by not sampling along the
edges to predict. In node memory, the information leak problem is eliminated by caching the input from
previous mini-batches [79], which enables the node memory to receive gradients. In TGL, we adopt the
mailbox module [105] to store a fixed number of the most recent mails for updating the node memory.
When a new interaction appears, we first update the node memory of the involved nodes with the cached
messages in the mailbox. The messages in the mailbox are updated after the dynamic node embeddings are
computed. Note that to keep the node memory consistent, the same updating scheme is used at inference
when updating the node memory is not necessary.
4.2.2 Parallel Temporal Sampler
Sampling neighbors on dynamic graphs is complex as the timestamps of the neighbors need to be considered.
In the offline training process, TGL stores the entire dynamic graph statically where the timestamps are
Algorithm 4: Parallel Temporal Sampler
Data: sorted T-CSR G
Input: root nodes n with timestamps t_n, number of layers L, number of neighbors in each layer k_l, number of snapshots S, snapshot length t_s
Output: DGL MFGs
1  advance the pointer of n to t_n in pt(S + 1) in parallel;
2  for l in 0..L do
3      for s in 0..S do
4          if l > 0 then
5              set n and t_n to the sampled neighbors in layer l − 1;
6          end if
7          if l == 0 then
8              advance the pointer of n to t_n − s ∗ t_s in pt(S − s − 1) in parallel;
9          else
10             binary search in the snapshot S_s for each node n ∈ n in parallel;
11         end if
12         foreach n ∈ n in parallel do
13             sample k_l neighbors within the snapshot S_s;
14         end foreach
15         generate DGL MFGs;
16     end for
17 end for
attached to the nodes and edges. For snapshot-based TGNNs, the temporal samplers need to identify the
snapshots before sampling. Other TGNNs that either sample uniformly from all past neighbors or sample
most recent neighbors can be treated as single snapshot TGNNs with infinite snapshot length. Their temporal
samplers also need to identify the candidate edges and their sampling probabilities. Hence, it is important
to design a data structure that can rapidly identify the dynamic candidate set of temporal neighbors to
sample from. Combined with the fact that the mini-batches in TGNNs training follow chronological order
(have non-decreasing timestamps), we propose the Temporal-CSR (T-CSR) data structure.
The T-CSR Data Structure Besides the indptr and indices array of the CSR data structure, for each node,
T-CSR sorts the outgoing edges according to their timestamps as shown in Figure 4.3. After sorting all the
edges in a dynamic graph, we assign edge ids according to their position (indexes) in the sorted indices
and times arrays. In addition, for a TGNN model with n snapshots, we maintain n + 1 pointers for each
node that point at the first and last edges in these snapshots. Formally, the T-CSR data structure is defined
Figure 4.3: T-CSR representation of the node v1 with four temporal edges e1 to e4 with timestamps t1 to
t4 connected to neighbors v2 to v5. The indices and times arrays are sorted by the edge timestamps and
indexed by the edge ids e1 to e4. S0 and S1 denote two snapshots of the temporal graph, designated by the
pointers pt0 to pt2.
by an indptr array of size |V| + 1, an indices array and a times array of size |E|, and n + 1 pointer arrays
of size |V |, which leads to a total space complexity of O(2|E| + (n + 2)|V |). For dynamic graphs with
inserting, updating, and deletion of edges and nodes, the T-CSR data structure can treat them as standalone
graph events and allocate their own entries in the indices and times array.
Sampling With the help of the T-CSR data structure, we can quickly choose an edge between two pointers
uniformly, or pick edges closest to the end pointer for the most recent neighbors. These pointers are stored
in arrays and take an additional O(n|V|) storage and O(|E|) computation complexity to maintain in one
epoch, but allow the sampler to identify candidate edges in O(1). By contrast, performing binary search
would lead to O(|E| log |E|) computation complexity to identify candidate edges in one epoch. Note that
some TGNNs like TGAT [116] use the timestamp of the neighbors to sample multi-hop temporal neighbors,
instead of using the timestamp of the root nodes. For these TGNNs, the proposed pointer only works for
the hop-1 neighbors. Since the edges are sorted in T-CSR, we can still use binary search to quickly find the
candidate edges before sampling.
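To illustrate the idea, a simplified single-threaded, single-snapshot sketch of pointer-based candidate identification and sampling on the T-CSR arrays is given below (written in Python for readability; the actual sampler is a parallel C++ implementation, and the helper names are illustrative):

import numpy as np

def sample_temporal_neighbors(indptr, indices, times, ptr, roots, k,
                              most_recent=False, rng=None):
    # indptr/indices/times: T-CSR arrays, edges of each node sorted by timestamp.
    # ptr[u]: one past the last edge of u whose timestamp is smaller than the
    #         current root timestamp (maintained as the epoch advances chronologically).
    rng = rng or np.random.default_rng()
    sampled_nbrs, sampled_eids = [], []
    for u in roots:
        lo, hi = indptr[u], ptr[u]          # candidate temporal edges of u, found in O(1)
        if hi <= lo:                        # no past neighbors yet
            sampled_nbrs.append(np.empty(0, dtype=indices.dtype))
            sampled_eids.append(np.empty(0, dtype=np.int64))
            continue
        if most_recent:                     # pick the edges closest to the end pointer
            eids = np.arange(max(lo, hi - k), hi)
        else:                               # uniform sampling (with replacement, for brevity)
            eids = rng.integers(lo, hi, size=min(k, hi - lo))
        sampled_nbrs.append(indices[eids])
        sampled_eids.append(eids)
    return sampled_nbrs, sampled_eids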
Parallel Sampling To leverage the multi-core CPU resources in the host machine, we exploit data parallelism to sample on the root nodes in a mini-batch as shown in Algorithm 4. In each mini-batch, the target
Algorithm 5: Random Chunk Scheduling
Data: training edges E, sorted T-CSR G, TGNN model M
Input: batch size bs, chunk size cs, number of training epochs ne
1  for e in 0..ne do
2      es ← rand(0, bs/cs) ∗ cs;
3      ee ← es + bs;
4      while ee ≤ |E| do
5          sample MFGs from E(es..ee);
6          train for one iteration with the current MFGs;
7          es ← es + bs;
8          ee ← ee + bs;
9      end while
10 end for
nodes are evenly distributed to each thread to update the pointers and sample the neighbors. Note that
when updating the pointers in parallel, it is possible that multiple threads share the same target nodes with
different timestamps, which causes race conditions. We add fine-grained locks to each node to avoid the
pointers being advanced multiple times under such conditions. When the same target nodes at different
timestamps appear multiple times in one mini-batch, it is also possible that the target nodes with small
timestamps sample temporal neighbors from the future. We prevent information leaks in such situations by
strictly enforcing that the sampled temporal neighbors have smaller timestamps than the root nodes. After
each thread finishes sampling in each mini-batch, we generate a DGL Message Flow Graph (MFG) for each
layer [103], which contains all the input data needed in the forward and backward propagation and pass it
to the trainer.
4.2.3 Parallel Training
In order to scale static GNN training to large graphs, recent works [125, 102] increase the batch size to
take advantage of the massive data parallelism provided by multi-GPU servers or GPU clusters. However,
training TGNN with a large batch size suffers from the intrinsic temporal dependency in the node memory.
Defining the dependent edges as pairs of training edges that share common supporting nodes in the source
or destination nodes, we can divide the edge dependencies into two types:
• Intra-batch dependencies refer to the dependent edges in the same mini-batch. In TGNN training,
the intra-batch dependencies are discarded in order to process the edges in a mini-batch in parallel.
• Inter-batch dependencies refer to the dependent edges in different mini-batches. TGNNs take
these inter-batch relations into account by updating the node memory and the mailbox after each
mini-batch.
Since the total number of intra- and inter-batch dependencies is constant on one dynamic graph, training
with a larger batch size discards more intra-batch dependencies and learns fewer inter-batch dependencies,
which leads to lower accuracy. To mitigate this issue, we propose a random chunk scheduling technique
that divides the training edges into chunks and randomly picks one chunk as the starting point in each
training epoch, which allows close chunks to be arranged in different mini-batches in different training
epochs, hence learning more inter-batch dependencies. The random chunk training algorithm is shown in
Algorithm 5.
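For concreteness, a minimal Python rendition of Algorithm 5 is sketched below; the sampling and training calls (sample_mfgs, train_one_iter) are placeholders for the corresponding TGL components rather than actual TGL APIs.

import random

def random_chunk_training(num_edges, batch_size, chunk_size, num_epochs,
                          sample_mfgs, train_one_iter):
    # Shift the mini-batch boundaries by a random number of chunks each epoch,
    # so that nearby edges fall into different mini-batches across epochs.
    chunks_per_batch = batch_size // chunk_size
    for epoch in range(num_epochs):
        start = random.randrange(chunks_per_batch) * chunk_size  # random chunk offset
        end = start + batch_size
        while end <= num_edges:
            mfgs = sample_mfgs(start, end)   # sample supporting neighbors for edges [start, end)
            train_one_iter(mfgs)             # one forward/backward pass
            start += batch_size
            end += batch_size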
To train TGL on multiple GPUs, we adopt the synchronized training setup of multiple GPUs on a
single node. On n GPUs, we launch n training processes and one sampling process with inter-process
communication protocols.
4.2.4 Experiments
We perform detailed experiments to evaluate the performance of TGL. We implement TGL using PyTorch
1.8.1 [72] and DGL 0.6.1 [103]. The parallel temporal sampler is implemented in C++ and integrated into
the Python training script using PyBind11 [48]. The open-sourced code of TGL is available at
https://github.com/tedzhouhk/TGL.
We select five representative TGNN variants as the baseline methods and evaluate their performance in
TGL.
• JODIE [56] is a pure memory-based TGNN method that uses an RNN to update the node memory with
the node messages. We use the open-sourced implementation provided as a baseline in TGN [79] as the
baseline code.
• DySAT [84] is a snapshot-based TGNN that uses RNN to combine the node embeddings from different
snapshots.
• TGAT [116] is an attention-based TGNN that gathers temporal information by the attention aggregator.
• TGN [79] is a memory-based TGNN that applies the attention aggregator on the node memory
updated by GRU with the node messages.
• APAN [105] is a pure memory-based TGNN method that uses the attention aggregator to update the
node memory by the node messages delivered to the multi-hop neighbors.
For a fair comparison, we set the receptive field to be 2-hop and fix the number of neighbors to sample per
hop at 10. The size of the mailbox is set to be 10 mails in APAN and 1 mail in other methods. For the COMB
function, we use the most recent mail in all methods, as we do not see a noticeable difference if switched
to the mean of mails. We set the dimension of the output dynamic node embeddings to be 100. We apply
the attention aggregator with 2 attention heads for the message passing step in all baseline methods. For
DySAT, we use 3 snapshots with the duration of each snapshot to be 10000 seconds on the four small-scale
datasets, 6 hours on GDELT, and 5 years on MAG. TGL uses dynamic snapshot windows to ensure that the
time resolution of the generated dynamic node embeddings is the same as the other TGNNs. For fairness,
we add layer normalization to JODIE and TGAT, which allows all methods to have layer normalization and
Table 4.1: Dataset statistics. The max(t) column shows the maximum edge timestamp (minimum edge
timestamp is 0 in all datasets). |dv| and |de| show the dimensions of node features and edge features,
respectively. The * denotes randomized features.
|V | |E| max(t) Labels Classes |dv| |de|
Wikipedia 9K 157K 2.7e6 217 2 - 172
Reddit 11K 672K 2.7e6 366 2 - 172
MOOC 7K 412K 2.6e6 - - - 128*
LastFM 2K 1.3M 1.3e8 - - - 128*
GDELT 17K 191M 1.8e5 42M 81 413 186
MAG 122M 1.3B 120 1.4M 152 768 -
in-between each layer. For all methods, we sweep the learning rate from {0.01,0.001,0.0001} and dropout
from {0.1,0.2,0.3,0.4,0.5}. The TGNN models are trained with the link prediction task and directly used in the
dynamic node classification task without fine-tuning [116, 79]. On all datasets, we follow the extrapolation
setting that predicts the links or node properties in the future, given the dynamic graphs in the past. We
provide comprehensive and nondiscriminatory benchmark results for various TGNNs by evaluating them
in the TGL framework.
4.2.4.1 Datasets
Table 4.1 shows the statistics of the six datasets we use to evaluate the performance of TGL. As the Wikipedia
[79], Reddit [79], MOOC [56], and LastFM [56] datasets are small-scale and bipartite dynamic graphs, in
order to evaluate the performance on general and large-scale graphs, we introduce two large-scale datasets
– GDELT and MAG. These two datasets contain 0.2 and 1.3 billion edges spanning multiple years and focus on
testing the capability of TGNNs in two different dimensions.
The GDELT dataset is a Temporal Knowledge Graph (TKG) originated from the Event Database in
GDELT 2.0 [58] which records events happening in the world from news and articles in over 100 languages
every 15 minutes. Compared with the previous small-scale featureless dataset extracted from the same
source [51], we propose a larger and featured version using the events that happened from the beginning of
2016 to the end of 2020. Our GDELT dataset is a homogeneous dynamic graph where the nodes represent
actors and temporal edges represent point-time events. Each node has a 413-dimensional multi-hot vector
representing the CAMEO codes attached to the corresponding actor to serve as node features. Each
temporal edge has a timestamp and a 186-dimensional multi-hot vector representing the CAMEO codes
attached to the corresponding event to serve as temporal edge features. The link prediction task on the
GDELT dataset predicts whether there will be an event happening between two actors at a given timestamp.
For the node classification task, we use the countries where the actors were located when the events
happened as the dynamic node labels. We remove the dynamic node labels for the nodes that have the same
labels at their most recent timestamps to make this task more challenging. We use the events before 2019, in
2019, and in 2020 as training, validation, and test set, respectively. The GDELT dataset has dense temporal
interactions between the nodes and requires TGNNs to be able to capture mutable node information for a
long time duration.
The MAG dataset is a homogeneous sub-graph of the heterogeneous MAG240M graph in OGB-LSC [43].
We extract the paper-paper citation network where each node in MAG represents one academic paper. A
directional temporal edge from node u to node v represents a citation of the paper v in the paper u and has
a timestamp representing the year when the paper u is published. The node features are 768-dimensional
vectors generated by embedding the abstract of the paper using RoBERTa [61]. The link prediction task on
the MAG dataset predicts which papers a new paper will cite. For the node classification task, we use
the arXiv subject areas as node labels. We use the papers published before 2018, in 2018, and in 2019 as
training, validation, and test sets. The MAG dataset tests the capability of TGNN models to learn dynamic
node embeddings on large graphs with stable nodes and edges.
4.2.4.2 Parallel Temporal Sampler
The performance of our parallel temporal sampler is evaluated on the g4dn.8xlarge instance on AWS EC2
with 32 virtual CPUs and 64GB of main memory. We select three representative sampling algorithms
Table 4.2: Execution time and improvement with respect to baseline samplers on the Wikipedia dataset for
one epoch.
DySAT TGAT TGN
#Threads 1 8 32 1 8 32 1 8 32
Time (s) 1.161 0.446 0.371 1.557 0.569 0.370 0.094 0.46 0.039
Improv. - - - 23× 48× 57× 69× 188× 289×
Figure 4.4: (a) Scalability of the temporal sampler on the Wikipedia dataset. (b) Runtime breakdown
(normalized by single-thread runtime) of the temporal sampler on the Wikipedia dataset with 1 (top), 8
(mid), and 32 (bottom) threads. Ptr., BS, Spl., and Oth. denote the time to update pointers, to perform binary
search, to sample neighbors, and to generate DGL MFGs in Algorithm 4, respectively.
• DySAT 2-layer sampling represents the temporal graph sampling for snapshot-based methods. The
supporting nodes are chosen uniformly from the temporal neighbors in each dynamic snapshot.
• TGAT 2-layer sampling represents uniform temporal graph sampling, which selects supporting
nodes uniformly from all past temporal neighbors.
• TGN 1-layer sampling represents the most recent temporal graph sampling, which selects the most
recent temporal neighbors as supporting nodes. Most-recent sampling is usually used in
memory-based methods and hence requires one less supporting layer.
Table 4.2 shows the improvement (speedup) of the temporal parallel sampler in TGL compared with the
samplers in the open-sourced baselines using different numbers of threads. The baseline samplers sample
the neighbors by performing single-thread vectorized binary search on sorted neighbor lists. We show the
sampling time for one epoch with a batch size of 600 positive and 600 negative edges. With our efficient
T-CSR data structure, TGL spends less than 0.5 seconds in sampling on one epoch of the Wikipedia dataset
for all three sampling algorithms. Using 32 threads, TGL achieves 57× and 289× speedup compared with
the sampler in TGAT and TGN. The speedup is a result of combined factors of 1) the T-CSR data structure,
2) data parallelism, and 3) the efficiency of C++ over Python.
Figure 4.4 shows the runtime and the runtime breakdown of our temporal parallel sampler using
different numbers of threads. TGL achieves 3.13×, 4.20×, and 2.42× speedup using 32 threads for the
DySAT, TGAT, and TGN sampling algorithms. The reasons for the sub-linear speedup are 1) node-wise locks
in updating the pointers, 2) memory performance bottleneck when fetching the selected edge information,
and 3) linear workload with respect to the number of threads when generating DGL MFGs.
4.2.4.3 Single-GPU Training
We evaluate the performance of TGL using the same g4dn.8xlarge AWS EC2 instance with one Nvidia
T4 GPU. We find that on all small datasets, the batch size of 600 positive edges with 600 negative edges
is a good balance point between the convergence rate and the training speed for memory-based TGNNs.
Hence, for a fair comparison, we use a batch size of 600 for all five selected TGNN variants in TGL and
their open-sourced baselines. For the MOOC and LastFM datasets, we randomly generate 128-dimensional
edge features since the original datasets do not contain node or edge features. We use 32 threads in the
temporal parallel sampler. All data are stored on GPU to avoid the data transfer overhead.
Table 4.3 shows the accuracy and per epoch training time of the five baselines and TGL in the link
prediction task. We report the accuracy in Average Precision (AP) on both the positive and negative test
edges. For all methods, TGL achieves similar or higher AP than the baselines with significantly faster runtime
(see Figure 4.1). The accuracy improvement on TGAT and JODIE is because we use layer normalization
in between each layer. The accuracy of TGAT and TGN also benefits from better hyperparameters and
convergence. TGN achieves the highest AP in the link prediction task on all datasets except the LastFM
Table 4.3: Link prediction results on the Wikipedia, Reddit, MOOC, and LastFM datasets. The Time columns
refer to the training time per epoch. (First, Second.)

Wikipedia:
         Baseline AP   Baseline Time (s)   TGL AP   TGL Time (s)   Speedup
JODIE    94.35         16.6                98.90    1.0            16.94×
DySAT    -             -                   96.37    6.4            -
TGAT     95.09         110.1               97.26    6.6            16.73×
TGN      98.34         17.7                99.62    2.1            8.51×
APAN     98.12         8.8                 98.14    2.0            4.38×

Reddit:
         Baseline AP   Baseline Time (s)   TGL AP   TGL Time (s)   Speedup
JODIE    96.56         89.0                99.45    4.2            21.24×
DySAT    -             -                   98.57    21.5           -
TGAT     97.82         576.2               99.48    39.9           14.45×
TGN      98.47         91.9                99.78    10.5           8.33×
APAN     99.22         121.7               99.24    8.8            13.85×

MOOC (TGL only):                 LastFM (TGL only):
         AP      Time (s)                 AP      Time (s)
JODIE    98.95   2.8             JODIE    78.78   8.7
DySAT    98.76   19.5            DySAT    76.39   48.4
TGAT     98.50   24.5            TGAT     54.82   91.4
TGN      99.59   5.7             TGN      73.76   18.7
APAN     98.58   5.6             APAN     62.73   18.2
Figure 4.5: Validation AP with training time (left) and normalized runtime breakdown (right) on the
Wikipedia dataset. The circled numbers refer to the six steps in Figure 4.2.
Table 4.4: Dynamic node classification results (First, Second.)

         AP                      F1-Micro
         Wikipedia   Reddit      GDELT    MAG
JODIE    81.73       70.91       11.25    43.94
DySAT    86.30       61.70       10.05    50.42
TGAT     85.18       60.61       10.04    51.72
TGN      88.33       63.78       11.89    49.20
APAN     82.54       62.00       10.03    -
dataset, followed by JODIE, DySAT, and TGAT. The pure memory-based TGNNs JODIE and APAN achieve top-tier
accuracy with the fastest training times. With efficiently implemented components and an optimized data
path, TGL achieves an average of 13× speedup in the per-epoch training time.
Figure 4.5 shows the convergence curve and runtime breakdown on the Wikipedia dataset. With our
temporal parallel sampler, the sampling overhead in TGL is negligible. For computation-intensive two-layer
TGNNs like DySAT and TGAT, the runtime is dominated by the computation on GPU. For memory-based
models, the time spent in updating the node memory and the mailbox takes up to 30% of the total training
time.
Table 4.4 shows the results of directly using the learned TGNN models on the dynamic node classification
task. On the Wikipedia and Reddit datasets, the node classification tasks are to identify banned users. Since
the number of positive labels is small compared with the number of negative labels, we train the MLP
Figure 4.6: Validation loss (moving average of 5 epochs) with different chunk sizes when using the random
chunk scheduling algorithm with a large batch size on the Wikipedia (left) and Reddit (right) datasets. We
denote the batch size x and the number of chunks per batch y as x − y in the legends.
classifiers with an equal number of randomly sampled negative labels, similar to training link prediction
models. We also report the accuracy as AP on both the positive nodes and the sampled negative nodes. TGN
and JODIE achieve the highest AP on the Wikipedia and Reddit datasets, respectively, and JODIE achieves
more than 7% higher AP than the other methods on the Reddit dataset. We attribute this to the noisy
neighbors in the Reddit dataset, which prevent highly expressive models from learning general patterns on
the graph structure.
4.2.4.4 Random Chunk Scheduling
To evaluate the effectiveness of the random chunk scheduling technique, we train the TGN model which
has the best overall performance on the two small-scale datasets, as training with small batch size and
plotting various convergence curves on the large-scale datasets is too slow. To make a fair comparison, we
train the baseline models with the best group of hyperparameters (0.001 learning rate, 600 batch size). We
then increase the batch size and also linearly increase the learning rate, as a larger batch size leads to a
better approximation of the total loss [33]. Specifically, we train the same model with 8× the batch size and
learning rate (0.008 learning rate, 4800 batch size) with chunk sizes of 4800, 300, and 150 (i.e., 1, 16, and
32 chunks per batch). Since the node memory used in the validation process is inherited
Figure 4.7: Normalized training time per epoch with different numbers of GPUs (1, 2, 4, and 8) on the
GDELT dataset for JODIE, DySAT, TGAT, TGN, and APAN.
from the training process, when we compute the validation loss, we first reset the node memory and use
a constant batch size of 600 to run one whole epoch on the training and validation set. Figure 4.6 shows
the validation loss under different batch sizes and chunk sizes on the Reddit and Wikipedia datasets. The
models trained with a batch size of 4800 and no random chunk scheduling stop learning on both datasets
after 5 to 10 epochs, due to the dependencies lost within the training mini-batches. On the Wikipedia dataset,
the batch size of 4800 with 16 chunks per batch performs better than no chunks while the same batch size
with 32 chunks per batch achieves similar convergence after 30 epochs. On the Reddit dataset, our random
chunk scheduling technique also mitigates the overfitting issue and achieves close to baseline convergences
within 30 epochs.
4.2.4.5 Multi-GPU Training
We evaluate the performance of TGL on the two large-scale datasets with multiple GPUs. We use the
p3dn.24xlarge instance on EC2 with 96 virtual CPUs, 768GB of main memory, and 8 Nvidia V100 GPUs. We
use 64 threads in the temporal parallel sampler and assign 8 threads for each trainer process. We use a local
batch size of 4000 positive and 4000 negative edges on each GPU. The global copy of the node memory and
the mailbox are stored in the shared memory. The trainer process then overlaps the MFG copy to GPU with
the computation on GPU by creating additional copying threads on different CUDA streams. The gradients
in each iteration are synchronized among the trainer processes through the NCCL backend.
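This setup can be sketched with standard PyTorch primitives as follows; the sketch is an illustrative simplification (process-group initialization details, the shared-memory node memory and mailbox, and the mini-batch structure are assumed or omitted), not the actual TGL trainer.

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def prefetch(batch, device, stream):
    # Launch host-to-device copies of a pinned CPU mini-batch on a side stream.
    with torch.cuda.stream(stream):
        return {k: v.to(device, non_blocking=True) for k, v in batch.items()}

def trainer(rank, world_size, model, optimizer, batches):
    # One trainer process per GPU; gradients are synchronized through NCCL.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    device = torch.device(f"cuda:{rank}")
    torch.cuda.set_device(device)
    model = DDP(model.to(device), device_ids=[rank])
    copy_stream = torch.cuda.Stream()

    it = iter(batches)                      # yields pinned-memory CPU tensors
    next_gpu = prefetch(next(it), device, copy_stream)
    for batch in it:
        torch.cuda.current_stream().wait_stream(copy_stream)  # prefetched copy is ready
        cur_gpu, next_gpu = next_gpu, prefetch(batch, device, copy_stream)
        optimizer.zero_grad()
        loss = model(cur_gpu)               # forward on the current mini-batch
        loss.backward()                     # DDP all-reduces gradients here
        optimizer.step()
    # (handling of the final prefetched mini-batch omitted for brevity)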
Table 4.5: Link prediction results of TGL on the GDELT and MAG datasets. The Time columns refer to the
training time per epoch. (First, Second.)
GDELT MAG
AP Time (s) AP Time (s)
JODIE 97.98 599.2 99.41 4128.3
DySAT 98.72 10651.4 98.27 19748.6
TGAT 96.49 8499.2 99.02 32104.5
TGN 99.39 915.9 99.49 8912.5
APAN 95.28 1358.5 - -
Table 4.5 shows the AP and running time in the link prediction task. Similar to the single GPU results,
TGN achieves the highest AP and JODIE has the fastest training time. On the GDELT dataset, the memory-based
3 hours. On the MAG dataset, APAN throws an out-of-memory error as it requires the mailbox to store the 10
most recent mails for each node in the graph. Figure 4.7 shows the scalability of TGL on multiple GPUs.
TGL achieves 2.74×, 2.28×, 2.25×, 2.30× and 1.80× speedup by using 4 GPUs for JODIE, DySAT, TGAT,
TGN, and APAN, respectively. With 8 GPUs, the CPU-to-main-memory bandwidth used to slice the node and
edge features and to update the node memory and the mailbox, as well as the PCI-E channels used to copy
the MFGs to the GPUs, become saturated.
Table 4.4 shows the F1-Micro of the trained models in the multiple-class single-label dynamic node
classification task. On the GDELT dataset, all models perform poorly, with JODIE and TGN performing
slightly better than the others. On the MAG dataset, TGAT and DySAT, which have two full graph attention
layers, achieve the highest and second-highest accuracy, while JODIE, with no graph attention layer, achieves
the lowest accuracy.
4.3 DistTGL: Distributed Memory-Based TGNN Training
On dynamic graphs, the number of related events on each node increases as time evolves. When this
number is large, neither temporal attention-based aggregation nor historical neighbor sampling methods
Figure 4.8: (a) Test accuracy on the GDELT dataset under different batch sizes. (b) Time per epoch spent in
reading and writing the node memory on different numbers of machines.
allow TGNNs to capture the entire history. To compensate for the lost history, researchers have designed
Memory-based Temporal Graph Neural Networks (M-TGNNs) [56, 95, 79, 105] that maintain node-level
memory vectors to summarize independent node history. The node memory in M-TGNNs not only allows
the aggregator to gather information from fewer historical neighbors but also enlarges the receptive field
because the node memory vectors already contain information from multiple hops away. As a result, the state-of-the-art M-TGNN TGN [79] only requires a single GNN layer with some recent neighbors as supporting
nodes. In the benchmark in TGL [131], M-TGNNs fill out the top ranks both in accuracy and training time.
Despite the success of M-TGNNs, it is difficult to deploy them to large-scale production applications due
to their poor scalability. The auxiliary node memory creates temporal dependencies and requires training
mini-batches to be small and scheduled in chronological order. Specifically, there are two major challenges
in exploiting data parallelism in M-TGNN training. First, simply increasing the batch size loses the temporal
dependency information between events and leads to information loss. Figure 4.8(a) shows that the accuracy
decreases as the batch size increases on the GDELT dataset. On smaller datasets, this decrease in accuracy
is usually observed for relatively small batch sizes of around 10^2 to 10^3 edges [79], which are not big enough to
appreciate the speedup provided by multi-GPU data parallelism. Second, all the trainers need to access and
maintain a unified version of node memory, leading to an enormous amount of remote traffic in distributed
systems. Unlike static GNN training, the remote accesses to the node memory (typically hundreds of
megabytes per mini-batch) have strict temporal dependencies. Due to these excess and interdependent
remote accesses, distributed training is often slower than single-machine training. Figure 4.8(b) shows
the case when the node memory is distributed to all machines where each machine owns a unique and
equally-sized portion. Furthermore, the remedy to cross-machine traffic in static GNN training [126, 125,
9] — graph partitioning technique METIS [52], is not applicable to dynamic graphs. As a result, on both
small- and large-scale datasets, the training time of the state-of-the-art M-TGNN framework [131] using 8
GPUs on a single node is 10 − 100× slower than state-of-the-art distributed static GNNs [126, 124], with
an unsatisfactory 2-3× speedup over a single GPU.
Therefore, we propose DistTGL — an efficient and scalable solution to train M-TGNNs on distributed
GPU clusters. DistTGL improves the existing M-TGNN training solutions from three perspectives:
• Model: We enhance the node memory in M-TGNNs by adding additional static node memory, which
improves both the accuracy and convergence rate.
• Algorithm: We design a novel training algorithm to overcome the challenges of accuracy loss and
communication overhead in distributed scenarios.
• System: We build an optimized system adopting prefetching and pipelining techniques to minimize
the mini-batch generation overhead.
Compared with existing methods, DistTGL has significant improvement in convergence and training
throughput. To the best of our knowledge, DistTGL is the first work that scales M-TGNN training to
distributed GPU clusters. DistTGL is publicly available on GitHub∗
. Our main contributions are
∗ https://github.com/amazon-science/disttgl
• Based on the unique characteristics of M-TGNN training, we propose two novel parallel training
strategies — epoch parallelism and memory parallelism, which allow M-TGNNs to capture the same
number of dependent graph events on multiple GPUs as on a single GPU.
• We provide heuristic guidelines to determine the optimal training configurations based on the dataset
and hardware characteristics.
• To overlap mini-batch generation and GPU training, we serialize the memory operations on the node
memory and efficiently execute them by an independent daemon process, avoiding complex and
expensive synchronizations.
• In experiments, DistTGL achieves near-linear speedup when scaling to multiple GPUs in convergence
rate, outperforming the state-of-the-art single-machine method [131] by more than 10×.
4.3.1 Batched M-TGNN Training
Since the training of M-TGNNs needs to be synchronized with the node memory, the training samples need
to be scheduled chronologically. Theoretically, the node memory of a node needs to be immediately updated
after a relevant graph event occurs on that node so that later dependent nodes can use this up-to-date node
memory in the message passing process. Without changing the algorithm, we can process consecutive graph
events that do not have overlapping nodes in batches by updating their node memory in parallel. However,
this limits the batch size to no more than a few graph events on most dynamic graphs. In practice, the tiny
batch size is computationally infeasible on modern hardware, such as GPU, intended for highly paralleled
programs. To solve this problem, M-TGNNs process the incoming graph events in larger fixed-size batches
and update the node memory for the nodes that have new mails once per batch to reduce the computation
Figure 4.9: Overview of the inaccuracy (staleness and information loss) in node memory caused by batched training.
time. Let {m_u} be the set of mails generated at node u in a batch of graph events; s_u is then updated using
a COMB(·) function

s_u = UPDT(s_u, COMB({m_u})). (4.1)
Note that the mails {m_u} are not computed using the up-to-date node memory (since it is not computed yet) but
using the outdated node memory from the last batch of graph events. In TGN-attn, the COMB(·) function
simply outputs the most recent mail. This batching approach both updates the node memory in batch and
computes the attention-based message passing in batch. The batched update to node memory causes two
types of inaccuracy in the node memory — staleness and information loss (Figure 4.9). The staleness in
the node memory refers to the problem where the node memory is not up-to-date due to the reversed
computation order to avoid the information leak problem. The information loss in the node memory refers
to the node memory not being updated by the mails that are filtered out by the COMB(·) function as well
as the inaccuracy of the mails due to using outdated node memory. When the batch size is increased, both
the staleness and information loss in the node memory increase, resulting in lower accuracy [79]. Besides
these two types of inaccuracy, another common inaccuracy in sequence models is the inaccuracy due to
not re-computing the hidden embeddings when the weights are updated, which generally does not affect
the performance.
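To make Equation 4.1 concrete, a simplified sketch of a batched update with a GRU-based UPDT(·) and a most-recent-mail COMB(·) (as in TGN-attn) is given below; the tensor shapes, the deduplication logic, and the handling of gradients are illustrative assumptions rather than the actual implementation.

import torch
import torch.nn as nn

class BatchedMemoryUpdater(nn.Module):
    # Update node memory once per batch: s_u = UPDT(s_u, COMB({m_u})).
    def __init__(self, mem_dim, mail_dim):
        super().__init__()
        self.updater = nn.GRUCell(mail_dim, mem_dim)    # UPDT(.)

    def forward(self, memory, node_ids, mails, mail_ts):
        # memory: (|V|, mem_dim) node memory from the previous batch (treated as a
        # non-learnable buffer here; the real scheme caches mails so the memory can
        # receive gradients).
        # node_ids: (M,) nodes receiving mails; mails: (M, mail_dim) built from the
        # *outdated* node memory; mail_ts: (M,) mail timestamps.
        order = torch.argsort(mail_ts)                  # oldest -> newest
        node_ids, mails = node_ids[order], mails[order]
        uniq, inverse = torch.unique(node_ids, return_inverse=True)
        pos = torch.arange(len(node_ids), device=mails.device)
        # COMB(.): keep only the most recent mail of each node.
        last = torch.zeros(len(uniq), dtype=torch.long, device=mails.device)
        last = last.scatter_reduce(0, inverse, pos, reduce="amax", include_self=False)
        combined = mails[last]
        # UPDT(.): one GRU step for every node that received new mail.
        memory[uniq] = self.updater(combined, memory[uniq])
        return memory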
4.3.2 Related Works
Dynamic graph representation learning plays an important role in many real-world problems. Many
discrete TGNNs [71, 84, 32, 37], continuous TGNNs [95, 116, 79, 105], and non-GNN methods [94, 106] are
proposed to learn node embeddings on dynamic graphs. There are many existing works that accelerate the
message passing scheme in GNNs on a single node [103, 28] and on distributed GPU clusters [125, 126, 9,
124, 1]. In discrete TGNNs, the propagation within a graph snapshot is the same as in static GNNs, so these
existing methods can be directly applied. There are also some existing works that specialize in discrete
TGNNs on a single GPU [121, 127] and distributed systems [10]. However, these methods do not apply to
continuous M-TGNNs due to the unique propagation rule of M-TGNNs. Accelerating continuous M-TGNNs
is challenging due to the aforementioned antithesis between training speed and accuracy. Distributed
M-TGNN training is even more challenging due to the high volume of data synchronization. There are a
few works that accelerate M-TGNNs training. TGL [131] proposes a general framework for single-node
multiple-GPU continuous TGNNs. However, TGL does not support distributed GPU clusters. The speedup
of TGL on multiple GPUs in a single machine is also unsatisfactory, only achieving 2 − 3× speedup on 8
GPUs. EDGE [16] proposes to speed up the training by replacing the dynamic node memory of active nodes
with static learnable node memory, gambling on the chance that active nodes have stable embeddings.
To the best of our knowledge, there is no existing work for M-TGNN training that achieves near-linear
scalability on single-node multiple-GPU or operates on distributed GPU clusters. For the inference task,
TGOpt [108] proposes to accelerate TGNN inference by de-duplication, memoization, and pre-computation.
Another work [130] proposes a system-architecture co-design that accelerates M-TGNN inference on FPGAs.
Unfortunately, these techniques do not apply to M-TGNN training.
Figure 4.10: Overview of DistTGL training with 2 × 2 × 2 (mini-batch × epoch × memory) parallelism on
two four-GPU machines, where each machine hosts one memory daemon process and four trainer processes.
Mini-batch parallelism exploits parallelism within each mini-batch, epoch parallelism exploits parallelism
among multiple epochs, and memory parallelism exploits parallelism among different time segments. For
simplicity and easier understanding, we draw the reads and writes to the node memory at the beginning
and end of each training iteration. In our optimized system, they are performed asynchronously with the
training iterations and are fully overlapped with the GPU computation.
Figure 4.11: Accuracy differences of each node with static and dynamic node memory on the Wikipedia
dataset, sorted by node degrees. Positive bars (in the dynamic>static region) indicate that dynamic node
memory has better accuracy than static node memory for those nodes, and vice versa.
4.3.3 Approach
We propose DistTGL — an efficient and scalable solution to train M-TGNNs on distributed GPU clusters.
DistTGL achieves scalability through improvements from three perspectives: model, algorithm, and system.
From the model perspective, we introduce the static node memory that explicitly separates the time-irrelevant node information. From the algorithm perspective, we propose two novel parallel training
strategies and a method to determine the best combination of these strategies on any given dataset and
hardware configuration. From the system perspective, we design an efficient system to reduce and overlap
mini-batch generation overhead with GPU training. We introduce these improvements in the three following
subsections.
4.3.3.1 M-TGNN Model with Static Node Memory
M-TGNNs rely on node memory to summarize the node history. Previous work [16] argues that the node
memory of nodes with active interactions is static. While this may be true on some evolving graphs like
citation graphs, it fails on the dynamic graphs where high-frequency information is important, such as
in fraud detection [87]. Figure 4.11 shows the comparison of the accuracy in the temporal link prediction
task that predicts destination nodes from source nodes using static and dynamic node memory. We do not
Figure 4.12: Validation MRR versus training iteration on the Flights and MOOC datasets with and without
pre-trained static node memory, using 1, 2, 4, and 8 GPUs.
observe any noticeable inclination for higher-degree nodes to favor static node memory or vice versa. We
also observe similar results on the other datasets used in this work.
We believe that a general TGNN model should be able to capture both the dynamic and static node
information of all nodes. In DistTGL, we separate the static and dynamic node memory and capture
them explicitly. DistTGL keeps the original GRU node memory on all nodes to capture the dynamic node
information and implements an additional mechanism to capture the static node information. There are two
major benefits brought by this additional static node history. First, it enhances the capability of M-TGNNs
to capture node history with bursts of interactions. Due to the batched updating of the node memory, if a node
interacts with others many times in a short time period, it is inevitable that the COMB(·) function used in
the dynamic node memory would filter out most of these interactions, resulting in a loss of high-frequency
information. The static node memory, combined with the time encoding [116] in the temporal attention
aggregator, could boost the performance in such cases. Second, the static node memory explicitly separates
the information irrelevant to batch sizes, which improves the performance of data parallelized training.
Since the static node memory is independent of time, all graph events can be used to supervise the training
process, allowing it to capture all static information regardless of batching. In this work, since most dynamic
graphs do not have node features, we use learnable node embeddings pre-trained with the same task as the
static node memory due to its simplicity. The pre-training of these embeddings can be easily done in any
well-optimized distributed static GNN frameworks [125, 126, 9, 124, 1]. Note that the static node memory
is similar to learnable weights in the M-TGNN models and does not include any information in the test
set. On the other hand, the dynamic node memory contains information in the test set and would cause
information leaks if not handled properly. DistTGL also supports other kinds of learnable or non-learnable
static node memory, such as co-trained embedding tables or even node embeddings generated by static
GNNs.
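To make this mechanism concrete, the following is a minimal PyTorch sketch (an illustration, not DistTGL's actual implementation) of keeping a GRU-based dynamic node memory alongside a frozen pre-trained static node memory; combining the two memories by concatenation before the attention aggregator is an assumption made only for this example.

    import torch
    import torch.nn as nn

    class NodeMemory(nn.Module):
        def __init__(self, num_nodes, dim, pretrained_static):
            super().__init__()
            # dynamic node memory, updated by a GRU from the combined mails (graph events)
            self.register_buffer('dynamic', torch.zeros(num_nodes, dim))
            self.gru = nn.GRUCell(input_size=2 * dim, hidden_size=dim)
            # static node memory: pre-trained embeddings treated like model weights,
            # so they contain no test-set information and cannot leak labels
            self.static = nn.Embedding.from_pretrained(pretrained_static, freeze=True)

        @torch.no_grad()  # memory updates are kept out of the autograd graph in this sketch
        def update(self, nodes, mails):
            self.dynamic[nodes] = self.gru(mails, self.dynamic[nodes])

        def read(self, nodes):
            # the temporal attention aggregator consumes both memories of a node
            return torch.cat([self.dynamic[nodes], self.static(nodes)], dim=-1)

    mem = NodeMemory(num_nodes=100, dim=8, pretrained_static=torch.randn(100, 8))
    mem.update(torch.tensor([1, 2]), torch.randn(2, 16))
    print(mem.read(torch.tensor([1, 2])).shape)  # torch.Size([2, 16])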
Figure 4.12 shows the two datasets that have the most significant improvement with pre-trained static
node memory. On a single GPU, our improved model achieves remarkably better accuracy on both datasets
and a smoother convergence curve on the Flights dataset (we do not show the curves for multi-GPU for
a clearer visualization). On the MOOC dataset, our model with static node memory also improves the
scalability in convergence on multiple GPUs using epoch parallelism.
4.3.3.2 Parallel Training Algorithm
A straightforward approach to train M-TGNNs in parallel is to process the graph events in large global
batches and distribute them to multiple trainers, which is used by TGL [131] in the setting of multiple GPUs
on a single node. We refer to this approach as the mini-batch parallelism, which relaxes the inter-batch
dependencies in node memory. However, the key to achieving good accuracy in multi-GPU M-TGNN
training is to maintain the temporal dependency when the graph events are processed in large batches.
To solve this problem, we propose two novel parallel training strategies — epoch parallelism and memory
parallelism. Epoch parallelism relaxes the dependencies in the node memory due to weight updates and
trains different epochs simultaneously on different trainers. Memory parallelism trades space for accuracy
by maintaining multiple copies of the node memory at different timestamps. In the rest of this section, we
Figure 4.13: Overview of mini-batch parallelism, epoch parallelism, and memory parallelism on three
trainer processes. The “R” and “W” denote read and write operations to the shared node memory. In
epoch parallelism, the arrows denote cross-process communication to send mini-batch data. In memory
parallelism, the arrows denote cross-process communication to send the updated node memory.
first introduce the three types of parallelism and their advantages and disadvantages. Then, we discuss
how to design an optimal training algorithm given any task specifications and hardware configurations.
Mini-Batch Parallelism. Mini-batch parallelism simply trains a large global batch on multiple trainers
in parallel. On n GPUs, a global batch of graph events is evenly divided into n local batches, where each
GPU is responsible for computing the output embeddings of one local batch. Figure 4.13(a) shows the case
when a global batch is divided into three local batches on three trainers. Since the global mini-batches are
generated in chronological order, we also split them into local mini-batches chronologically and ignore the
intra-dependency within each global mini-batch. Specifically, these n trainers first fetch the node memory
and cached mails of the assigned root nodes and their supporting nodes. Then, they compute the forward
and backward propagation and update the model weights. Before they use the computed node memory
to update the node memory and cached mails, they need to make sure all trainers have finished the fetch
operations to avoid Write-After-Read (WAR) hazards. However, to ensure the model weights can receive
enough feedback in the backward propagation, we do not update the node memory and cached mails of the
supporting nodes and re-compute them when they are referenced later. Because the fetch and update of the
node memory are done simultaneously in all trainers, the node embeddings generated for later graph events
in the global batch cannot perceive the earlier graph events, incurring both staleness and information loss
in the node memory. In addition, mini-batch parallelism requires all trainers to maintain the same copy of
node memory, which leads to enormous communication overhead on distributed systems.
Epoch Parallelism. Epoch parallelism leverages data parallelism by training different epochs simultaneously using only one copy of the node memory. In the vanilla M-TGNN training, self-supervised by
temporal edges on a single GPU, we first sample some negative destination nodes for the root nodes in
mini-batch i. We then collect the supporting nodes for all positive and negative root nodes and fetch their
node memory and cached mails. In the later epochs, for the same root nodes in mini-batch i, we sample
different sets of negative destination nodes and follow the same procedure to get their node memory and
cached mails. To train on the same mini-batches in different epochs in parallel on n trainers, we ignore the
difference in node memory due to weight updates in the last n − 1 epochs. Thus, we can prepare one set of
inputs of the positive nodes and n sets of inputs of the negative nodes and train them in parallel. Note that
these mini-batches need to be scheduled in different iterations so that the gradients of positive nodes are not
simply multiplied by n. This scheduling increases the variance of the gradients of the sampled mini-batches,
as the same set of positive nodes is learned for n consecutive iterations. The left part of Figure 4.13(b)
shows the case when applying epoch parallelism to three trainers. In each iteration, trainer P0 fetches
the node memory and cached mails for one positive mini-batch and three negative mini-batches. After
P0 finishes one iteration, it writes to the node memory and sends the prepared mini-batches (one positive
mini-batch and the two unused negative mini-batches) to P1. P1 receives the mini-batches from P0 and
sends them (one positive mini-batch and the one unused negative mini-batch) to P2 after the computation.
Note that only P0 needs to write back the updated node memory to the global copy of node memory in the
main memory. Although the node memory of this mini-batch in P1 and P2 is updated using a more recent
version of the weights, writing them to the global copy would lead to Read-After-Write (RAW) hazards
with later training iterations. We also tried a finer-grained updating policy which updates nodes that do
not have this RAW hazard in P1 and P2. However, it does not outperform the original policy. To reduce the
cross-trainer communication, we further optimize the algorithm by reordering the mini-batches so that each
trainer works on the same positive samples (with different negative samples) for n consecutive iterations
(see the right part in Figure 4.13(b)). However, epoch parallelism still requires all trainers to access the same
node memory, which is impractical on distributed systems.
Memory Parallelism. Memory parallelism trades space for time by training different time segments of
the dynamic graph simultaneously using separate copies of node memory. The left part in Figure 4.13(c)
shows the case when applying memory parallelism on a dynamic graph with 6 mini-batches with three
trainers and three copies of node memory. Each trainer is only responsible for one-third of the whole
dynamic graph, i.e., a time segment of two consecutive mini-batches. In every iteration, each trainer needs
to fetch its own node memory and cached mails. The design on the left requires the intermediate node
memory to be transferred across the processes after the trainers finish their time segments. For example, P0
needs to send the node memory of all the nodes in the graph to P1 after iteration 1, which is expensive in
distributed systems. To solve this problem, we reorder the mini-batches across the trainers (see the right part
in Figure 4.13(c)) so that each trainer trains sequentially on all the segments using its own node memory.
Since each trainer owns its individual node memory, there is no synchronization of the node memory across
the trainers, making it the only suitable strategy for distributed systems.
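As a concrete illustration of the reordered schedule (the right part of Figure 4.13(c)), the short Python sketch below computes which mini-batch each trainer processes in each iteration; the function name is hypothetical, and the schedule simply lets trainer r start at its own time segment and then wrap around.

    def memory_parallel_schedule(num_trainers, num_minibatches, num_iterations):
        # each trainer sweeps over all mini-batches with its own copy of the node
        # memory, so no node-memory synchronization is needed across trainers
        seg_len = num_minibatches // num_trainers
        return [[(r * seg_len + t) % num_minibatches for r in range(num_trainers)]
                for t in range(num_iterations)]

    # 3 trainers and 6 mini-batches, matching the example in Figure 4.13(c)
    for it, batches in enumerate(memory_parallel_schedule(3, 6, 4)):
        print(f"iteration {it}: " + "  ".join(f"P{r}->mb{b}" for r, b in enumerate(batches)))
    # iteration 0: P0->mb0  P1->mb2  P2->mb4
    # iteration 1: P0->mb1  P1->mb3  P2->mb5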
Optimal Training Algorithm. The aforementioned three parallelization strategies all have their own
unique characteristics. We summarize their advantages and disadvantages in Table 4.6. To achieve optimal
training performance, we provide heuristic guidelines for DistTGL users to combine these strategies,
exploiting their advantages while offsetting their disadvantages. Consider a distributed system with p machines and q
GPUs per machine. Let i × j × k = p × q be a training configuration, where i represents how many GPUs
compute each mini-batch, k represents how many copies of node memory to maintain, and j represents
Table 4.6: Summary of the three parallel training strategies on n trainers. The comparison with single-GPU
training is made based on the same local batch size. The “Training overhead” row refers to the overheads in
mini-batch generation at the beginning of each training iteration. The advantages are marked in bold.

                                  Mini-batch Parallelism     Epoch Parallelism          Memory Parallelism
Captured dependency               less than single-GPU       same as single-GPU         same as single-GPU
Training overhead                 same as single-GPU         n times single-GPU         same as single-GPU
Main memory requirement           same as single-GPU         same as single-GPU         n times single-GPU
Synchronization across trainers   weights and node memory    weights and node memory    weights only
Gradient descent variance         same as single-GPU         more than single-GPU       same as single-GPU
Figure 4.14: Number of captured events in the node memory with different batch sizes (300, 600, 1200,
2400, and 4800), sorted by node degree from high to low on the Wikipedia dataset.
how many epochs to train in parallel for each copy of node memory. We determine the optimal choice of
(i, j, k) from task requirements and hardware configurations. There are two constraints from the hardware
side. First, we need to have k ≥ p as memory parallelism is the only strategy that does not synchronize
node memory across the trainers. Then, the main memory of each machine should be able to hold k/p
copies of node memory and cached mails, or at least hold sufficient cache if using the disk-based memory
caching storage option.
Under these constraints, we first determine i according to the largest batch size. Figure 4.14 shows
that when the batch size increases, fewer graph events would be captured in the node memory, especially
for high-degree nodes. DistTGL users can set a threshold for the amount of missing information so that
DistTGL can work backward to determine the largest admissible batch size. For applications where high-frequency information
is crucial, we can set a stricter threshold for high-degree nodes. Based on this batch size, i can be determined
according to the GPU specifications. For j and k, we always prefer to apply memory parallelism since it
leads to better convergence, which we have also verified from experiments (see Figure 4.15.(b)). In summary,
we first determine i based on task requirements, then k based on hardware specification, and lastly j is
fixed as j = (p × q)/(i × k).
For example, on a distributed system with 4 machines and 8 GPUs in each machine, we determine the
largest batch size is 3200 edges. The GPU saturates when batch size is larger than 1600 edges. So we first
set local batch size to be 1600 edges and i = 2. The main memory of each machine can hold two copies of
the node memory. Then we set k = 4 × 2 = 8, the maximum memory parallelism the hardware allows. Finally, j is fixed to be 32/(2 × 8) = 2.
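The heuristic above can be summarized in a few lines of code. The following Python sketch is illustrative only (the function name, inputs, and thresholds are assumptions, not part of DistTGL); it reproduces the example configuration in the previous paragraph.

    def choose_parallelism(p, q, largest_batch, gpu_saturation_batch, mem_copies_per_machine):
        # p machines with q GPUs each; i * j * k must equal p * q
        total_gpus = p * q
        # 1) mini-batch parallelism i: enough GPUs so that each local batch saturates one GPU
        i = min(max(1, largest_batch // gpu_saturation_batch), total_gpus)
        # 2) memory parallelism k: as large as main memory allows (k >= p, k/p copies per machine)
        k = max(min(p * mem_copies_per_machine, total_gpus // i), p)
        # 3) epoch parallelism j fills the remaining GPUs
        j = total_gpus // (i * k)
        assert i * j * k == total_gpus, "configuration must use all GPUs"
        return i, j, k

    # 4 machines x 8 GPUs, largest batch 3200 edges, GPU saturates at 1600 edges,
    # two copies of node memory per machine -> (i, j, k) = (2, 2, 8)
    print(choose_parallelism(4, 8, 3200, 1600, 2))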
4.3.3.3 Distributed Training System
Designing a scalable distributed training system for M-TGNNs is not trivial. Even for the most straightforward mini-batch parallelism, previous work [131] only achieves 2-3× speedup using 8 GPUs on a single
node due to excessive overheads in the mini-batch generation. We solve this issue by prefetching the
mini-batches in a separate process and pipelining the sub-tasks (loading from disk, slicing features, slicing
node memory, writing back to node memory) within one mini-batch generation. Figure 4.10 shows an
overview of DistTGL serializing the memory operations and executing them asynchronously on separate
processes. Here, we focus on describing the most important design that handles the reads and writes to the
node memory. As memory parallelism works on separate copies of node memory, which has no dependency
and can be easily parallelized, we consider the case for each i × j trainer group that shares the same copy
of the node memory. Since k ≥ p, each trainer group must have all the processes on the same physical
Algorithm 6: Memory Daemon Process
Input   : read_idx_buf, mem_write_buf, mail_write_buf, write_idx_buf
Modify  : read_status, write_status
Output  : mem_read_buf, mail_read_buf
 1 repeat
 2     reset memory and mail;
 3     rank = 0;
 4     repeat
 5         for r in [rank, rank + j) do in parallel
 6             wait until write_status[r] == 1;
 7             write to memory from mem_write_buf[r];
 8             write to mail from mail_write_buf[r];
 9             write_status[r] = 0;
10         end
11         rank += i;
12         rank = 0 if rank == i × j;
13         for r in [rank, rank + j) do in parallel
14             wait until read_status[r] == 1;
15             for jj in [0, j) do in parallel
16                 slice memory to mem_read_buf[r][jj];
17                 slice mail to mail_read_buf[r][jj];
18             end
19             read_status[r] = 0;
20         end
21     until epoch end;
22 until training end;
machine. Within each i × j group, the memory operations can be serialized as a spin lock acting on each i
sub-group. For example, for i × j = 2 × 2, we have the memory access sequence
(R0R1)(W0W1)(R2R3)(W2W3)(R0R1)(W0W1)· · · ,
where Ri and Wi denote read and write requests from trainer i, and there is no ordering for the requests
within each bracket.
In DistTGL, instead of implementing an expensive cross-process lock mechanism, we launch an additional memory daemon process for each group of i × j trainer processes to handle the read and write
requests for all the trainers in that group. Let bs be the local batch size, d be the number of sampled
supporting nodes for each root node, and dmem be the dimension of the node memory. The memory process
allocates the following buffers, which are shared with the trainers:
• mem_read_buf of size [i × j, j, bs × d, dmem] that holds the results of the memory read requests.
• mail_read_buf of size [i × j, j, bs × d, 2dmem] that holds the results of the mail read requests.
• read_idx_buf of size [i × j, j, bs × d + 1] that holds the indexes of the read requests and their length.
• mem_write_buf of size [i × j, bs, dmem] that holds the input of the memory write request.
• mail_write_buf of size [i × j, bs, 2dmem] that holds the input of the mail write request.
• write_idx_buf of size [i × j, bs + 1] that holds the indexes of the write requests and their length.
• read_status of size [i × j] that indicates the status of the read request.
• write_status of size [i × j] that indicates the status of the write request.
Algorithm 6 shows the pseudo-code of the memory daemon process. Each trainer process issues the
read and write requests by copying the inputs to the shared buffers and setting the elements of its rank
in read_status and write_status to be 1. The memory daemon process executes these requests in
serialized order, puts the read results to the buffers, and resets the status. Note that the first read request of
each epoch is not issued, as the results are always all zero matrices right after the initialization.
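The self-contained Python sketch below illustrates the status-flag protocol between a trainer process and the memory daemon process through shared buffers. It is a simplification for illustration only (toy sizes, a single index buffer, no mails, no serialized rank order, and it assumes the fork start method on Linux); it is not DistTGL's actual implementation.

    import multiprocessing as mp
    import numpy as np

    NUM_TRAINERS, BS_D, DMEM = 2, 4, 8   # toy sizes: trainers, bs*d rows, memory dimension

    def daemon(read_status, write_status, idx_buf, mem_write_buf, mem_read_buf, stop):
        memory = np.zeros((16, DMEM), dtype=np.float32)          # global node memory (toy)
        while not stop.is_set():
            for r in range(NUM_TRAINERS):
                if write_status[r] == 1:                          # serve a write request
                    idx = np.frombuffer(idx_buf[r], dtype=np.int64)
                    buf = np.frombuffer(mem_write_buf[r], dtype=np.float32).reshape(BS_D, DMEM)
                    memory[idx] = buf
                    write_status[r] = 0
                if read_status[r] == 1:                           # serve a read request
                    idx = np.frombuffer(idx_buf[r], dtype=np.int64)
                    out = np.frombuffer(mem_read_buf[r], dtype=np.float32).reshape(BS_D, DMEM)
                    out[:] = memory[idx]                          # slice memory into the read buffer
                    read_status[r] = 0

    if __name__ == "__main__":
        read_status = mp.Array('i', NUM_TRAINERS, lock=False)
        write_status = mp.Array('i', NUM_TRAINERS, lock=False)
        idx_buf = [mp.RawArray('q', BS_D) for _ in range(NUM_TRAINERS)]
        mem_write_buf = [mp.RawArray('f', BS_D * DMEM) for _ in range(NUM_TRAINERS)]
        mem_read_buf = [mp.RawArray('f', BS_D * DMEM) for _ in range(NUM_TRAINERS)]
        stop = mp.Event()
        p = mp.Process(target=daemon, args=(read_status, write_status, idx_buf,
                                            mem_write_buf, mem_read_buf, stop))
        p.start()
        # trainer 0 issues a write and then a read for the same node indices
        np.frombuffer(idx_buf[0], dtype=np.int64)[:] = np.arange(BS_D)
        np.frombuffer(mem_write_buf[0], dtype=np.float32)[:] = 1.0
        write_status[0] = 1
        while write_status[0] == 1:
            pass                                                  # spin until the write is served
        read_status[0] = 1
        while read_status[0] == 1:
            pass                                                  # spin until the read result is ready
        print(np.frombuffer(mem_read_buf[0], dtype=np.float32)[:4])   # [1. 1. 1. 1.]
        stop.set()
        p.join()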
4.3.4 Experiments
We perform detailed experiments to evaluate the performance of DistTGL. We implement DistTGL using
PyTorch [72] 1.11.0 and DGL [103] 0.8.2.
4.3.4.1 Datasets
Table 4.7 shows the statistics of the five datasets used for the evaluation. The datasets and the task on each dataset are described below.
Table 4.7: Dataset statistics. The max(t) column shows the maximum edge timestamp (the minimum edge
timestamp is 0 in all datasets). |dv| and |de| show the dimensions of node features and edge features,
respectively. The * mark denotes pre-trained features.

            |V|        |E|            max(t)    |dv|    |de|
Wikipedia   9,227      157,474        2.7e6     100*    172
Reddit      10,984     672,447        2.7e6     100*    172
MOOC        7,144      411,749        2.6e7     100*    -
Flights     13,169     1,927,145      1.0e7     100*    -
GDELT       16,682     191,290,882    1.6e8     413     130
• Wikipedia [57] is a bipartite user-internet page graph where one graph event represents one user
modifying one Wikipedia page. The edge features are extracted from the text with which the users update
the pages. The task on this dataset is temporal link prediction.
• Reddit [57] is a bipartite user-subreddit graph where one graph event represents one user posting to one
subreddit. The edge features are extracted from the text of the post. The task on this dataset is
temporal link prediction.
• MOOC [57] is a bipartite user-course action graph where one graph event represents one user
interacting with one class item (i.e., watching a video, answering a question). The task on this dataset
is temporal link prediction.
• Flights [73] is a traffic graph where each node represents one airport, and each edge represents one
flight between the two airports. The task on this dataset is temporal link prediction.
• GDELT [131] is a knowledge graph tracking events happening all over the world where each node
represents one actor, and each edge represents one event. Since the temporal link prediction task
used in TGL [131] is too simple, we use the 130-dimensional CAMEO code as edge features and set
the task to be a 56-class 6-label dynamic edge classification problem that predicts the rest of the
56-dimensional edge features.
For the temporal link prediction task, to reduce the variance in the validation and test accuracy, we
randomly sample 49 negative destination nodes (for bipartite graphs, we only sample from the other graph
partition) and report the Mean Reciprocal Rank (MRR) of the true destination nodes. For the dynamic edge
classification task, we report the F1-Micro score.
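For clarity, the MRR computation described above can be written in a few lines. The snippet below is a hedged sketch (the function name is illustrative) that ranks the one true destination against 49 sampled negatives per edge and averages the reciprocal ranks.

    import torch

    def mrr(pos_score, neg_score):
        # pos_score: [B] scores of the true destinations; neg_score: [B, 49] scores of the negatives
        rank = 1 + (neg_score >= pos_score.unsqueeze(1)).sum(dim=1)   # rank of the true destination
        return (1.0 / rank.float()).mean().item()

    print(mrr(torch.tensor([2.0, 0.1]), torch.randn(2, 49)))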
4.3.4.2 Model
We use the most efficient one-layer TGN-attn [79] model enhanced with the static node memory. We follow
the original work to set the dimension of node memory to 100 and the number of most recent neighbors
to 10 for each node. We pre-train the static node history with the same GNN architecture but only with
static information using DGL [103]. On the Wikipedia, Reddit, MOOC, and Flights datasets, we pre-train
10 epochs with stochastically selected mini-batches. On the GDELT dataset, we only pre-train 1 epoch.
The pre-training of all datasets takes less than 30 seconds on a single machine. For the Wikipedia, Reddit,
MOOC, and Flights datasets, we set the local batch size to be the largest available batch size 600 [131]. For
the GDELT dataset, the local batch size is set to 3200, limited by the GPU capacity. We set the learning rate
to be linear with the global batch size. To ensure fairness, we keep the total number of traversed edges to
be the same in multi-GPU training. The number of training iterations for x GPUs will be 1/x of that on
a single GPU. On the Wikipedia, Reddit, MOOC, and Flights datasets, we traverse the training events
100 times (100 epochs on a single GPU). On the larger GDELT dataset, we traverse the training events 10
times (10 epochs on a single GPU). On the Wikipedia, Reddit, MOOC, and Flights datasets, we perform
evaluations after every training epoch using the node memory in the first memory process. On the GDELT
dataset, due to the slow evaluation process (as DistTGL only accelerates training), we perform validation
and testing every 2000 training iterations on a randomly selected chunk of 1000 consecutive mini-batches
in the validation and the test set, starting with all-zero node memory and mails.
4.3.4.3 Hardware
All experiments are performed on AWS EC2 cloud using g4dn.metal instances. Each instance has dual Intel
Platinum 8259CL CPUs paired with 384GB ECC-DDR4 memory, 8 Nvidia T4 GPUs with 16GB GDDR6
memory for each GPU, two 900GB NVMe SSDs, and 100Gbps Ethernet connection. We create the instances
in the same rack group to minimize the cross-machine latency. We sample the mini-batches
in advance and store them on the two NVMe SSDs in RAID0 mode to maximize the throughput. The
positive edges in the mini-batches are reused in every epoch. For the negative edges, we observe that in
the temporal link prediction task, a small number of groups of negative edges are enough. So we prepare
10 groups of negative edges and randomly use them in the total 100 epochs. We assign 6 CPU threads for
each trainer and memory process so that the total 96 physical threads can serve the needs for maximum
memory parallelism of k = 8 on a single machine. To further overlap the mini-batch generation with
the GPU computation, we pre-fetch the pre-sampled static information from disks j iterations in advance.
However, the dynamic node memory still needs to be obtained following the serialized order in the memory
process. For all methods, the node memory and cached mails are stored in the main memory and transferred
between CPU and GPU in every training iteration.
4.3.4.4 Convergence
We first evaluate the convergence of DistTGL by comparing the validation accuracy after different numbers
of training iterations and the testing accuracy for the final model. We start with the performance of epoch
parallelism on the Wikipedia, Reddit, Flights, and MOOC datasets, as the largest batch sizes on these
datasets do not allow mini-batch parallelism. Figure 4.15(a) shows the convergence curves of applying 1
(as the baseline), 2, 4, and 8 epoch parallelism. When j = 2, we observe more than 2× speedup for the
number of training iterations before reaching 70%, 80%, and 90% of the best validation accuracy on all four
datasets, especially on the Flights dataset, where the final test accuracy is even higher than the baseline.
Figure 4.15: (a) Convergence curve (validation MRR versus training iteration) of DistTGL with different
epoch parallelism j using 1-8 GPUs on one node on the Wikipedia, Reddit, Flights, and MOOC datasets.
(b) Convergence curve of DistTGL with combinations of different epoch and memory parallelism j × k
using 8 GPUs on one node. The test MRR is shown between parentheses in the legend:
(a) Wikipedia 1×1×1 (0.8354), 1×2×1 (0.8277), 1×4×1 (0.8170), 1×8×1 (0.8122); Reddit 1×1×1 (0.8476),
1×2×1 (0.8463), 1×4×1 (0.8421), 1×8×1 (0.8401); Flights 1×1×1 (0.8086), 1×2×1 (0.8145), 1×4×1 (0.8037),
1×8×1 (0.7685); MOOC 1×1×1 (0.5757), 1×2×1 (0.5652), 1×4×1 (0.5715), 1×8×1 (0.5514);
(b) Wikipedia 1×8×1 (0.8122), 1×4×2 (0.8127), 1×2×4 (0.8116), 1×1×8 (0.8300); Reddit 1×8×1 (0.8401),
1×4×2 (0.8423), 1×2×4 (0.8440), 1×1×8 (0.8447); Flights 1×8×1 (0.7685), 1×4×2 (0.7895), 1×2×4 (0.7981),
1×1×8 (0.7985); MOOC 1×8×1 (0.5514), 1×4×2 (0.5724), 1×2×4 (0.5737), 1×1×8 (0.5739).
Compared with the single-GPU baseline, DistTGL with memory parallelism 1 × 1 × 8 achieves near-linear
convergence speedup with negligible accuracy loss on 8 GPUs.
(a) Test MRR
        k = 1     k = 2     k = 4     k = 8
j = 1   0.8534    0.8346    0.8361    0.8300
j = 2   0.8277    0.8213    0.8116    0.8309
j = 4   0.8170    0.8127    0.8241    0.8165
j = 8   0.8122    0.8012    0.7842    -

(b) Iterations before convergence
        k = 1     k = 2     k = 4     k = 8
j = 1   14274     8418      4392      1830
j = 2   7686      4026      2013      1098
j = 4   4209      2013      1098      549
j = 8   2013      1098      549       -

Figure 4.16: (a) Test MRR and (b) number of iterations before convergence with different epoch parallelism
j and memory parallelism k on the Wikipedia dataset.
We believe that the super-linear scalability is due to the larger global negative batch size, where we observe
similar convergence speed improvement when we increase the number of negative samples during training
for the baseline. Unfortunately, increasing the number of negative samples cannot be used to speed up the
convergence, as the computation complexity is linear in the number of root nodes. When j = 4, epoch
parallelism still manages to achieve linear speedup except on the Flights dataset, which has the largest number of
unique edges [73]. When j = 8, epoch parallelism leads to a significant test accuracy drop and non-linear
speedup. The sub-linear scalability for epoch parallelism, when j is large, is expected as it trains on the
same positive nodes consecutively in multiple iterations, leading to increased variance in the mini-batch
gradients.
Then, on the same four datasets, we fix j × k = 8 and evaluate the convergence with different memory
parallelism. Figure 4.15(b) shows the convergence curves of different epoch and memory parallelism.
Compared with epoch parallelism (1 × 8 × 1), memory parallelism achieves both better validation accuracy
and notably better test accuracy due to better gradient estimation in each mini-batch. In general, the larger
the memory parallelism k is, the better the test MRR. The training configuration with the largest k = 8
achieves linear speedup in convergence compared with the single GPU baseline with only an average of
0.004 drop in test MRR. Figure 4.16 shows the test MRR and the number of training iterations to reach
the best validation MRR of different training configurations when i = 1 and j × k ≤ 32. The experiment
Figure 4.17: Convergence (validation F1 versus training iteration) of DistTGL on the GDELT dataset. The
test F1-Micro is shown between parentheses in the legend: 1×1×1 (0.4831), 8×1×1 (0.4935), 8×1×2 (0.4962),
8×1×4 (0.4896).
results agree with our strategy for optimal training configuration, where we prioritize memory parallelism
over epoch parallelism within the hardware limit.
For the GDELT dataset, we verify that the largest batch size without accuracy loss is larger than the
capacity of one machine (see Figure 4.8(a)), which also agrees with previous work [131]. Hence we follow
our optimal training configuration choosing policy and prioritize mini-batch parallelism. Figure 4.17 shows
the convergence of DistTGL on the GDELT datasets. The single GPU baseline 1 × 1 × 1 converges very
slowly. Increasing the learning rate can speed up the convergence to some extent but will also lower the
accuracy. By contrast, mini-batch parallelism 8 × 1 × 1 enjoys the benefit of a larger batch size and achieves
super-linear speedup. To further speed up training on more trainers, we need to use memory parallelism to overcome the
massive communication overhead across machines. On multiple machines, the combination of memory
parallelism and mini-batch parallelism achieves satisfactory convergence speedup with the highest test
accuracy. We also test the performance of memory and epoch parallelism on the GDELT dataset. Memory
parallelism achieves similar convergence as mini-batch parallelism while epoch parallelism has a slightly
worse performance than the other two.
4.3.4.5 Training Throughput
We evaluate the training throughput of DistTGL on up to four 8-GPU machines. We do not test on
more machines as the training time on the largest GDELT dataset is already less than 30 minutes on four
Figure 4.18: (a) Training throughput (kE/s) of DistTGL on the Wikipedia, Reddit, Flights, MOOC, and
GDELT datasets using 1, 2, 4, and 8 GPUs on one machine and 2×8 and 4×8 GPUs on multiple machines.
We show the parallel training strategies with the best accuracy (memory parallelism on the four small
datasets and mini-batch parallelism on the two large datasets on each node) for each dataset. The bars with
red frames denote the optimal training configuration on different numbers of GPUs. (b) Training throughput
per GPU (kE/s) of DistTGL compared with TGN and TGL-TGN on the Wikipedia and GDELT datasets.
machines while it only takes a few minutes to train on the smaller datasets. Figure 4.18(a) shows the training
throughput and the speedup compared with the single GPU baseline of the optimal training configuration
on 2, 4, and 8 GPUs on a single machine, 16 GPUs on two machines, and 32 GPUs on four machines. On
8/32 GPUs on 1/4 machines, DistTGL achieves close to linear speedup averaging 7.27/25.08×, respectively.
In terms of absolute throughput, training on the Reddit and Flights datasets is around
10% slower than on the other datasets due to the larger amount of writes to the node memory and cached
mails. Since DistTGL only applies memory parallelism across machines, the memory operations are evenly
distributed to each machine. There is no cross-machine traffic besides the synchronization of model weights,
leading to balanced workloads across the trainers. Due to the small TGNN models with only a few megabytes
of weights, DistTGL also achieves near-linear speedup scaling on distributed systems.
We also compare the performance of DistTGL with the vanilla single GPU implementation TGN [79] and
its optimized version TGL-TGN [131] that supports single-machine multiple-GPU. Figure 4.18(b) shows the
training throughput per GPU of the two baseline methods and DistTGL in different training configurations
on the Wikipedia and GDELT datasets. On the GDELT dataset, TGN does not finish training in 10 hours.
DistTGL with the optimal training configurations (memory parallelism on the Wikipedia dataset and a
combination of mini-batch and memory parallelism on the GDELT dataset) significantly outperforms TGN
and TGL. On 2, 4, and 8 GPUs, DistTGL achieves an average of 1.24×, 1.91×, and 2.93× improvement,
respectively, compared with TGL. The 1 × 1 × 1 single GPU implementation of DistTGL is also faster than
TGL due to our system optimization that overlaps the read and write operations from and to node memory.
On the GDELT dataset, memory parallelism does not scale linearly on 8 GPUs due to the limitation of the
bandwidth between CPU and RAM, whereas the scalability is notably better on multiple machines.
Chapter 5
ViTeGNN: Versatile TGNN Inferencing on FPGAs
5.1 Abstract
Temporal Graph Neural Networks (TGNNs) are powerful models to capture temporal, structural, and
contextual information on temporal graphs, outperforming other methods in many high-impact downstream
tasks. However, achieving high-performance TGNN inference in production environments is challenging
because TGNN models suffer from high computation complexity and intrinsic temporal data dependency that
hinders data parallelism. In addition, real-world TGNN applications have different latency and throughput
requirements. This work presents ViTeGNN, a versatile TGNN inference solution for Memory-based TGNNs
(M-TGNNs) on FPGAs. ViTeGNN performs algorithm-model-architecture co-design to meet the latency and
throughput requirements of real-world TGNN applications. Besides the vanilla inference mode ViTeGNN-bal that updates embeddings for nodes interacting with others, we propose ViTeGNN-lat and ViTeGNN-thpt,
optimized for latency and throughput. Our model optimizations include a lightweight method to compute
attention scores and a related temporal neighbor pruning strategy to reduce computation and memory
accesses. These are holistically coupled with key hardware optimizations that leverage the FPGA hardware.
We propose a novel hardware module to execute the complex neighbor update process efficiently. To
ensure similar accuracy vis-à-vis the original model, the simplified models are trained using the knowledge
distillation technique. We propose a unified hardware design that supports all of these three inference
modes without FPGA reconfiguration. Enabled by our flexible hardware architecture, we further propose
ViTeGNN-auto, which automatically selects the best inference mode at runtime based on latency and
throughput requirements, guided by our accurate performance model. We evaluate the performance of
the proposed hardware accelerator on five real-world datasets. ViTeGNN-bal reduces the computation
complexity by an average of 62% and memory accesses by an average of 36% with only 0.0042 accuracy loss.
Compared with state-of-the-art implementations on CPU and GPU, our FPGA implementation achieves
53.9/26.0/16.1× speedup and 8.2/4.0/2.5× speedup for ViTeGNN-lat/-bal/-thpt, respectively.
5.2 TGNN Inference
In production environments, TGNNs are usually used to compute dynamic node embeddings on the
incoming stream of graph signals for downstream tasks. However, several unique characteristics of TGNNs
make them inefficient for deployment on General Purpose Processors (GPPs). First, TGNNs are significantly
more compute-intensive compared with static GNNs. In order to accurately capture the evolving nature
of temporal neighborhoods, most TGNNs [116, 79, 105, 106] rely on a temporal attention mechanism
(adopted from Transformer [97]) to aggregate features from temporal neighbors along with additional
sequence models like RNNs and GRUs. An artifact of this mechanism is that it requires computing additional
“keys” and “queries” for each temporal neighbor (more than 2× the number of operations than a mean or
max pooling aggregator). Second, graph signals can appear asynchronously at varying rates. Temporal
neighbor sampling and vertex information updates associated with these signals lead to intrinsic sequential
dependencies, which require the system to process small batches. Current TGNN implementations are
mostly GPU-focused, where the coarse-grained parallelism usually leads to significantly worse performance
on small versus large batches [66, 20, 23]. Third, different applications have different latency and throughput
requirements when processing these signals. For example, fraud detection systems [88, 17] require updating
the node embeddings immediately after a graph signal so that fraud activities can be identified and blocked
promptly. On the other hand, recommender systems [100, 85] typically operate on large-scale graphs with
billions and trillions of nodes and edges and require high throughput to process the streaming graph signals.
We believe that while algorithmic optimizations per se are useful in partially solving the above challenges,
a more holistic approach that also leverages fine-grained parallelism, flexibility in logic design, low-latency
on-chip memory (for customized memory access patterns), and high-density resources (programmable DSPs
for customized data paths) of hardware platforms such as FPGAs can provide a superior overall solution.
In this chapter, we present ViTeGNN, an algorithm-model-architecture co-design for high-performance
TGNN inference on FPGA. ViTeGNN has three major innovations:
• Algorithm: ViTeGNN has three different inference modes: ViTeGNN-lat, ViTeGNN-bal,
and ViTeGNN-thpt, serving the different requirements on latency and throughput in different applications. We also propose ViTeGNN-auto that automatically selects the best inference mode at
runtime, guided by our accurate performance model.
• Model: ViTeGNN consists of a suite of algorithmic model optimizations to solve computational and
memory bottlenecks imposed by model constraints.
• Architecture: ViTeGNN has a unified hardware architecture design to support the three inference
modes without runtime FPGA reconfiguration. We develop an efficient hardware mechanism to
support the neighbor list updating of three inference modes and an efficient hardware pipeline for
model inference.
We begin with an analysis of the current computation processes of general M-TGNN inference, along with
a case study evaluating computation-communication characteristics. Based on the identified bottlenecks,
we design specialized inference algorithms optimizing for latency and throughput and perform algorithmic
and hardware-specific optimizations to make TGNN inference computationally tractable with negligible
accuracy loss. We summarize our main contributions below:
• Besides the conventional ViTeGNN-bal inference mode, we propose ViTeGNN-lat and ViTeGNN-thpt,
two inference modes specialized for latency- and throughput-critical TGNN applications.
• We propose a lightweight temporal attention mechanism and a related neighbor pruning strategy,
significantly reducing the computation and memory accesses at inference. We design a knowledge
distillation setup to train our simplified models to ensure comparable accuracy.
• To efficiently execute the three inference modes, we design a unified hardware accelerator on FPGA
with a flexible data path and customized memory organization, which can execute three inference
modes without FPGA reconfiguration. It leads to zero overhead of switching between inference
modes.
• In the proposed hardware design, we develop a novel hardware module to efficiently execute the complex neighbor list updating process. We also propose a hardware mechanism to ensure chronological
vertex updates without sacrificing the computation parallelism.
• To dynamically select the best inference mode at runtime, we propose ViTeGNN-auto, an automatic
inference mode guided by a predictive performance model based on algorithm parameters, design
configurations, and memory characteristics.
• We implement the proposed design on the state-of-the-art FPGA platform Xilinx Alveo U280. Compared with the state-of-the-art implementations on CPU and GPU, ViTeGNN achieves 53.9/26.0/16.1×
speedup and 8.2/4.0/2.5× speedup on ViTeGNN-lat/-bal/-thpt modes, respectively.
Algorithm 7: Conventional Memory-based TGNN Inference
Input   : An incoming edge stream Enew; Current node memory {sv : v ∈ V}; Most recent neighbors {N(v) : v ∈ V}; Vertex feature vectors {fv : v ∈ V}; Edge feature vectors {fe : e ∈ E}
Output  : Embeddings {hu, hv : {u, v} ∈ Enew}
   /* Process Enew in batches chronologically */
 1 foreach batch {e(u, v, fe, te)} ∈ Enew do
       /* update node neighbors */
 2     {N(u)} = {UpdateNeighbor(N(u), v)};
 3     {N(v)} = {UpdateNeighbor(N(v), u)};
       /* compute combined messages */
 4     {mu} = {COMB({MSG(m), m ∈ N′(u)})};
 5     {mv} = {COMB({MSG(m), m ∈ N′(v)})};
       /* update node memory */
 6     {su} = {UPDT(mu, su)};
 7     {sv} = {UPDT(mv, sv)};
       /* compute node embeddings */
 8     {hu} = {GNN((su, fu), {(sz, fz), z ∈ N(u)})};
 9     {hv} = {GNN((sv, fv), {(sz, fz), z ∈ N(v)})};
10     return {hu}, {hv};
11 end foreach
5.3 Conventional M-TGNN Inference
Algorithm 7 shows the conventional batched inference process of memory-based TGNNs. When a batch
of new graph events occurs, only the node embeddings of the directly affected nodes (i.e., source and
destination nodes for new interactions as graph events) are computed and updated.
5.4 Inference Performance Metrics
To quantify the performance of TGNN inference, we formally define our evaluation metrics of throughput
and latency. Since most existing datasets only have new edges as graph signals, we define the throughput
as the number of new edges that can be processed per second. Defining the execution time as the total time
to process the incoming edges, we define throughput as
Inference throughput (E/s) = (# of new edges) / (execution time (s)).    (5.1)
Table 5.1: Number (#) and percentages (%) of thousands of MEMs (kMEM) and thousands of MACs (kMAC),
and the average execution time on CPU and GPU per dynamic node embedding.

Wikipedia
            kMEM              kMAC               Exec. Time (ns)
            #      %          #       %          1 Thread   32 Threads   GPU
sample      0.0    0.3%       0       0%         8          4            4
memory      5.7    99.7%      48.4    6.4%       243        54           25
GNN         0      0%         703.5   93.6%      195        27           3
total       5.7    100%       751.9   100%       446        85           32

Reddit
            kMEM              kMAC               Exec. Time (ns)
            #      %          #       %          1 Thread   32 Threads   GPU
sample      0.1    1.1%       0       0%         9          4            4
memory      5.7    98.9%      48.4    6.4%       214        68           20
GNN         0      0%         703.5   93.6%      249        27           3
total       5.8    100%       751.9   100%       472        99           27
The inference latency is defined as the time between the occurrence of a graph event that affects the
node embedding of a node and the time when the most up-to-date embedding of that node is computed.
The inference latency can be decomposed into two parts: algorithm latency and computation latency.
Algorithm latency refers to the extra latency introduced by the inference algorithm due to lazy updating,
while computation latency refers to the latency of the real computation.
5.5 Case Study: TGN Inference
Table 5.1 shows the complexity and execution time in the three parts. The memory accesses are primarily
in the memory part to access the messages and edge features. The computation is dominated by the GNN
part to aggregate from selected temporal neighbors. For a serial processor (1 CPU), the bottleneck is the
computation in the GNN part. For highly parallel machines (32 CPU threads and GPU), the bottleneck
lies in the memory part that accesses the memory and aggregates the messages. Although the number of
memory operations is not enormous, due to the need to identify the most recent messages, it still becomes
the bottleneck when executed on a highly parallel machine with a complex cache hierarchy.
Algorithm 8: ViTeGNN-lat Affected Node Identification
Input   : A batch of edges {e(u, v)}; Most recent neighbors {N(v) : v ∈ V}; Affected node arrays N^-1(v) and N^-1_valid(v)
Output  : Target nodes to inference {n}
 1 foreach edge e(u, v) ∈ {e(u, v)} do
       /* update node neighbors */
 2     N(u), u_removed = UpdateNeighbor(N(u), v);
 3     N(v), v_removed = UpdateNeighbor(N(v), u);
       /* update affected node arrays */
 4     if N^-1(u) or N^-1(v) is full then
 5         Remove invalid nodes or expand array;
 6     end if
 7     Append v to N^-1(u); unset u in N^-1_valid(u_removed);
 8     Append u to N^-1(v); unset v in N^-1_valid(v_removed);
 9 end foreach
10 return ⋃ {N^-1(n)[N^-1_valid(n)] : n ∈ {e(u, v)}}
5.6 Approach
5.6.1 Inference Algorithms
Since different TGNN applications have different requirements for latency and throughput, it is important
to design different inference algorithms (modes). Besides the conventional TGNN inference algorithm
(Algorithm 7), we propose ViTeGNN-lat and ViTeGNN-thpt, optimizing for latency and throughput,
respectively. For simplicity, we denote the conventional TGNN inference algorithm as ViTeGNN-bal, a
latency-throughput balanced inference mode.
5.6.1.1 ViTeGNN-lat
ViTeGNN-lat is a TGNN inference algorithm that generates node embeddings for all affected nodes in
real-time, designed for latency-critical applications like fraud detection and high-frequency advertising.
When a new edge appears between node u and node v, not only the embeddings of the directly involved
nodes u and v are affected by this graph event but also other nodes {n : u ∈ N (n) ∨ v ∈ N (n)} whose
most recent neighbors include nodes u or v. In ViTeGNN-bal, the node embeddings of these nodes {n}
Algorithm 9: ViTeGNN-lat Inference
Input   : An incoming edge stream Enew; Current node memory {sv : v ∈ V}; Most recent neighbors {N(v) : v ∈ V}; Vertex feature vectors {fv : v ∈ V}; Edge feature vectors {fe : e ∈ E}
Output  : Embeddings of all affected nodes {n}
   /* Process Enew in batches chronologically */
 1 foreach batch {e(u, v, fe, te)} ∈ Enew do
       /* compute combined messages */
 2     {mu} = {COMB({MSG(m), m ∈ N′(u)})};
 3     {mv} = {COMB({MSG(m), m ∈ N′(v)})};
       /* update node memory */
 4     {su} = {UPDT(mu, su)};
 5     {sv} = {UPDT(mv, sv)};
       /* identify affected nodes */
 6     Obtain {n} using Algorithm 8;
       /* compute node embeddings */
 7     {hn} = {GNN((sn, fn), {(sz, fz), z ∈ N(n)})};
 8     return {hn};
 9 end foreach
are not updated until they are directly involved in future graph events, creating a large algorithm latency
unacceptable for latency-critical applications.
The key in ViTeGNN-lat is rapidly identifying {n}. Since the embedding of a node is aggregated from
its most recent neighbors, when the node memory of a node v is changed, the affected nodes are the
nodes whose most recent neighbors contain this node v. Straightforwardly, when a new edge e(u, v, fe, te)
appears, we can first update N (u) and N (v). Then, the target nodes {n} can be identified by scanning
u, v ∈ {N (n) : n ∈ V}. However, this requires access to the most recent neighbor list of all nodes, which
is expensive on large-scale graphs. Here, we propose an efficient dynamic length array-based solution.
Since the numbers of affected nodes are different on different nodes at different times, for each node v, we
maintain a dynamic length array N^-1(v) and an auxiliary array N^-1_valid(v). N^-1(v) contains the indices of
the candidate nodes whose embeddings are possibly affected by a new graph event on v. N^-1_valid(v) enables
lazy updates to N^-1(v) and indicates whether each candidate is valid or not. Algorithm 8 shows the process
to identify affected nodes {n} for a batch of incoming edges. When the current array capacity is full (line 5),
we first try to remove all invalid nodes. If removing the invalid nodes does not create enough space, we
Algorithm 10: ViTeGNN-thpt Inference
/* ▷ Response to embedding queries */
Input   : An incoming edge stream Enew; Cached embeddings {h′v : v ∈ V}
Output  : Requested embeddings {hn}
 1 return {h′n}
/* ▷ Offline update */
Input   : An incoming edge stream Enew; Current node memory {sv : v ∈ V}; Most recent neighbors {N(v) : v ∈ V}; Vertex feature vectors {fv : v ∈ V}; Edge feature vectors {fe : e ∈ E}; Cached embeddings {h′v : v ∈ V}
   /* Process Enew in batches chronologically */
 2 foreach batch {e(u, v, fe, te)} ∈ Enew do
       /* update node neighbors */
 3     {N(u)} = {UpdateNeighbor(N(u), v)};
 4     {N(v)} = {UpdateNeighbor(N(v), u)};
       /* compute combined messages */
 5     {mu} = {COMB({MSG(m), m ∈ N′(u)})};
 6     {mv} = {COMB({MSG(m), m ∈ N′(v)})};
       /* update node memory */
 7     {su} = {UPDT(mu, su)};
 8     {sv} = {UPDT(mv, sv)};
       /* update cached embeddings periodically */
 9     if {h′v} expired then
10         foreach batch {v} ∈ V do
11             {h′v} = {GNN((sv, fv), {(sz, fz), z ∈ N(v)})};
12         end foreach
13     end if
14 end foreach
allocate a new space with double the original capacity and copy the array to the new location. The
N^-1_valid arrays are maintained similarly to the N^-1 arrays. To avoid duplicate nodes, we return the affected nodes
of the current batch as a set.
Algorithm 9 shows the inference process of ViTeGNN-lat. For an incoming batch of new edges, we first
update the node memory of directly involved nodes. Then, we identify all affected nodes and compute their
node embeddings.
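The dynamic-length arrays used by Algorithm 8 can be sketched as follows (a hypothetical Python illustration, not ViTeGNN's implementation): candidates are invalidated lazily, and the backing array is compacted or doubled only when it fills up.

    import numpy as np

    class AffectedNodeArray:
        def __init__(self, capacity=4):
            self.idx = np.empty(capacity, dtype=np.int64)    # candidate affected nodes (N^-1)
            self.valid = np.zeros(capacity, dtype=bool)      # validity flags (N^-1_valid)
            self.size = 0

        def append(self, node):
            if self.size == len(self.idx):
                # first compact by dropping invalidated candidates ...
                keep = self.valid[:self.size]
                self.idx[:keep.sum()] = self.idx[:self.size][keep]
                self.size = int(keep.sum())
                self.valid[:self.size] = True
                self.valid[self.size:] = False
                # ... and only double the capacity if compaction did not free space
                if self.size == len(self.idx):
                    self.idx = np.resize(self.idx, 2 * len(self.idx))
                    self.valid = np.resize(self.valid, len(self.idx))
                    self.valid[self.size:] = False
            self.idx[self.size] = node
            self.valid[self.size] = True
            self.size += 1

        def unset(self, node):
            # lazy invalidation: mark the candidate instead of shrinking the array
            self.valid[:self.size] &= (self.idx[:self.size] != node)

        def affected(self):
            return set(self.idx[:self.size][self.valid[:self.size]].tolist())

    inv = AffectedNodeArray()
    for n in (3, 7, 9):
        inv.append(n)            # nodes 3, 7, 9 currently keep v among their recent neighbors
    inv.unset(7)                 # node 7 evicted v from its neighbor list
    print(inv.affected())        # {3, 9}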
5.6.1.2 ViTeGNN-thpt
ViTeGNN-thpt is a throughput-optimized TGNN inference algorithm designed for applications that operate
on large-scale dynamic graphs like social network graphs. ViTeGNN-thpt maintains a node embedding
cache where requests for node embeddings are directly served by the cached embeddings. The node
neighbor update, node memory update, and node embedding computation process are all performed offline
to maximize throughput. Specifically, to avoid extra memory to cache the incoming edge stream, we update
the node neighbors and node memory immediately after the graph events. The cached node embeddings
are updated periodically using the up-to-date node memory. For example, for content recommendation on
user-item graphs, the node memory is updated in real-time as users make purchases but the user and item
embeddings are updated once a day or once a week.
Algorithm 10 shows the inference process of ViTeGNN-thpt. The periodic update of the cached node embeddings
significantly reduces the computation complexity, since the computation complexity is dominated by the
GNN part to aggregate information from most recent neighbors. Note that the period to re-compute the
node embeddings could be either static or dynamic according to the frequency of the incoming edge streams.
This offline embedding update process could be performed when the system is idle or even be computed by
another device.
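A minimal sketch of this serving pattern is shown below; the model calls are stubbed and all names are illustrative. Queries read the embedding cache, every incoming edge updates the node memory online, and the cache is recomputed only periodically.

    import torch

    class StubTGNN:                                        # stand-in for the M-TGNN model
        def __init__(self, n, d):
            self.memory = torch.zeros(n, d)
        def update_memory(self, src, dst):                 # t_neigh + t_mem (online, per edge)
            self.memory[src] += 1.0
            self.memory[dst] += 1.0
        def compute_embeddings(self, nodes):               # GNN aggregation (offline, periodic)
            return 2.0 * self.memory[nodes]

    n, d, refresh_every = 100, 8, 1000                     # recompute the cache every 1000 edges
    tgnn, cache, seen = StubTGNN(n, d), torch.zeros(n, d), 0

    def query(nodes):                                      # served directly from the cache
        return cache[nodes]

    def on_edge(src, dst):
        global seen
        tgnn.update_memory(src, dst)                       # always performed online
        seen += 1
        if seen % refresh_every == 0:                      # periodic offline refresh of h'
            for batch in torch.arange(n).split(32):
                cache[batch] = tgnn.compute_embeddings(batch)

    for i in range(2000):
        on_edge(i % n, (i + 1) % n)
    print(query(torch.tensor([0, 1])).shape)               # torch.Size([2, 8])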
5.6.1.3 Latency and Throughput Analysis
We analyze the latency and throughput of ViTeGNN-lat, ViTeGNN-bal, and ViTeGNN-thpt, using their
definitions in Section 5.4. Denote B_E as the batch size, in number of edges. The choice of batch size affects not
only the throughput and latency but also the accuracy [79]. In this work, we use a pre-determined optimal
batch size. For better understanding, we use lowercase t to denote the time related to the computation and
uppercase T to denote the time not related to the computation.
Table 5.2: Latency (algorithm, computation, and total) and throughput of the three inference modes of
ViTeGNN.

ViTeGNN-lat
  Algorithm Latency     0
  Computation Latency   t_neigh + t_mem + t_find^lat + t_gnn^lat
  Total Latency         t_neigh + t_mem + t_find^lat + t_gnn^lat
  Throughput            B_E / (t_neigh + t_mem + t_find^lat + t_gnn^lat)

ViTeGNN-bal
  Algorithm Latency     0 to max(T_repeat)
  Computation Latency   t_neigh + t_mem + t_gnn^bal
  Total Latency         t_neigh + t_mem + t_gnn^bal  to  max(T_repeat) + t_neigh + t_mem + t_gnn^bal
  Throughput            B_E / (t_neigh + t_mem + t_gnn^bal)

ViTeGNN-thpt
  Algorithm Latency     T_interval
  Computation Latency   t_lookup^thpt
  Total Latency         T_interval + t_lookup^thpt
  Throughput            B_E / (t_neigh + t_mem)
Table 5.2 shows the latency (algorithm and computation latency) and throughput comparison of
ViTeGNN-lat, -bal, and -thpt. For ViTeGNN-lat, there is no algorithm latency since the node embeddings
of all affected nodes are updated immediately after an event happens. Note that we do not include the
latency due to the batch processing of the incoming edge stream, as the trade-off between performance and
batch size is out of the scope of this work. The computation latency of ViTeGNN-lat includes the time to
update the most recent neighbors t_neigh, to update the node memory t_mem, to find all affected nodes
t_find^lat, and to compute their embeddings t_gnn^lat. We do not use the superscript “lat” for t_neigh and
t_mem since these times are the same in ViTeGNN-lat, ViTeGNN-bal, and ViTeGNN-thpt. Since these
operations are all done online, the throughput of ViTeGNN-lat is simply B_E divided by the computation
latency. For ViTeGNN-bal, the algorithm latency ranges from 0 to max(T_repeat), where {T_repeat} refers
to the times between two consecutive interactions on the nodes, since the node embedding of a node is only
updated when there is a new interaction on that node. The computation latency of ViTeGNN-bal includes
t_neigh, t_mem, and the time to update the embeddings of the source and destination nodes t_gnn^bal.
t_gnn^bal is usually significantly smaller than
t_gnn^lat, as the number of nodes whose embeddings must be computed is orders of magnitude smaller in
ViTeGNN-bal than in ViTeGNN-lat. Similarly, the throughput of ViTeGNN-bal is B_E divided by the
computation latency. For ViTeGNN-thpt, the algorithm latency is the user-set time interval to recompute
the cached node embeddings, T_interval. The computation latency is only the time to look up the cached
embedding table, t_lookup^thpt. Since the node memory is updated in an online fashion, the throughput of
ViTeGNN-thpt is B_E divided by the time of all online operations, t_neigh + t_mem.
5.6.2 Model Optimizations
We use our case study on M-TGNN inference to identify three key points in designing an optimized inference
system:
1. GNN computation accounts for more than 80% of the total time and is the bottleneck on a single
CPU core with more than 50% of the time spent on computing the attention scores. It is linear in the
number of supporting temporal neighbors.
2. The time encoding maps a scalar time interval to a vector which is further multiplied with weight
matrices Wir, Wiz, Win, Wq, Wk, and Wv. These vector-matrix multiplications can be removed if
we can reverse the computation order.
3. On a massively parallel architecture, the key bottleneck is fetching and updating the node memory
and messages of the supporting nodes from and to external memory.
Based on these key points, we propose an optimized TGNN model by exploiting FPGA-specific features.
FPGAs consist of massive on-chip memory – Block RAMs (BRAMs) and Ultra RAMs (URAMs) that allow
fast random memory access in high bandwidth. The built-in DSPs can perform large numbers of arithmetic
operations in every cycle. Based on these features, we propose a simplified temporal attention mechanism
(Section 5.6.2.1) and a temporal neighbor pruning strategy (Section 5.6.2.2) which reduce the execution time
in DSPs and enable data pre-fetching. In addition, the DSPs of FPGAs can be programmed into computation
arrays that enable fine-grained parallelism for batched processing. We replace the time encoding and the
subsequent vector-matrix multiplications with look-up tables (LUTs) (Section 5.6.2.3) which are stored in
the programmable on-chip memory of FPGAs for fast accesses.
5.6.2.1 Simplified Temporal Attention
The traditional temporal attention mechanism requires vector-vector multiplication among neighbors,
which consumes a lot of DSPs on FPGAs. We note that temporal neighbors in dynamic graphs can be
naturally ordered by timestamp. We leverage this unique characteristic to design a simplified attention
aggregator that operates on fixed-length lists of n timestamp-sorted (not necessarily unique) temporal
neighbors. Given a node u at timestamp t^u with n sorted temporal neighbors at respective timestamps
t^{v_0} ≤ t^{v_1} ≤ · · · ≤ t^{v_{n−1}}, we compute its attention score as

α′(u) = Softmax(a + W_t Δt^u),    (5.2)

where a is a learnable constant attention vector shared among all nodes and W_t is a learnable weight
matrix that maps the node-specific time differences Δt^u = [t^u − t^{v_0}, · · · , t^u − t^{v_{n−1}}] to respective offsets of
the attention logits. The intuition is that on a dynamic graph, attention scores should be sensitive to the
chronology of neighbors. Since each node u has a specific neighbor interaction frequency, we use this
node-specific offset to produce its attention score. Our simplified attention mechanism eliminates the
vector-vector multiplication operations, which dramatically saves the DSP usage on FPGAs.
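To make the mechanism concrete, the following is a minimal PyTorch-style sketch of Equation (5.2); the module name, tensor layouts, and the bias-free W_t are illustrative assumptions rather than the exact implementation used in ViTeGNN.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedTemporalAttention(nn.Module):
    """Sketch of Eq. (5.2): the attention logits depend only on the time differences
    of the n timestamp-sorted temporal neighbors, not on keys and queries."""
    def __init__(self, n_neighbors, dim_in, dim_out):
        super().__init__()
        self.a = nn.Parameter(torch.zeros(n_neighbors))          # shared constant logits
        self.w_t = nn.Linear(n_neighbors, n_neighbors, bias=False)  # maps delta-t to logit offsets
        self.w_v = nn.Linear(dim_in, dim_out)                    # value projection (only V is needed)

    def forward(self, delta_t, neighbor_feats):
        # delta_t: [batch, n_neighbors]; neighbor_feats: [batch, n_neighbors, dim_in]
        logits = self.a + self.w_t(delta_t)                      # alpha'(u) before softmax
        alpha = F.softmax(logits, dim=-1)                        # attention scores
        values = self.w_v(neighbor_feats)                        # compute V only
        return torch.einsum("bn,bnd->bd", alpha, values)         # weighted aggregation
```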
To learn a and W_t, we apply a simplified knowledge distillation [41] setup in which we train student
models (with our simplified temporal attention aggregators) under both self-supervision from temporal
edges and supervision from a teacher model with the vanilla temporal attention aggregator. We add an
additional soft cross-entropy loss L_a between the simplified attention logits α′ = a + W_t·Δt and the vanilla
attention logits α to encourage the student models to mimic the behavior of the teacher model:

L_a = − Σ_v Softmax(α′(v)/T) · Softmax(α(v)/T),    (5.3)

where T is the temperature that controls how much the student model learns from the teacher model.
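The distillation term can be written as a short function; the sketch below directly transcribes Equation (5.3), with the tensor shapes and the default temperature chosen only for illustration.

```python
import torch.nn.functional as F

def attention_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft loss L_a of Eq. (5.3); logits have shape [num_nodes, n_neighbors]."""
    p_student = F.softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return -(p_student * p_teacher).sum()
```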
5.6.2.2 Temporal Neighbor Pruning
The Transformer attention mechanism computes the attention scores after the computation of the keys K
and queries Q. In contrast, our simplified temporal attention mechanism computes the attention scores
only using the time differences Δt of the temporal neighbors as input. This allows the model to quickly
determine which temporal neighbors are more important before fetching the hidden features of all of them.
Although the amount of computation and the number of memory accesses are the same for each temporal
neighbor, the neighbors with higher attention scores contribute more to the output, which naturally leads
to our attention score-based temporal neighbor pruning strategy. Under the simplified attention mechanism,
where only the values V need to be computed, temporal neighbor pruning directly leads to a
linear reduction in computation and memory accesses. For a given pruning budget (the number of temporal
neighbors to aggregate from), after computing the logits of the simplified attention scores, we apply the
softmax function only on the temporal neighbors with the top logit values and compute their V. For multi-head
attention, the memory accesses of a temporal neighbor can only be avoided if this neighbor is pruned
in all attention heads. To ensure the same neighbors are pruned in each attention head, we perform the
neighbor selection using the attention logits averaged over all attention heads.
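A minimal sketch of the head-consistent selection step is shown below; the function name and shapes are assumptions, and the caller is expected to compute V and the softmax only for the returned neighbors.

```python
import torch

def select_neighbors(logits_per_head, budget):
    """Attention-score-based neighbor pruning: keep the `budget` neighbors with the
    highest attention logits averaged over all heads, so every head prunes the same set.
    logits_per_head: [num_heads, n_neighbors]."""
    avg_logits = logits_per_head.mean(dim=0)
    keep = torch.topk(avg_logits, k=budget).indices
    return keep  # indices of the retained temporal neighbors
```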
Figure 5.1: Frequency of input ∆t of the time encoder on the Wikipedia (left) and Reddit (right) datasets.
5.6.2.3 Time Encoding Look-Up-Table
The time encoder maps scalar time frames to vectors. These vectors are later multiplied with the weight
matrices in the GRU memory updater and the GNN aggregator. This whole process accounts for around
30% of the total computation in the TGNN model with our simplified attention mechanism. It can
be completely avoided if the computation order is reversed and the vector-matrix multiplication is pre-computed.
However, the time encoding process involves a nonlinear trigonometric function that does not
permit pre-computation. To solve this problem, we replace the time encoding process with lookup table
(LUT) operations, which transform the nonlinear operations into piece-wise linear ones. We analyze the input
Δt of the time encoder and observe that it follows a power law, where most inputs are close to 0. Based
on the intuition that the output vectors should distinguish different lengths of time frames, we divide the
range of the input Δt into 128 intervals with an equal number of Δt occurrences in each interval. The output
time encoding vectors in each interval are stored in one entry of the LUT, which is learned in the training
process. At inference, we pre-compute the product of each LUT entry with the weight matrices
and store the results in the fast on-chip memory. Our LUT-based time encoder can directly output the hidden
features after weight application for any given time frame within 1 clock cycle.
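The sketch below illustrates the inference-time behavior under simplifying assumptions: equal-frequency interval boundaries are derived from observed Δt samples, and each LUT entry is the product of a given time encoding and a downstream weight matrix; in ViTeGNN the entries themselves are learned during training.

```python
import numpy as np

def build_time_lut(delta_t_samples, encode, weight, num_entries=128):
    """Split the observed delta-t range into `num_entries` equal-frequency intervals and
    pre-multiply one encoding per interval with the weight matrix (both assumed given)."""
    boundaries = np.quantile(delta_t_samples, np.linspace(0.0, 1.0, num_entries + 1))
    centers = 0.5 * (boundaries[:-1] + boundaries[1:])
    lut = np.stack([encode(c) @ weight for c in centers])   # [num_entries, dim_out]
    return boundaries, lut

def lookup_time_feature(delta_t, boundaries, lut):
    """One table lookup replaces the trigonometric encoding and the vector-matrix product."""
    idx = np.searchsorted(boundaries[1:-1], delta_t)
    return lut[idx]
```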
Figure 5.2: Overview of hardware architecture (left) and the mapping of the architecture on the Xilinx
Alveo U280 board (right).
5.6.3 Hardware Mapping and Optimizations
5.6.3.1 Overview of Hardware Architecture
In Figure 5.2, the left-hand side shows the overview of the proposed hardware architecture, and the right-hand side shows the hardware mapping on an FPGA platform. The input edge stream is sent from the host
processor to the Direct Memory Access Unit (DMA) of the FPGA. Then, the Edge Parser parses the input edges
and sends the edge information to the Neighbor Update Unit (NUU) to update the vertex neighbor information.
The vertex memory {s_u : u ∈ G} and the vertex embeddings are stored in the External DDR Memory of the FPGA,
which usually has a large capacity and is extendable. The vertex neighbor list {N(u) : u ∈ G} and the vertex
inverse neighbor list {(N⁻¹_valid(u), N⁻¹(u)) : u ∈ G} are stored in the on-chip high bandwidth memory
(HBM). We employ HBM because (1) HBM provides multiple independent memory channels that deliver
massive memory bandwidth (e.g., 460 GB/s); as the operations that update the neighbor lists are memory-bound,
HBM enables fast and efficient neighbor list updating. (2) HBM has a significantly larger memory
Figure 5.3: Diagram of Neighbor Update Unit (NUU).
capacity (e.g., 8 GB on Xilinx Alveo U280) than the on-chip static memory of the FPGA, so it can store the
neighbor lists of a large number of nodes.
The Neighbor Update Unit (NUU) is connected to HBM to perform the neighbor list updating operations
of the three inference modes (Algorithm 10). The Vertex Memory Loader and Vertex Memory Updater
perform data loading and storing from/to the external DDR memory of the FPGA. There are multiple parallel
Compute Units (CUs). Each Compute Unit performs the computation of the GRU cell, calculates the
simplified temporal attention, and uses the obtained attention scores to calculate the embedding vectors of
the nodes.
5.6.3.2 Neighbor Update Unit (NUU)
The Neighbor Update Unit consists of three components: (1) Neighbor List Shift Unit, (2) Neighbor List
Compact Unit, and (3) Neighbor List Scan Unit.
Neighbor List Shift Unit: For an incoming edge (u, v), the neighbor lists of u and v are updated, where u
becomes the most recent neighbor of v in N(v), and v becomes the most recent neighbor of u in N(u).
To update N(v) and N(u), the Neighbor List Shift Unit has a sequence of shifting registers that insert the
most recent neighbors into N(v) and N(u) and shift out the oldest neighbors in N(v) and N(u). Suppose
the length of the shifting registers is P_shift; the Neighbor List Shift Unit then takes ⌈|N(v)| / P_shift⌉ cycles to update the
neighbor list of a node.
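Functionally, each most-recent-neighbor list behaves like a fixed-length shift register; the software sketch below (plain Python, with an assumed list length d_mr) mirrors that behavior.

```python
from collections import deque

neighbor_lists = {}   # vertex id -> fixed-length most-recent-neighbor list N(u)

def update_neighbor_lists(u, v, d_mr=10):
    """For an incoming edge (u, v), v becomes the most recent neighbor in N(u) and
    vice versa; when a list is full, the oldest neighbor is shifted out."""
    neighbor_lists.setdefault(u, deque(maxlen=d_mr)).append(v)
    neighbor_lists.setdefault(v, deque(maxlen=d_mr)).append(u)
```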
Neighbor List Compact Unit: In lines 4-5 of Algorithm 10, when the inverse neighbor lists N⁻¹(v)
and N⁻¹(u) become full, the invalid nodes are removed from N⁻¹(v) and N⁻¹(u). The validness of a
node in N⁻¹(v) and N⁻¹(u) is indicated by N⁻¹_valid(v) and N⁻¹_valid(u), respectively. We call the process of
removing invalid nodes array compaction. To perform the array compaction operation, the Neighbor List Compact
Unit is organized as multiple stages, as shown in Figure 5.3. Suppose that during each cycle, P_compact
nodes are loaded from N⁻¹(v) and N⁻¹_valid(v). In each stage, the valid array indicates the validness of the
corresponding nodes, and the prefix-sum array indicates the number of invalid nodes before the current node.
Use Prefix-sum[j] to denote the j-th element in the prefix-sum array. At the i-th stage, if the (i − 1)-th bit of
Prefix-sum[j] is equal to 1, the j-th element is shifted left by 2^(i−1) positions. To efficiently perform the array
compaction operation, the Neighbor List Compact Unit has log₂(P_compact) stages. It takes ⌈|N⁻¹(v)| / P_compact⌉ clock
cycles to process N⁻¹(v) and N⁻¹_valid(v) (note that |N⁻¹(v)| = |N⁻¹_valid(v)|).
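The staged shifting can be checked with a small software model; the sketch below follows the prefix-sum scheme above (shift each valid element left by one bit of its prefix count per stage) and is only a functional illustration of the hardware behavior.

```python
def array_compact(nodes, valid):
    """Pack the valid entries of `nodes` to the front while preserving their order.
    prefix[j] counts the invalid entries before position j; at stage s, an element
    whose prefix count has bit s set moves left by 2**s positions."""
    n = len(nodes)
    prefix = [0] * n
    for j in range(1, n):
        prefix[j] = prefix[j - 1] + (0 if valid[j - 1] else 1)
    slots = [(nodes[j], prefix[j]) if valid[j] else None for j in range(n)]
    stage = 0
    while (1 << stage) < n:                      # log2(n) stages, as in the hardware
        nxt = [None] * n
        for j, slot in enumerate(slots):
            if slot is None:
                continue
            node, shift = slot
            if (shift >> stage) & 1:
                nxt[j - (1 << stage)] = (node, shift)
            else:
                nxt[j] = (node, shift)
        slots = nxt
        stage += 1
    return [s[0] for s in slots if s is not None]

# e.g., array_compact(list("abcdefgh"), [1, 1, 0, 1, 0, 0, 1, 1]) -> ['a', 'b', 'd', 'g', 'h']
```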
Neighbor List Scan Unit: The Neighbor List Scan Unit performs the unset operation on N⁻¹_valid(u) to mark the
node u_removed as invalid. To this end, the Neighbor List Scan Unit is organized as P_scan parallel comparators
that compare each node in N⁻¹(u) with u_removed. If the node index is equal to u_removed, it is unset to
false. Note that multiple entries can equal u_removed concurrently, and we only need to unset one entry. To this
end, we implement a hardware Arbiter to ensure that only one entry is unset at a time. When one
entry is unset in N⁻¹_valid(·), the rest of the list N⁻¹_valid(u) can be skipped. Suppose the computation parallelism
of the Neighbor List Scan Unit is P_scan; it then takes at most ⌈|N⁻¹(u)| / P_scan⌉ cycles to process N⁻¹(u).
Setting P_shift, P_compact, and P_scan: There are several considerations for setting P_shift, P_compact, and P_scan:
(1) to match the throughput of the Shift Unit, Compact Unit, and Scan Unit, we can let P_shift = P_compact = P_scan;
(2) to be hardware-friendly, we can let P_shift, P_compact, and P_scan be powers of 2.
5.6.3.3 Compute Unit (CU)
There are multiple parallel Compute Units (CUs); each performs vertex memory updating and vertex embedding
generation. Each CU has an input buffer and an output buffer to store the input and output data,
respectively. Each Compute Unit has (1) S_MAC parallel Multiply-Accumulate Arrays, each of
size S_g1 × S_g2, (2) an Aggregation Module to perform the feature aggregation when calculating the vertex
embedding, (3) an Activation Module to perform element-wise activation functions, and (4) a Time Encoding
Lookup Table to store the precomputed time encoding vectors.
Vertex Memory Updating: The updated vertex memory is calculated using a GRU cell. The GRU has
an Update Gate, a Reset Gate, a Memory Gate, and a Merging Gate. The Update Gate, Reset Gate,
and Memory Gate involve matrix multiplications between the vertex messages and the weight matrices.
These matrix multiplications are efficiently executed by the Multiply-Accumulate Arrays of the Compute
Unit. The element-wise activation functions sigmoid(·) and tanh(·) of the Activation Module are implemented
as sigmoid(x) = 1/(1 + hls::exp(−x)) and tanh(x) = (1 − hls::exp(−2x))/(1 + hls::exp(−2x)), where
hls::exp(·) is from the FPGA vendor library.
Vertex Embedding Generation: The Compute Unit generates the vertex embedding vectors using a
one-layer GNN model. The Aggregation Module performs the feature aggregation h_v = aggregate{α(u) · s_u :
u ∈ N(v) ∪ {v}}. The Multiply-Accumulate Arrays and the Activation Module perform the feature transformation
h_v = transform(h_v, s_v, W_v).
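A rough functional sketch of one Compute Unit is given below; the GRUCell, the ReLU transform, and the way the self node enters the aggregation are simplifying assumptions made only to show the data flow (memory update, attention-weighted aggregation, transformation), not the exact hardware datapath.

```python
import torch
import torch.nn as nn

class ComputeUnitSketch(nn.Module):
    """Functional sketch of a CU: GRU-based vertex memory update followed by
    one-layer embedding generation (dimensions are placeholders)."""
    def __init__(self, dim_msg, dim_mem, dim_emb):
        super().__init__()
        self.gru = nn.GRUCell(dim_msg, dim_mem)      # update/reset/memory gates
        self.transform = nn.Linear(dim_mem, dim_emb)

    def forward(self, message, memory, neighbor_memory, alpha):
        # message: [B, dim_msg]; memory: [B, dim_mem]
        # neighbor_memory: [B, n, dim_mem]; alpha: [B, n] attention scores
        new_memory = self.gru(message, memory)                     # vertex memory update
        agg = torch.einsum("bn,bnd->bd", alpha, neighbor_memory)   # feature aggregation
        embedding = torch.relu(self.transform(agg + new_memory))   # feature transformation
        return new_memory, embedding
```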
Figure 5.4: Updater using a fully-associative cache with rotating pointers (Ncu = 2).
5.6.3.4 Vertex Memory Updater
The Vertex Memory Updater (1) receives the updated vertex memory from the Compute Units, (2) writes the
vertex memory back to the external memory, (3) ensures the chronological order of the updated vertex memory,
and (4) eliminates redundant vertex memory updates. To ensure chronological order, the affected vertices
are assigned to the CUs in a Round-Robin style. Similarly, the Updater receives the updated vertex memory
from the CUs in the same Round-Robin order. Figure 5.4 shows the diagram of the Vertex Memory Updater.
The Updater is organized as a fully-associative cache with rotating pointers. Each cache line stores one
vertex memory, the index of the vertex, and a flag bit that indicates whether the current cache line is valid.
When the Updater receives vertex information from multiple CUs concurrently, the chronological order of
the multiple updates is ensured by the relative positions of the write pointers; each write pointer points to
the write position of one CU. The commit pointer scans multiple consecutive cache lines at a time to check
whether a valid cache line can be sent back to the external memory. When newly updated vertex information
is received by the Updater, its vertex index is compared with the vertex index (vid) of each cache line. If an
uncommitted cache line has the same vertex index as the new vertex information, that uncommitted cache
line is invalidated.
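The de-duplication behavior of the Updater can be summarized with a small software model; the class below is a functional sketch only, and it assumes the cache never fills up between commits (the hardware would stall the CUs in that case).

```python
class VertexMemoryUpdaterSketch:
    """Fully-associative cache with rotating pointers: a newer update for the same
    vertex invalidates any older, uncommitted update before it is written back."""
    def __init__(self, num_lines):
        self.lines = [None] * num_lines          # each line: (vid, memory) or None
        self.write_ptr = 0
        self.commit_ptr = 0

    def receive(self, vid, memory):
        for i, line in enumerate(self.lines):    # invalidate stale update for this vertex
            if line is not None and line[0] == vid:
                self.lines[i] = None
        self.lines[self.write_ptr] = (vid, memory)
        self.write_ptr = (self.write_ptr + 1) % len(self.lines)

    def commit(self, external_memory):
        line = self.lines[self.commit_ptr]       # write one valid line back, in order
        if line is not None:
            external_memory[line[0]] = line[1]
            self.lines[self.commit_ptr] = None
        self.commit_ptr = (self.commit_ptr + 1) % len(self.lines)
```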
5.6.4 Performance Model-Guided Dynamic Mode Selection
Performance prediction is usually easier and more accurate on FPGA than on GPU due to the more deterministic
scheduling and execution. Taking advantage of this, we first build accurate performance models for the three
inference modes ViTeGNN-lat/-bal/-thpt. Then, we discuss the approach to dynamically select the best
inference mode (ViTeGNN-auto), guided by these performance models. We define the following notations:
• Algorithm hyperparameters: The lengths of the vertex feature vector, message, memory, most
recent neighbor list, and embedding of a vertex are denoted as f_feat, f_mail, f_mem, d_mr = |N(v)|, and
f_emb, respectively. Each data element is Z_d bytes.
• Hardware hyperparameters: P_shift, P_compact, and P_scan denote the data parallelism of the Neighbor List Shift Unit,
Compact Unit, and Scan Unit. Suppose the Neighbor Update Unit (NUU) has S_nuu copies of the Neighbor
List Shift Unit, Compact Unit, and Scan Unit. We set S_nuu to the number of memory banks of
HBM and P_shift = P_compact = P_scan = P_nuu, where P_nuu is selected to match the data rate of an HBM memory
bank. S_cu denotes the number of CUs. S_MAC denotes the number of Multiply-Accumulate Arrays in
a Compute Unit, and S_g1 × S_g2 denotes the computation parallelism of each Multiply-Accumulate
Array. S_AM denotes the computation parallelism of the Aggregation Module. F_freq denotes the
operating frequency of the accelerator.
• Application hyperparameters: We use G(V, E) to denote the graph, where |V| denotes the total
number of vertices and |E| denotes the total number of edges. We use B_E to denote the number of
edges in a batch, and d (d_mr ≤ d) to denote the average degree of the graph.
5.6.4.1 ViTeGNN-lat
The input is a batch of incoming edges E_new with batch size |E_new| = B_E. The execution of ViTeGNN-lat
involves four steps: (1) update the neighbor lists of the vertices incident on E_new, denoted as
V_new (where |V_new| = 2B_E); (2) update the node memory of each vertex in V_new; (3) calculate
the affected vertex list, denoted as V_affected, where |V_affected| = 2dB_E; and (4) calculate the vertex
embeddings of the affected vertices. The latencies of the four steps are denoted as t_neigh (step 1), t_mem (step 2), t_find^lat
(step 3), and t_gnn^lat (step 4), respectively. The latency of each step can be derived as:
t_neigh = ⌈2B_E / S_nuu⌉ × ⌈d_mr / P_nuu⌉ × (1 / F_freq)
t_mem = 3 × ⌈2B_E / (S_g1 · S_MAC)⌉ × f_mail × ⌈f_mem / S_g2⌉ × (1 / F_freq)
t_find^lat = ⌈2B_E / S_nuu⌉ × ⌈d / P_nuu⌉ × (1 / F_freq)
t_gnn^lat = max( 2d·B_E·d_mr·(f_feat + f_mem) / S_AM ,
                 ⌈2d·B_E / (S_g1 · S_MAC)⌉ × (f_mem + f_feat) × ⌈f_emb / S_g2⌉ × (1 / F_freq) )    (5.4)
Note that in t_gnn^lat, we use 2dB_E to approximate the number of affected vertices |V_affected|. In real-world applications,
|V_affected| can vary with the incoming edges, leading to inaccurate performance estimation. To facilitate accurate
estimation of |V_affected|, we record the |V_affected| of the prior l batches:
{|V_affected|_{t−l}, |V_affected|_{t−l+1}, ..., |V_affected|_{t−1}}. Then, for the current batch, |V_affected|_t is estimated as
(1/l) Σ_{i=t−l}^{t−1} |V_affected|_i. Therefore, t_gnn^lat is predicted through:

t_gnn^lat = max( |V_affected|_t · d_mr · (f_feat + f_mem) / S_AM ,
                 ⌈|V_affected|_t / (S_g1 · S_MAC)⌉ × (f_mem + f_feat) × ⌈f_emb / S_g2⌉ × (1 / F_freq) )    (5.5)
Finally, the latency and throughput of ViTeGNN-lat can be expressed as:

Θ_lat^compute = t_neigh + t_mem + t_find^lat + t_gnn^lat
Φ_lat = B_E / (t_neigh + t_mem + t_find^lat + t_gnn^lat)    (5.6)
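For reference, the sketch below transcribes Equations (5.4)-(5.6) as a Python function; the argument names follow the notation above, the simple 2dB_E estimate of |V_affected| is used instead of the moving average of Equation (5.5), and all parameter values must be supplied by the user.

```python
from math import ceil

def vitegnn_lat_performance(B_E, d, d_mr, f_feat, f_mail, f_mem, f_emb,
                            S_nuu, P_nuu, S_MAC, S_g1, S_g2, S_AM, F_freq):
    """Latency (seconds) and throughput (edges/second) of ViTeGNN-lat, Eqs. (5.4)-(5.6)."""
    t_neigh = ceil(2 * B_E / S_nuu) * ceil(d_mr / P_nuu) / F_freq
    t_mem = 3 * ceil(2 * B_E / (S_g1 * S_MAC)) * f_mail * ceil(f_mem / S_g2) / F_freq
    t_find = ceil(2 * B_E / S_nuu) * ceil(d / P_nuu) / F_freq
    n_affected = 2 * d * B_E                     # approximation of |V_affected|
    t_gnn = max(n_affected * d_mr * (f_feat + f_mem) / S_AM,     # first term as written in Eq. (5.4)
                ceil(n_affected / (S_g1 * S_MAC)) * (f_mem + f_feat)
                * ceil(f_emb / S_g2) / F_freq)
    latency = t_neigh + t_mem + t_find + t_gnn
    return latency, B_E / latency
```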
5.6.4.2 ViTeGNN-bal
Different from ViTeGNN-lat, ViTeGNN-bal does not include step 3 and only calculates the embeddings of the
vertices in V_new. The latency of ViTeGNN-bal consists of (1) t_neigh: the latency of updating the neighbor lists
of the vertices in V_new, (2) t_mem: the latency of updating the node memory of V_new, and (3) t_gnn^bal: the latency
of calculating the embeddings of V_new. While t_neigh and t_mem are the same as in Equation 5.4, t_gnn^bal can be
expressed as:

t_gnn^bal = max( 2B_E·d_mr·(f_feat + f_mem) / S_AM ,
                 ⌈2B_E / (S_g1 · S_MAC)⌉ × (f_mem + f_feat) × ⌈f_emb / S_g2⌉ × (1 / F_freq) )    (5.7)
Then, the latency and throughput of ViTeGNN-bal can be expressed as:

Θ_bal^compute = t_neigh + t_mem + t_gnn^bal
Φ_bal = B_E / (t_neigh + t_mem + t_gnn^bal)    (5.8)
5.6.4.3 ViTeGNN-thpt
ViTeGNN-thpt directly returns the vertex embeddings using cached embedding vectors stored in the external
DDR memory of the FPGA. We use BW to denote the memory bandwidth of the FPGA external DDR memory.
The latency of returning the vertex embeddings can be derived as:

Θ_thpt^compute = 2B_E · f_emb · Z_d / BW    (5.9)
While ViTeGNN-thpt directly returns the vertex embeddings for the incoming edge batch E_new, it also
concurrently updates the vertex neighbor lists and vertex memory, which has a latency of t_neigh + t_mem.
The latencies t_neigh and t_mem are the same as in ViTeGNN-lat and ViTeGNN-bal. Therefore, the throughput
of ViTeGNN-thpt can be expressed as:

Φ_thpt = B_E / (t_neigh + t_mem)    (5.10)
5.6.4.4 ViTeGNN-auto
Denote the time duration of an incoming batch of B_E edges as t_B. In real-world TGNN applications, t_B may
vary significantly. For example, an item recommendation system in an online retail store may receive orders
of magnitude more incoming user-item pairs during a sale event such as Black Friday, and a fraud
detection system on a trading platform also needs to handle periodic changes in trading volumes. This
motivates the development of a dynamic inference algorithm that achieves the best total latency while
satisfying the actual throughput requirement.
For a given batch, the inference system can keep up with the incoming stream if the predicted
throughput Φ is at least the incoming edge rate B_E/t_B. Since t_gnn^lat > t_gnn^bal, the throughputs of ViTeGNN-lat, ViTeGNN-bal, and ViTeGNN-thpt
satisfy Φ_lat < Φ_bal < Φ_thpt. For latency, since T_interval is far greater than T_repeat and the computation
latency, Θ_thpt is larger than Θ_bal and Θ_lat. For the comparison between Θ_bal and Θ_lat, they share a common
overhead t_neigh + t_mem to update the node memory of the directly involved nodes. The algorithm latency
of ViTeGNN-bal is 0 for the directly involved nodes and T_repeat for the others. For ViTeGNN-lat, since we
further break the identified affected nodes into smaller batches, we can put the directly involved nodes in the
first batch to ensure they have similar latency as in ViTeGNN-bal. Therefore, the total latencies of ViTeGNN-lat,
ViTeGNN-bal, and ViTeGNN-thpt satisfy Θ_lat < Θ_bal < Θ_thpt.
Enabled by our flexible hardware design, we propose ViTeGNN-auto that automatically selects the best
inference mode at runtime. ViTeGNN-auto adopts a simple yet effective strategy that selects the inference
mode with the lowest latency that keeps up with the throughput of the incoming edge stream. Specifically, at
Table 5.3: Dataset statistics. The max(t) column shows the maximum edge timestamp (the minimum edge
timestamp is 0 in all datasets). |d_v| and |d_e| show the dimensions of the node features and edge features,
respectively.

Dataset     |V|      |E|        max(t)   |d_v|  |d_e|
Wikipedia   9,227    157,474    2.7e6    -      172
Reddit      10,984   672,447    2.7e6    -      172
MOOC        7,144    411,749    2.6e7    -      -
Flights     13,169   1,927,145  1.0e7    -      -
GDELT       8,160    3,795,261  1.6e8    413    130
the beginning of each batch, the throughput of the current batch is calculated as Φ_batch = B_E/t_batch,
where t_batch is the time duration of the current batch. Then, ViTeGNN-auto applies ViTeGNN-lat,
ViTeGNN-bal, or ViTeGNN-thpt if Φ_batch ≤ Φ_lat, Φ_lat < Φ_batch ≤ Φ_bal, or Φ_batch > Φ_bal, respectively.
Since their total latency is strictly increasing in this order, ViTeGNN-auto achieves the minimum total latency in all
batches.
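The selection rule itself is only a comparison against the two predicted throughputs; a minimal sketch is shown below, where phi_lat and phi_bal come from the performance models above and phi_batch = B_E / t_batch is measured at the start of each batch.

```python
def select_inference_mode(phi_batch, phi_lat, phi_bal):
    """ViTeGNN-auto: choose the lowest-latency mode whose predicted throughput
    keeps up with the observed arrival rate of the incoming edge stream."""
    if phi_batch <= phi_lat:
        return "ViTeGNN-lat"
    elif phi_batch <= phi_bal:
        return "ViTeGNN-bal"
    else:
        return "ViTeGNN-thpt"
```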
5.7 Experiments
We evaluate the performance of ViTeGNN with the temporal link prediction task and dynamic node
classification task on five datasets:
• Wikipedia [56] is a bipartite user-internet page graph with graph events of users editing Wikipedia
pages. The nodes represent users and Wikipedia pages while the edges represent the editing. The
edges have features extracted from the editing text. The temporal link prediction task is to predict
whether a user will edit a Wikipedia page. The dynamic node classification task is to predict whether
a user is banned from editing.
• Reddit [56] is a bipartite user-Reddit graph with graph events of users posting to sub-Reddits. The
nodes represent users and sub-Reddits while the edges represent the posts. The edges have features
extracted from the posts. The temporal link prediction task is to predict whether a user will post on a
sub-Reddit. The dynamic node classification task is to predict whether a user is banned from posting.
• GDELT [131] is a temporal knowledge graph tracking global events. The nodes represent actors
while the edges represent global events. The nodes have features extracted from the CAMEO codes
attached to the actors. The edges also have features extracted from the CAMEO codes attached to
the events. The temporal link prediction task is to predict whether there will be an event between
two actors. The dynamic node classification task is to predict the locations (countries) of the actors.
• MOOC [56] is a bipartite user-course action graph with graph events of users interacting with class
items (e.g., watching a video, answering a question). The MOOC graph is featureless. The temporal
link prediction task is to predict whether a user will interact with a class item. There is no dynamic
node classification task due to missing node-level labels.
• Flights [73] is a traffic graph representing airline routes. The nodes represent airports while the
edges represent flights between two airports. The Flights graph is featureless. The temporal link
prediction task is to predict whether there will be a flight between two airports. There is no dynamic
node classification task due to missing node-level labels.
Table 5.3 shows the statistics of the five datasets. We follow the original training, validation, and test
set splits on all five datasets. The TGNN models are trained in a self-supervised manner using the temporal edges
on the training set. In the training process, we reverse the order to update node memory and compute
node embeddings to avoid the information leak problem [79]. For the temporal link prediction task, at
inference, we predict the next interaction using the node memory updated by the current interaction.
For the dynamic node classification task, since information leak is not an issue, we simply use the node
embeddings computed by the most up-to-date node memory.
Table 5.4 shows the specifications of the hardware platforms used in this work. We compare the
performance of ViTeGNN with state-of-the-art data center CPU and GPU, which both use more advanced
processes and draw more power than our FPGA. Note that the selected FPGA platform Xilinx Alveo U280 is
Table 5.4: Specifications of the hardware platforms.

                     CPU                    GPU               FPGA
Device               Dual AMD EPYC 9654     Nvidia 6000 Ada   Xilinx Alveo U280
Peak Performance     10.7 TFLOPs            91.1 TFLOPs       0.5 TFLOPs
Technology           TSMC 5nm               TSMC 5nm          TSMC 16nm
Frequency            2.4-3.55 GHz           0.91-2.59 GHz     200 MHz
TDP                  720 W                  300 W             50 W
# of Units           2 Sockets              1 Die             3 Dies (SLRs)
Resources Per Unit   96 Cores, 192 Threads  18176 CUDA cores  434K LUTs, 3008 DSPs, 672 BRAMs, 320 URAMs
External Memory      1.5 TB DDR5            48 GB GDDR6       8 GB HBM + 32 GB DDR4
Memory Bandwidth     461 GB/s               960 GB/s          460 GB/s (HBM), 38 GB/s (DDR4)
Table 5.5: FPGA resource utilization.
LUTs DSPs BRAMs URAMs
Used 667K 6080 1214 640
Available 1303K 9204 2016 960
Utilization (%) 52% 66% 60% 66%
the state-of-the-art HBM-enabled FPGA platform, which has massive computation resources and memory
bandwidth.
5.7.1 FPGA Implementation
We implement the hardware design of ViTeGNN on Xilinx Alveo U280, which has three Super Logic Regions
(SLRs) denoted as SLR0, SLR1, and SLR2, respectively. SLR0 is connected to the high bandwidth memory
(HBM), which has 32 parallel memory channels, each with a peak memory bandwidth of around 14.4 GB/s.
The HBM can deliver up to 460 GB/s memory bandwidth and has a total capacity of 8 GB. Moreover, the Alveo
U280 board has two external DDR memory banks, which can deliver 38 GB/s peak memory bandwidth
Figure 5.5: Device map on Xilinx Alveo U280. The yellow boxes are on-chip high bandwidth memory of
FPGA that are directly connected to SLR0.
and have a total capacity of 32 GB. We implement our design using high-level synthesis (HLS). We implement
the Neighbor Update Unit (NUU) on SLR0 of the Alveo U280 because SLR0 is directly connected to the HBM,
leading to low-latency memory access. For the Neighbor Update Unit, we set S_nuu = 32 and P_nuu = 16. S_nuu is
set to 32 because the HBM has 32 parallel memory channels, and P_nuu is set to 16 to match the data rate of a
channel, because each channel can output 16 32-bit data elements per cycle. On each of SLR1 and SLR2,
we implement one Compute Unit. Each Compute Unit has two Multiply-Accumulate Arrays, each of
size 16 × 16 (S_g1 = 16 and S_g2 = 16). Each Compute Unit also has an Aggregation Module
with parallelism S_AM = 16 × 2. We perform synthesis and place-and-route using Xilinx Vitis v2022.2. The
device map is shown in Figure 5.5. The FPGA resource utilization is reported in Table 5.5.
Table 5.6: Accuracy (temporal link prediction and dynamic node classification) of ViTeGNN compared with
the baseline TGNN. The “Comp.” and “Comm.” columns denote the ratio of computation and communication
compared with the corresponding baseline models (-lat compares with -lat, etc.). The numbers in brackets
denote the accuracy differences compared to the Baseline(-bal) accuracy.

Link Prediction
Model           Comp.  Comm.  Wikipedia         Reddit            MOOC              Flights           GDELT
Baseline-lat    -      -      0.9900 (+0.0016)  0.9978 (+0.0002)  0.9959 (+0.0117)  0.9857 (+0.0045)  0.9939 (+0.0027)
ViTeGNN-lat     15%    42%    0.9887 (+0.0003)  0.9971 (−0.0005)  0.9840 (−0.0002)  0.9826 (+0.0014)  0.9912 (±0.0000)
Baseline(-bal)  -      -      0.9884 (±0.0000)  0.9976 (±0.0000)  0.9842 (±0.0000)  0.9812 (±0.0000)  0.9912 (±0.0000)
ViTeGNN-bal     20%    51%    0.9867 (−0.0015)  0.9969 (−0.0007)  0.9761 (−0.0081)  0.9781 (−0.0031)  0.9864 (−0.0048)
Baseline-thpt   -      -      0.9621 (−0.0263)  0.9762 (−0.0214)  0.9527 (−0.0316)  0.9629 (−0.0183)  0.9755 (−0.0157)
ViTeGNN-thpt    79%    100%   0.9589 (−0.0395)  0.9733 (−0.0243)  0.9512 (−0.0330)  0.9597 (−0.0215)  0.9703 (−0.0209)

Node Classification
Model           Comp.  Comm.  Wikipedia         Reddit            MOOC  Flights  GDELT
Baseline-lat    -      -      0.8833 (±0.0000)  0.6378 (±0.0000)  -     -        0.1527 (±0.0000)
ViTeGNN-lat     15%    42%    0.8821 (−0.0012)  0.6298 (−0.0080)  -     -        0.1520 (−0.0007)
Baseline(-bal)  -      -      0.8833 (±0.0000)  0.6378 (±0.0000)  -     -        0.1527 (±0.0000)
ViTeGNN-bal     20%    51%    0.8821 (−0.0012)  0.6298 (−0.0080)  -     -        0.1520 (−0.0007)
Baseline-thpt   -      -      0.8469 (−0.0364)  0.6012 (−0.0366)  -     -        0.1513 (−0.0014)
ViTeGNN-thpt    79%    100%   0.8461 (−0.0372)  0.5999 (−0.0379)  -     -        0.1504 (−0.0023)
5.7.2 Accuracy
We first evaluate the effect of the proposed model and hardware optimizations on accuracy. For the
model parameters, we follow TGN-attn [79] to use a one-layer self-attention TGNN with a hidden dimension
of 100. The dimension of the node memory and time encoding is also 100. For the training hyper-parameters,
we follow the settings in TGL [131] to set the training batch size to 600, the learning rate to 0.0001, and the
dropout rate to 0.1. We keep 4 neighbors in the simplified attention mechanism and set the temperature T
to 1 in the knowledge distillation setup. The simplified model is fine-tuned under the knowledge distillation
setup. For inference, we set the batch size to be 50, 200, and 200 for ViTeGNN-lat, ViTeGNN-bal, and
ViTeGNN-thpt, respectively. Following the setup in TGN [79] and TGL [131], we report the accuracy results
in Average Precision (AP) for both the temporal link prediction and dynamic node classification tasks,
except in F1-Micro for the dynamic node classification task on the GDELT dataset.
Current memory-based TGNN inference [56, 79, 131] follows the conventional inference algorithm that
updates the dynamic node embeddings of only the directly involved nodes (similar to the -bal variant in
ViTeGNN). However, in the temporal link prediction tasks, they compute the dynamic node embeddings of
the negative nodes on the fly. To ensure a fair comparison, we implement similar algorithms for the baseline
and denote them as Baseline-lat, Baseline(-bal), and Baseline-thpt. The difference between the baseline
models and the ViTeGNN models is that ViTeGNN uses the simplified model while the baseline uses the
original TGN-attn model. Table 5.6 shows the computation complexity and accuracy of three algorithms
on the baseline and simplified ViTeGNN models. ViTeGNN-lat/bal/thpt achieve an average computation
complexity reduction of 15/20/79% on five datasets, respectively. For the link prediction task, the gap
between the -lat and -bal inference algorithms is small, while the gap between the -bal and -thpt inference
algorithms is large. Compared with the baselines, ViTeGNN has an average accuracy loss of 0.0039, 0.0036,
and 0.0052 on the -lat, -bal, and -thpt algorithms, respectively. For the dynamic node classification task,
since all the nodes to classify have ongoing graph events, there is no accuracy difference between the -lat
Figure 5.6: Performance of ViTeGNN (computation latency and throughput) compared with the CPU and
GPU baselines under different batch sizes. For ViTeGNN-bal and its corresponding baselines, we only show
the computation latency, as the throughput is inversely proportional to the computation latency.
and -bal inference algorithms. Compared with the baselines, ViTeGNN has an average accuracy loss of
0.0038.
5.7.3 Latency and Throughput
We then compare the computation latency and throughput of ViTeGNN. Since the model and hardware
optimizations in ViTeGNN are designed explicitly for FPGAs, we use the original model TGN-attn on the
CPU and GPU. To identify the target nodes to infer in ViTeGNN-lat (Algorithm 10), we implement
a similar algorithm on CPU using a parallel hash table. Figure 5.6 shows the computation latency and
throughput under various batch sizes (in number of incoming edges) of ViTeGNN compared with CPU
and GPU. Note that although the computation latency for ViTeGNN-lat is higher than ViTeGNN-bal, since
ViTeGNN-lat has zero algorithm latency, the total latency of ViTeGNN-lat is still the lowest. Our FPGA
implementation achieves an average of 53.9/26.0/16.1× speedup over CPU and 8.2/4.0/2.5× speedup
over GPU for ViTeGNN-lat/-bal/-thpt, respectively. The reported average speedup is a geometric mean
across all the datasets. From the experimental results (Figure 5.6), we have several observations: (1) Our
FPGA implementation achieves higher speedup over the CPU/GPU when the batch size is small, because the FPGA
exploits fine-grained data parallelism, whereas the CPU/GPU exploit coarse-grained thread parallelism,
leading to low computation resource utilization at small batch sizes. (2) For ViTeGNN-lat, our FPGA
implementation achieves a higher speedup over the CPU/GPU than for ViTeGNN-bal/-thpt, because
our FPGA implementation can efficiently execute the affected node identification (Algorithm 10) through
our customized Neighbor Update Unit and the high bandwidth memory, whereas the CPU and GPU need to
use complex data structures for affected node identification, leading to large overhead. (3)
Our FPGA implementation has the customized Vertex Memory Updater to efficiently resolve data races,
whereas the CPU and GPU need to use expensive atomic operations to avoid data races when updating the
vertex memory. (4) For ViTeGNN-thpt, the GPU has higher throughput than the FPGA when the batch size is
large on Wikipedia and Reddit, because ViTeGNN-thpt only involves neighbor list updating and vertex
memory updating; vertex memory updating mainly involves matrix multiplication, which can be efficiently
executed on the GPU due to its much higher peak performance. Nevertheless, the FPGA outperforms the GPU at
smaller batch sizes for ViTeGNN-thpt, and it outperforms the GPU on ViTeGNN-lat/-bal because our FPGA
implementation can efficiently execute affected node identification (Algorithm 10) and GNN inference. To
summarize, the speedup of ViTeGNN on FPGA is due to our model-architecture co-design, which
results in low complexity and takes advantage of the FPGA features.
Figure 5.7: Actual and predicted throughput of ViTeGNN on FPGA under different batch sizes.
5.7.4 Performance Model
We evaluate our performance model by comparing the predicted latency and throughput with the experimental results. Figure 5.7 compares the predicted and actual throughput. We define the prediction
error as

Prediction error = (Predicted performance / Real performance − 1) × 100%.    (5.11)
On average, the prediction error ranges from 9.4% to 13% for ViTeGNN-lat/-bal/-thpt. We attribute the
prediction error to two causes: (1) the fine-grained hardware pipelines generated by Xilinx Vitis
have flushing and draining cycles that are not included in the performance model; these extra cycles are hard to
estimate since they are usually determined by the platform and the compiler version. (2) The
refreshing behavior of the DDR memory is hard to predict, which leads to periodic extra memory latency.
Accurate performance prediction is the key to ViTeGNN-auto. Based on our error analysis, we can
increase the margin when ViTeGNN-auto decides which inference mode to use given the actual throughput
Φ_batch.
Table 5.7: Computation complexity per node embedding, communication complexity per node embedding,
and the temporal link prediction accuracy of the original models, optimized models, and their variants.
The “+SAT” row denotes adding our simplified attention mechanism. The “+LUT” row denotes adding our
LUT-based time encoder. The “+NP(6/4/2)” rows denote adding neighbor pruning with 6/4/2 left neighbors,
respectively. We show the results with accumulated optimizations row by row on each dataset.
ViTeGNN-lat ViTeGNN-bal ViTeGNN-thpt
kMEM kMAC Accuracy kMEM kMAC Accuracy kMEM kMAC Accuracy
Model # % # % AP # % # % AP # % # % AP
Wikipedia
Baseline 21.2 100% 6107.5 100% 0.9900 3.2 100% 854.6 100% 0.9884 0.5 100% 59.9 100% 0.9621
+SAT 21.2 100% 3163.6 52% 0.9821 3.2 100% 467.7 55% 0.9804 0.5 100% 59.9 100% 0.9567
+LUT 21.2 100% 2303.3 38% 0.9891 3.2 100% 345.6 40% 0.9868 0.5 100% 49.4 82% 0.9608
+NP(6) 12.9 61% 1434.0 23% 0.9891 2.1 66% 231.3 27% 0.9867 0.5 100% 49.4 82% 0.9597
+NP(4) 8.8 42% 999.3 16% 0.9887 1.6 50% 174.2 20% 0.9867 0.5 100% 49.4 82% 0.9589
+NP(2) 4.6 22% 564.7 9% 0.9878 1.0 31% 117.1 14% 0.9821 0.5 100% 49.4 82% 0.9569
Reddit
Baseline 92.6 100% 26991.7 100% 0.9978 3.2 100% 854.6 100% 0.9976 0.5 100% 59.9 100% 0.9762
+SAT 92.6 100% 13881.6 51% 0.9967 3.2 100% 467.7 55% 0.9964 0.5 100% 59.9 100% 0.9744
+LUT 92.6 100% 10086.6 37% 0.9978 3.2 100% 345.6 40% 0.9971 0.5 100% 49.4 82% 0.9762
+NP(6) 55.8 60% 6215.4 23% 0.9971 2.1 66% 231.3 27% 0.9969 0.5 100% 49.4 82% 0.9751
+NP(4) 37.3 40% 4279.8 16% 0.9971 1.6 50% 174.2 20% 0.9962 0.5 100% 49.4 82% 0.9733
+NP(2) 18.9 20% 2344.2 9% 0.9948 1.0 31% 117.1 14% 0.9931 0.5 100% 49.4 82% 0.9667
MOOC
Baseline 27.3 100% 11823.7 100% 0.9959 1.3 100% 478.8 100% 0.9842 0.3 100% 41.8 100% 0.9527
+SAT 27.3 100% 6215.9 53% 0.9834 1.3 100% 270.8 57% 0.9793 0.3 100% 41.8 100% 0.9495
+LUT 27.3 100% 3148.0 27% 0.9871 1.3 100% 146.9 31% 0.9837 0.3 100% 31.3 75% 0.9516
+NP(6) 16.5 60% 2015.6 17% 0.9861 0.9 69% 104.9 22% 0.9799 0.3 100% 31.3 75% 0.9516
+NP(4) 11.1 41% 1449.4 12% 0.9840 0.7 54% 83.9 18% 0.9761 0.3 100% 31.3 75% 0.9512
+NP(2) 5.7 21% 883.3 7% 0.9811 0.5 38% 62.9 13% 0.9743 0.3 100% 31.3 75% 0.9805
Flights
Baseline 20.5 100% 8872.3 100% 0.9857 1.3 100% 478.8 100% 0.9812 0.3 100% 41.8 100% 0.9629
+SAT 20.5 100% 4669.2 53% 0.9837 1.3 100% 270.8 57% 0.9786 0.3 100% 41.8 100% 0.9620
+LUT 20.5 100% 2367.2 27% 0.9861 1.3 100% 146.9 31% 0.9812 0.3 100% 31.3 75% 0.9632
+NP(6) 12.4 60% 1518.5 17% 0.9841 0.9 69% 104.9 22% 0.9795 0.3 100% 31.3 75% 0.9618
+NP(4) 8.4 41% 1094.2 12% 0.9826 0.7 54% 83.9 18% 0.9781 0.3 100% 31.3 75% 0.9597
+NP(2) 4.3 21% 669.8 8% 0.9810 0.5 38% 62.9 13% 0.9753 0.3 100% 31.3 75% 0.9571
GDELT
Baseline 148.9 100% 17348.7 100% 0.9939 6.9 100% 804.2 100% 0.9912 0.4 100% 55.5 100% 0.9755
+SAT 148.9 100% 9421.5 54% 0.9919 6.9 100% 461.0 57% 0.9891 0.4 100% 55.5 100% 0.9725
+LUT 148.9 100% 6821.8 39% 0.9916 6.9 100% 338.4 42% 0.9884 0.4 100% 45.0 81% 0.9719
+NP(6) 89.5 60% 4590.6 26% 0.9916 4.3 62% 241.8 30% 0.9869 0.4 100% 45.0 81% 0.9708
+NP(4) 59.8 40% 3475.0 20% 0.9912 3.0 43% 193.5 24% 0.9864 0.4 100% 45.0 81% 0.9703
+NP(2) 30.1 20% 2359.3 14% 0.9909 1.7 25% 145.2 18% 0.9845 0.4 100% 45.0 81% 0.9653
5.7.5 Ablation Study
We evaluate the effects of the proposed model optimizations on accuracy and on the reduction of communication and computation. Table 5.7 shows the computation complexity (in kMACs) per node embedding,
the communication complexity (in kMEMs) per node embedding, and the temporal link prediction accuracy of
the original model and different variants of ViTeGNN in the three inference modes. Our simplified attention
mechanism and neighbor pruning with more than two remaining neighbors lead to a slight drop in the
temporal link prediction accuracy. Note that our LUT-based time encoder achieves better accuracy than
the original learnable time encoding on all datasets, and similar accuracy on the GDELT dataset. This is
consistent with the recent discovery of non-smooth landscapes in the loss plane of the original learnable
time encoder, which cause the optimizer to get stuck in local minima [22]. Since pruning 8 neighbors (NP(2))
leads to a significant accuracy drop, we prune 6 neighbors to balance runtime and accuracy. For
ViTeGNN-lat/-bal, our neighbor pruning technique reduces the communication complexity near-linearly with
the portion of pruned neighbors, since fetching the node memory and features of historical neighbors is
the bottleneck in communication. Our simplified attention mechanism reduces the computation complexity
of ViTeGNN-lat/-bal by about half. Our LUT-based time encoder further reduces around 15% of the computation
complexity, while our neighbor pruning technique near-linearly reduces the remaining complexity with
respect to the number of neighbors. For ViTeGNN-thpt, since only the node memory needs to be updated
online, the reduction in complexity is not significant.
Chapter 6
Conclusion and Future Works
6.1 Summary of Contributions
This dissertation scales up temporal graph learning through model-algorithm-system co-design. To overcome the four challenges (C1: prevalent noise in real-world data, C2: irregular memory access, C3: complex
temporal dependencies, and C4: high computation complexity), we completed four major works: the noise-resistant TGNN TASER, the scalable TGNN training solutions TGL and DistTGL, and the versatile TGNN
inference solution ViTeGNN. Table 6.1 summarizes the main model, algorithm, and system contributions
of these works.
Specifically, this dissertation leads to the following advancements in TGNNs:
• We made TGNNs more robust and noise-resistant. We achieved an average accuracy improvement of
2.3% in MRR on five widely-used datasets, compared with state-of-the-art TGNN models. We also
sped up the computation of these robust and noise-resistant TGNNs by an average of 5.1× on a
single GPU.
• We made TGNNs training accurate, scalable, and user-friendly. Users can compose various TGNN
variants by writing simple configurations. On a single machine with multiple GPUs, we made it
possible to train TGNNs on billion-scale graphs, achieving an average of 2.3× speedup compared with
           Model improvement                Algorithm improvement              System improvement
TASER      • Adaptive neighbor sampling     • Adaptive mini-batch selection    • Dynamic GPU cache
                                                                               • GPU neighbor finding
TGL        • Unified TGNN model             • CPU neighbor sampling            • Multi-GPU training system
                                            • Random chunk scheduling
DistTGL    • Static memory enhancement      • Epoch/memory parallelism         • Distributed GPU training system
                                                                               • Guidelines for optimal configurations
ViTeGNN    • Simplified temporal attention  • Multi-mode inference             • FPGA inference accelerator
           • Temporal neighbor pruning

Table 6.1: Summary of completed work in this dissertation.
state-of-the-art methods. On distributed GPU clusters, we outperformed state-of-the-art methods by
an average of 14.5% in accuracy and 10.2× in training throughput.
• We made TGNN inference versatile, accurate, and fast. Our FPGA accelerator supports automatic
inference mode selection and achieves more than 16.1×/2.5× speedup over state-of-the-art CPU and GPU
platforms while drawing less than one-sixth of their power.
• We open-sourced all the code and datasets in this dissertation, fostering collaboration with other
researchers and accelerating greater breakthroughs.
6.2 Future Works
6.2.1 In-Depth TGNN Benchmarking
Recently, a new stream of research has emerged, focusing on the critical evaluation and fair comparison
of these TGNN models [74, 46, 45]. Despite ongoing efforts to improve models and evaluation within
the domain of TGNNs, we identify two critical issues. First, the evaluation of TGNNs often lacks a
comprehensive exploration of the design space. TGNN models typically share similar sets of design choices
with previous works, encompassing several modules with multiple options for each. However, existing model
advancement works pay little attention to searching the design space and settle for partial success in limited
portions of it. Consequently, module advancements may not deliver the full model performance when
the other parts of the architecture are carelessly designed. At the same time, evaluation works predominantly
focus on comparing the proposed models in their original form and ignore possibly better variants.
Second, benchmarking is not performed in a unified framework with reasonable optimization. For example,
some evaluation works [11, 46] sequentially sample neighbors for every node as in the original TGN [79] code
implementation, creating large overheads for training and inference. These unoptimized code frameworks
obscure accurate performance evaluation and create a distorted view of the comparative efficiency of
different modules. As a result, there is a limited understanding of the TGNN design space, hindering the
development of truly optimized models.
With the efficient training and inference frameworks in this dissertation, we can address many important
research questions in TGNNs, including the efficiency and cost-effectiveness of module designs (What
module designs, such as the choice between lightweight and more complex neighbor aggregators, strike the
optimal balance between effectiveness and resource efficiency?), the universality of module effectiveness
across datasets (Do the best-performing modules on some datasets, like specific types of node memory,
maintain their effectiveness universally across all datasets, or is module performance dependent on dataset
characteristics?), and the interplay between different modules in the model (Does the integration of various
modules enhance or undermine performance? For example, does the combination of node memory and a
deeper neighbor sampling strategy integrate the improvements offered by each module?). These detailed
and fair comparisons will provide deeper insights into model performance and suggest future directions.
6.2.2 TGNNs on Heterogeneous Platforms
The large-scale adoption of distributed heterogeneous systems to perform analytics in many science and
engineering disciplines has led to an exponential increase in workloads [5, 65]. With the end of Moore’s
law in sight, these heterogeneous systems are being augmented with FPGAs and domain-specific accelerators,
and integrated with advanced memory technologies, including high bandwidth memory and cache-coherent
interconnects, in addition to GPUs, to achieve scalable application performance [109]. Optimization
strategies that target individual devices are inadequate to realize high-performance designs on these
emerging heterogeneous architectures. Novel memory and integrated optimizations, partitioning, and
mapping techniques are required to exploit the high bandwidth and heterogeneous computational resources.
To further scale up TGNNs, one important future work is to develop a heterogeneity-aware hardware
mapping methodology to accelerate state-of-the-art TGNN algorithms and models, and to develop domain-specific end-to-end solution variants for key temporal graph ML applications.
6.2.3 Multimodal TGNNs
Recently, large language models have been making impressive strides in many important natural language
processing tasks, including chatbots [113], reasoning [101], and code generation [67]. The raw features
of dynamic graphs often include text and images. For example, posts on social networks and reviews on
online retail sites both contain raw text and images. In current dynamic graph datasets, these raw features are
first encoded into low-dimensional vectors to serve as the node or edge features, which limits the immense
potential of analyzing data that incorporates different information types. Multimodal TGNNs that allow
nodes or edges to carry text and image data can learn complete systems in an end-to-end
fashion. This has significant implications for various fields. Imagine an intelligent system in a self-driving
car that not only processes visual data but also incorporates weather forecasts and past traffic patterns
to predict future road conditions. Similarly, healthcare could benefit from models that analyze medical
images alongside a patient’s medical history to provide more accurate diagnoses. The keys to multimodal
TGNNs are to develop efficient training algorithms and to address challenges such as data synchronization
across modalities, which would be a promising future direction to work on.
Bibliography
[1] Alibaba. Euler-2.0. https://github.com/alibaba/euler. 2020.
[2] AMD Alevo U280 FPGA. url: https://www.xilinx.com/products/boards-and-kits/alveo/u280.html.
[3] AMD EPYC Server Processors. url: https://www.amd.com/en/processors/epyc-server-cpu-family.
[4] Idan Amir, Tomer Koren, and Roi Livni. “SGD generalizes better than GD (and regularization
doesn’t help)”. In: Conference on Learning Theory. PMLR. 2021, pp. 63–92.
[5] Prashanth B Bhat, Cauligi S Raghavendra, and Viktor K Prasanna. “Efficient collective
communication in distributed heterogeneous systems”. In: Proceedings. 19th IEEE International
Conference on Distributed Computing Systems (Cat. No. 99CB37003). IEEE. 1999, pp. 15–24.
[6] Ingwer Borg and Patrick Groenen. “Modern multidimensional scaling: Theory and applications”. In:
Journal of Educational Measurement 40.3 (2003), pp. 277–280.
[7] Shaked Brody, Uri Alon, and Eran Yahav. “How Attentive are Graph Attention Networks?” In:
International Conference on Learning Representations. 2021.
[8] Shaked Brody, Uri Alon, and Eran Yahav. “How Attentive are Graph Attention Networks?” In:
International Conference on Learning Representations. 2022. url:
https://openreview.net/forum?id=F72ximsx7C1.
[9] Zhenkun Cai, Xiao Yan, Yidi Wu, Kaihao Ma, James Cheng, and Fan Yu. “DGCL: An Efficient
Communication Library for Distributed GNN Training”. In: Proceedings of the Sixteenth European
Conference on Computer Systems. EuroSys ’21. Online Event, United Kingdom: Association for
Computing Machinery, 2021, pp. 130–144. isbn: 9781450383349. doi: 10.1145/3447786.3456233.
[10] Venkatesan T Chakaravarthy, Shivmaran S Pandian, Saurabh Raje, Yogish Sabharwal,
Toyotaro Suzumura, and Shashanka Ubaru. “Efficient scaling of dynamic graph neural networks”.
In: Proceedings of the International Conference for High Performance Computing, Networking, Storage
and Analysis. 2021, pp. 1–15.
[11] Hanqiu Chen, Yahya Alhinai, Yihan Jiang, Eunjee Na, and Cong Hao. “Bottleneck analysis of
dynamic graph neural network inference on cpu and gpu”. In: 2022 IEEE International Symposium
on Workload Characterization (IISWC). IEEE. 2022, pp. 130–145.
[12] Hongjiang Chen, Pengfei Jiao, Huijun Tang, and Huaming Wu. “Temporal Graph Representation
Learning with Adaptive Augmentation Contrastive”. In: Joint European Conference on Machine
Learning and Knowledge Discovery in Databases. Springer. 2023, pp. 683–699.
[13] Jianfei Chen, Jun Zhu, and Le Song. “Stochastic Training of Graph Convolutional Networks with
Variance Reduction.” In: ICML. 2018, pp. 941–949.
[14] Jie Chen, Tengfei Ma, and Cao Xiao. “FastGCN: Fast Learning with Graph Convolutional Networks
via Importance Sampling”. In: International Conference on Learning Representations (ICLR). 2018.
[15] Marcus Chen, Ivor W Tsang, Mingkui Tan, and Tat Jen Cham. “A unified feature selection
framework for graph embedding on high dimensional data”. In: IEEE Transactions on Knowledge
and Data Engineering 27.6 (2014), pp. 1465–1477.
[16] Xinshi Chen, Yan Zhu, Haowen Xu, Mengyang Liu, Liang Xiong, Muhan Zhang, and Le Song.
Efficient Dynamic Graph Representation Learning at Scale. 2021. arXiv: 2112.07768 [cs.LG].
[17] Dawei Cheng, Xiaoyang Wang, Ying Zhang, and Liqing Zhang. “Graph neural network for fraud
detection via spatial-temporal attention”. In: IEEE Transactions on Knowledge and Data Engineering
34.8 (2020), pp. 3800–3813.
[18] Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. “Cluster-GCN: An
Efficient Algorithm for Training Deep and Large Graph Convolutional Networks”. In: CoRR
abs/1905.07953 (2019). arXiv: 1905.07953. url: http://arxiv.org/abs/1905.07953.
[19] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. “Empirical evaluation of
gated recurrent neural networks on sequence modeling”. In: arXiv preprint arXiv:1412.3555 (2014).
[20] Jason Cong, Zhenman Fang, Michael Lo, Hanrui Wang, Jingxian Xu, and Shaochong Zhang.
“Understanding performance differences of FPGAs and GPUs”. In: 2018 IEEE 26th Annual
International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE. 2018,
pp. 93–96.
[21] Weilin Cong, Rana Forsati, Mahmut Kandemir, and Mehrdad Mahdavi. “Minimal Variance
Sampling with Provable Guarantees for Fast Training of Graph Neural Networks”. In: Proceedings of
the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’20.
Virtual Event, CA, USA: Association for Computing Machinery, 2020, pp. 1393–1403. isbn:
9781450379984. doi: 10.1145/3394486.3403192.
[22] Weilin Cong, Si Zhang, Jian Kang, Baichuan Yuan, Hao Wu, Xin Zhou, Hanghang Tong, and
Mehrdad Mahdavi. “Do We Really Need Complicated Model Architectures For Temporal
Networks?” In: ICLR. 2023.
[23] Eleonora D’Arnese, Davide Conficconi, Marco D Santambrogio, and Donatella Sciuto.
“Reconfigurable architectures: The shift from general systems to domain specific solutions”. In:
Emerging Computing: From Devices to Systems: Looking Beyond Moore and Von Neumann. Springer,
2022, pp. 435–456.
[24] John Davies, Dieter Fensel, and Frank Van Harmelen. Towards the semantic web: ontology-driven
knowledge management. John Wiley & Sons, 2003.
[25] Stacy Jo Dixon. Facebook mau worldwide 2023. 2024. url: https:
//www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/.
[26] Jialin Dong, Da Zheng, Lin F Yang, and George Karypis. “Global neighbor sampling for mixed
CPU-GPU training on giant graphs”. In: Proceedings of the 27th ACM SIGKDD Conference on
Knowledge Discovery & Data Mining. 2021, pp. 289–299.
[27] Ziwei Fan, Zhiwei Liu, Jiawei Zhang, Yun Xiong, Lei Zheng, and Philip S. Yu. “Continuous-Time
Sequential Recommendation with Temporal Graph Collaborative Transformer”. In: Proceedings of
the 30th ACM International Conference on Information & Knowledge Management. CIKM ’21. Virtual
Event, Queensland, Australia: Association for Computing Machinery, 2021, pp. 433–442. isbn:
9781450384469. doi: 10.1145/3459637.3482242.
[28] Matthias Fey and Jan E. Lenssen. “Fast Graph Representation Learning with PyTorch Geometric”.
In: ICLR Workshop on Representation Learning on Graphs and Manifolds. 2019.
[29] Friendster social network and ground-truth communities.
https://snap.stanford.edu/data/com-Friendster.html.
[30] Mohammed Ali Al-Garadi, Kasturi Dewi Varathan, Sri Devi Ravana, Ejaz Ahmed, Ghulam Mujtaba,
Muhammad Usman Shahid Khan, and Samee U Khan. “Analysis of online social network
connections for identification of influential users: Survey and open research issues”. In: ACM
Computing Surveys (CSUR) 51.1 (2018), pp. 1–37.
[31] Liyu Gong and Qiang Cheng. “Exploiting Edge Features for Graph Neural Networks”. In: 2019
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019, pp. 9203–9211. doi:
10.1109/CVPR.2019.00943.
[32] Palash Goyal, Sujit Rokka Chhetri, Ninareh Mehrabi, Emilio Ferrara, and Arquimedes Canedo.
“DynamicGEM: A Library for Dynamic Graph Embedding Methods”. In: arXiv preprint
arXiv:1811.10734 (2018).
[33] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola,
Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet
in 1 Hour. 2017. arXiv: 1706.02677 [cs.CV].
[34] O. Green, J. Fox, J. Young, J. Shirako, and D. Bader. “Performance Impact of Memory Channels on
Sparse and Irregular Algorithms”. In: 2019 IEEE/ACM 9th Workshop on Irregular Applications:
Architectures and Algorithms (IA3). Los Alamitos, CA, USA: IEEE Computer Society, 2019, pp. 67–70.
doi: 10.1109/IA349570.2019.00016.
[35] Derek Greene, Donal Doyle, and Padraig Cunningham. “Tracking the evolution of communities in
dynamic social networks”. In: 2010 international conference on advances in social networks analysis
and mining. IEEE. 2010, pp. 176–183.
[36] Aditya Grover and Jure Leskovec. “node2vec: Scalable feature learning for networks”. In:
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining. ACM. 2016, pp. 855–864.
[37] Ehsan Hajiramezanali, Arman Hasanzadeh, Krishna Narayanan, Nick Duffield, Mingyuan Zhou,
and Xiaoning Qian. “Variational graph recurrent neural networks”. In: Advances in Neural
Information Processing Systems. 2019, pp. 10700–10710.
[38] Will Hamilton, Zhitao Ying, and Jure Leskovec. “Inductive Representation Learning on Large
Graphs”. In: Advances in Neural Information Processing Systems 30. 2017, pp. 1024–1034.
[39] F. Harary and G. Gupta. “Dynamic graph models”. In: Mathematical and Computer Modelling 25.7
(1997), pp. 79–87. issn: 0895-7177. doi: https://doi.org/10.1016/S0895-7177(97)00050-2.
[40] Xiaofei He and Partha Niyogi. “Locality preserving projections”. In: Advances in Neural Information
Processing Systems. 2004, pp. 153–160.
[41] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the Knowledge in a Neural Network”. In:
arXiv preprint arXiv:1503.02531 (2015).
[42] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Comput. 9.8
(1997), pp. 1735–1780. issn: 0899-7667. doi: 10.1162/neco.1997.9.8.1735.
[43] Weihua Hu, Matthias Fey, Hongyu Ren, Maho Nakata, Yuxiao Dong, and Jure Leskovec. “OGB-LSC:
A Large-Scale Challenge for Machine Learning on Graphs”. In: arXiv preprint arXiv:2103.09430
(2021).
[44] Chengying Huan, Shuaiwen Leon Song, Santosh Pandey, Hang Liu, Yongchao Liu, Baptiste Lepers,
Changhua He, Kang Chen, Jinlei Jiang, and Yongwei Wu. “TEA: A General-Purpose Temporal
Graph Random Walk Engine”. In: Proceedings of the Eighteenth European Conference on Computer
Systems. EuroSys ’23. Rome, Italy: Association for Computing Machinery, 2023, pp. 182–198. isbn:
9781450394871. doi: 10.1145/3552326.3567491.
[45] Qiang Huang, Jiawei Jiang, Xi Susie Rao, Ce Zhang, Zhichao Han, Zitao Zhang, Xin Wang,
Yongjun He, Quanqing Xu, Yang Zhao, et al. “BenchTemp: A General Benchmark for Evaluating
Temporal Graph Neural Networks”. In: arXiv preprint arXiv:2308.16385 (2023).
[46] Shenyang Huang, Farimah Poursafaei, Jacob Danovitch, Matthias Fey, Weihua Hu, Emanuele Rossi,
Jure Leskovec, Michael Bronstein, Guillaume Rabusseau, and Reihaneh Rabbany. “Temporal graph
benchmark for machine learning on temporal graphs”. In: Advances in Neural Information
Processing Systems 36 (2024).
[47] Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. “Adaptive sampling towards fast
graph representation learning”. In: Advances in neural information processing systems 31 (2018).
[48] Wenzel Jakob, Jason Rhinelander, and Dean Moldovan. pybind11 — Seamless operability between
C++11 and Python. https://github.com/pybind/pybind11. 2016.
[49] Abhinav Jangda, Sandeep Polisetty, Arjun Guha, and Marco Serafini. “Accelerating Graph Sampling
for Graph Machine Learning Using GPUs”. In: Proceedings of the Sixteenth European Conference on
Computer Systems. EuroSys ’21. Online Event, United Kingdom: Association for Computing
Machinery, 2021, pp. 311–326. isbn: 9781450383349. doi: 10.1145/3447786.3456244.
[50] Ming Jin, Yuan-Fang Li, and Shirui Pan. “Neural Temporal Walks: Motif-Aware Representation
Learning on Continuous-Time Dynamic Graphs”. In: Advances in Neural Information Processing
Systems. Ed. by Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho. 2022. url:
https://openreview.net/forum?id=NqbktPUkZf7.
[51] Woojeong Jin, Meng Qu, Xisen Jin, and Xiang Ren. “Recurrent Event Network: Autoregressive
Structure Inference over Temporal Knowledge Graphs”. In: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational
Linguistics, 2020, pp. 6669–6683. doi: 10.18653/v1/2020.emnlp-main.541.
[52] George Karypis and Vipin Kumar. “METIS: A software package for partitioning unstructured
graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices”. Tech. rep. University of Minnesota, 1997.
[53] Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay Sethi, Peter Forsyth, and
Pascal Poupart. “Representation Learning for Dynamic Graphs: A Survey.” In: J. Mach. Learn. Res.
21.70 (2020), pp. 1–73.
[54] Thomas N. Kipf and Max Welling. “Semi-Supervised Classification with Graph Convolutional
Networks”. In: CoRR abs/1609.02907 (2016). arXiv: 1609.02907. url:
http://arxiv.org/abs/1609.02907.
[55] Pushmeet Kohli and Philip HS Torr. “Dynamic graph cuts and their applications in computer
vision”. In: Computer Vision: Detection, Recognition and Reconstruction. Springer, 2010, pp. 51–108.
[56] Srijan Kumar, Xikun Zhang, and Jure Leskovec. “Predicting Dynamic Embedding Trajectory in
Temporal Interaction Networks”. In: Proceedings of the 25th ACM SIGKDD international conference
on Knowledge discovery and data mining. ACM. 2019.
[57] Srijan Kumar, Xikun Zhang, and Jure Leskovec. “Predicting Dynamic Embedding Trajectory in
Temporal Interaction Networks”. In: Proceedings of the 25th ACM SIGKDD International Conference
on Knowledge Discovery & Data Mining. KDD ’19. Anchorage, AK, USA: Association for Computing
Machinery, 2019, pp. 1269–1278. isbn: 9781450362016. doi: 10.1145/3292500.3330895.
[58] Kalev Leetaru and Philip A. Schrodt. “GDELT: Global data on events, location, and tone”. In: ISA
Annual Convention (2013). url:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.686.6605.
[59] Jintang Li, Sheng Tian, Ruofan Wu, Liang Zhu, Welong Zhao, Changhua Meng, Liang Chen,
Zibin Zheng, and Hongzhi Yin. “Less Can Be More: Unsupervised Graph Pruning for Large-scale
Dynamic Graphs”. In: arXiv preprint arXiv:2305.10673 (2023).
[60] Xiaorui Liu, Jiayuan Ding, Wei Jin, Han Xu, Yao Ma, Zitao Liu, and Jiliang Tang. “Graph Neural
Networks with Adaptive Residual”. In: Advances in Neural Information Processing Systems. Ed. by
A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan. 2021. url:
https://openreview.net/forum?id=hfkER_KJiNw.
[61] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy,
Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. “RoBERTa: A Robustly Optimized BERT
Pretraining Approach”. In: arXiv preprint arXiv:1907.11692 (2019).
[62] Ziqi Liu, Zhengwei Wu, Zhiqiang Zhang, Jun Zhou, Shuang Yang, Le Song, and Yuan Qi. “Bandit
samplers for training graph neural networks”. In: Advances in Neural Information Processing
Systems 33 (2020), pp. 6878–6888.
[63] Yuhong Luo and Pan Li. “Neighborhood-aware Scalable Temporal Network Representation
Learning”. In: Learning on Graphs Conference (2022).
[64] Seung Won Min, Kun Wu, Mert Hidayetoglu, Jinjun Xiong, Xiang Song, and Wen-mei Hwu. “Graph
neural network training and data tiering”. In: Proceedings of the 28th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining. 2022, pp. 3555–3565.
[65] Ravi Mirchandaney, Don Towsley, and John A Stankovic. “Adaptive load sharing in heterogeneous
distributed systems”. In: Journal of parallel and distributed computing 9.4 (1990), pp. 331–346.
[66] Andy Nguyen, Ahmed E. Helal, Fabio Checconi, Jan Laukemann, Jesmin Jahan Tithi, Yongseok Soh,
Teresa Ranadive, Fabrizio Petrini, and Jee W. Choi. “Efficient, out-of-memory sparse MTTKRP on
massively parallel architectures”. In: Proceedings of the 36th ACM International Conference on
Supercomputing. ICS ’22. Virtual Event: Association for Computing Machinery, 2022. isbn:
9781450392815. doi: 10.1145/3524059.3532363.
[67] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and
Xi Victoria Lin. “Lever: Learning to verify language-to-code generation with execution”. In:
International Conference on Machine Learning. PMLR. 2023, pp. 26106–26128.
[68] NVIDIA H100 Tensor Core GPU. url: https://www.nvidia.com/en-us/data-center/h100/.
[69] NVIDIA Unveils Next-Generation GH200 Grace Hopper Superchip Platform for Era of Accelerated
Computing and Generative AI. 2024. url:
https://nvidianews.nvidia.com/news/gh200-grace-hopper-superchip-with-hbm3e-memory.
[70] Santosh Pandey, Lingda Li, Adolfy Hoisie, Xiaoye S Li, and Hang Liu. “C-SAW: A framework for
graph sampling and random walk on GPUs”. In: SC20: International Conference for High
Performance Computing, Networking, Storage and Analysis. IEEE. 2020, pp. 1–15.
[71] Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi,
Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. “EvolveGCN: Evolving Graph Convolutional
Networks for Dynamic Graphs”. In: Proceedings of the Thirty-Fourth AAAI Conference on Artificial
Intelligence. 2020.
[72] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf,
Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. “PyTorch: An Imperative Style,
High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems
32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Curran
Associates, Inc., 2019, pp. 8024–8035. url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[73] Farimah Poursafaei, Andy Huang, Kellin Pelrine, and Reihaneh Rabbany. “Towards Better
Evaluation for Dynamic Link Prediction”. In: Thirty-sixth Conference on Neural Information
Processing Systems Datasets and Benchmarks Track. 2022. url:
https://openreview.net/forum?id=1GVpwr2Tfdg.
[74] Farimah Poursafaei, Shenyang Huang, Kellin Pelrine, and Reihaneh Rabbany. “Towards better
evaluation for dynamic link prediction”. In: Advances in Neural Information Processing Systems 35
(2022), pp. 32928–32941.
[75] Markus Püschel and José MF Moura. “Algebraic signal processing theory: Foundation and 1-D
time”. In: IEEE Transactions on Signal Processing 56.8 (2008), pp. 3572–3585.
[76] Ali Rahimi and Benjamin Recht. “Random features for large-scale kernel machines”. In: Advances in
neural information processing systems 20 (2007).
[77] Hesham Rakha and Aly Tawfik. “Traffic Networks: Dynamic Traffic Routing, Assignment, and
Assessment”. In: Encyclopedia of Complexity and Systems Science. Ed. by Robert A. Meyers. New
York, NY: Springer New York, 2009, pp. 9429–9470. isbn: 978-0-387-30440-3. doi:
10.1007/978-0-387-30440-3_562.
[78] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang.
“Self-supervised graph transformer on large-scale molecular data”. In: Advances in neural
information processing systems 33 (2020), pp. 12559–12571.
[79] Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and
Michael Bronstein. “Temporal Graph Networks for Deep Learning on Dynamic Graphs”. In: ICML
2020 Workshop on Graph Representation Learning. 2020.
[80] Benedek Rozemberczki, Paul Scherer, Yixuan He, George Panagopoulos, Alexander Riedel,
Maria Astefanoaei, Oliver Kiss, Ferenc Beres, Guzman Lopez, Nicolas Collignon, and Rik Sarkar.
“PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning
Models”. In: Proceedings of the 30th ACM International Conference on Information and Knowledge
Management. 2021.
[81] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by
error propagation. Tech. rep. University of California, San Diego, Institute for Cognitive Science, 1985.
[82] Aliaksei Sandryhaila and José MF Moura. “Discrete signal processing on graphs”. In: IEEE
Transactions on Signal Processing 61.7 (2013), pp. 1644–1656.
[83] Aliaksei Sandryhaila and José MF Moura. “Discrete signal processing on graphs: Graph filters”. In:
2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE. 2013,
pp. 6163–6166.
[84] Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang. “DySAT: Deep Neural
Representation Learning on Dynamic Graphs via Self-Attention Networks”. In: Proceedings of the
13th International Conference on Web Search and Data Mining. 2020, pp. 519–527.
[85] Qi Shen, Shixuan Zhu, Yitong Pang, Yiming Zhang, and Zhihua Wei. “Temporal aware
multi-interest graph neural network for session-based recommendation”. In: Asian Conference on
Machine Learning. PMLR. 2023.
[86] David Shuman, Sunil Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. “The
Emerging Field of Signal Processing on Graphs: Extending High-Dimensional Data Analysis to
Networks and Other Irregular Domains”. In: IEEE Signal Processing Magazine 30.3 (2013), pp. 83–98.
[87] Michael Siering, Benjamin Clapham, Oliver Engel, and Peter Gomber. “A taxonomy of financial
market manipulations: establishing trust and market integrity in the financialized economy
through automated fraud detection”. In: Journal of Information Technology 32.3 (2017), pp. 251–269.
[88] Aditya Singh, Anubhav Gupta, Hardik Wadhwa, Siddhartha Asthana, and Ankur Arora. “Temporal
debiasing using adversarial loss based GNN architecture for crypto fraud detection”. In: 2021 20th
IEEE International Conference on Machine Learning and Applications (ICMLA). IEEE. 2021,
pp. 391–396.
[89] Sebastian U Stich, Anant Raj, and Martin Jaggi. “Safe adaptive importance sampling”. In: Advances
in Neural Information Processing Systems 30 (2017).
[90] Kate Sukhanova. Surprising Amazon statistics you need to know in 2023. 2023. url:
https://techreport.com/statistics/amazon-statistics/.
[91] Zeyuan Tan, Xiulong Yuan, Congjie He, Man-Kit Sit, Guo Li, Xiaoze Liu, Baole Ai, Kai Zeng,
Peter Pietzuch, and Luo Mai. Quiver: Supporting GPUs for Low-Latency, High-Throughput GNN
Serving with Workload Awareness. 2023. arXiv: 2305.10863 [cs.DC].
[92] Sheng Tian, Ruofan Wu, Leilei Shi, Liang Zhu, and Tao Xiong. “Self-Supervised Representation
Learning on Dynamic Graphs”. In: Proceedings of the 30th ACM International Conference on
Information & Knowledge Management. CIKM ’21. Virtual Event, Queensland, Australia:
Association for Computing Machinery, 2021, pp. 1814–1823. isbn: 9781450384469. doi:
10.1145/3459637.3482389.
[93] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai,
Thomas Unterthiner, Jessica Yung, Andreas Peter Steiner, Daniel Keysers, Jakob Uszkoreit,
Mario Lucic, and Alexey Dosovitskiy. “MLP-Mixer: An all-MLP Architecture for Vision”. In:
Advances in Neural Information Processing Systems. Ed. by A. Beygelzimer, Y. Dauphin, P. Liang, and
J. Wortman Vaughan. 2021. url: https://openreview.net/forum?id=EI2KOXKdnP.
[94] Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. “Know-Evolve: Deep Temporal Reasoning
for Dynamic Knowledge Graphs”. In: Proceedings of the 34th International Conference on Machine
Learning - Volume 70. ICML’17. Sydney, NSW, Australia: JMLR.org, 2017, pp. 3462–3471.
[95] Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. “DyRep: Learning
Representations over Dynamic Graphs”. In: International Conference on Learning Representations.
2019. url: https://openreview.net/forum?id=HyePrhR5KX.
[96] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need”. In: Advances in neural information
processing systems 30 (2017).
[97] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Łukasz Kaiser, and Illia Polosukhin. “Attention is All You Need”. In: International Conference on
Neural Information Processing Systems. 2017.
[98] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and
Yoshua Bengio. “Graph attention networks”. In: arXiv preprint arXiv:1710.10903 (2017).
[99] Jesse Vig, Shilad Sen, and John Riedl. “The Tag Genome: Encoding Community Knowledge to
Support Novel Interaction”. In: ACM Trans. Interact. Intell. Syst. 2.3 (2012). issn: 2160-6455. doi:
10.1145/2362394.2362395.
[100] Zhongwei Wan, Xin Liu, Benyou Wang, Jiezhong Qiu, Boyu Li, Ting Guo, Guangyong Chen, and
Yang Wang. “Spatio-temporal Contrastive Learning-enhanced GNNs for Session-based
Recommendation”. In: ACM Transactions on Information Systems 42.2 (2023), pp. 1–26.
[101] Boshi Wang, Xiang Yue, and Huan Sun. “Can ChatGPT defend its belief in truth? evaluating LLM
reasoning via debate”. In: Findings of the Association for Computational Linguistics: EMNLP 2023.
2023, pp. 11865–11881.
[102] Lei Wang, Qiang Yin, Chao Tian, Jianbang Yang, Rong Chen, Wenyuan Yu, Zihang Yao, and
Jingren Zhou. “FlexGraph: a flexible and efficient distributed framework for GNN training”. In:
Proceedings of the Sixteenth European Conference on Computer Systems. 2021, pp. 67–82.
[103] Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma,
Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. “Deep
Graph Library: A Graph-Centric, Highly-Performant Package for Graph Neural Networks”. In:
arXiv preprint arXiv:1909.01315 (2019).
[104] Pengyu Wang, Chao Li, Jing Wang, Taolei Wang, Lu Zhang, Jingwen Leng, Quan Chen, and
Minyi Guo. “Skywalker: Efficient Alias-Method-Based Graph Sampling and Random Walk on
GPUs”. In: 2021 30th International Conference on Parallel Architectures and Compilation Techniques
(PACT). 2021, pp. 304–317. doi: 10.1109/PACT52795.2021.00029.
[105] Xuhong Wang, Ding Lyu, Mengjian Li, Yang Xia, Qi Yang, Xinwen Wang, Xinguang Wang,
Ping Cui, Yupu Yang, Bowen Sun, and Zhenyu Guo. “APAN: Asynchronous Propagation Attention
Network for Real-Time Temporal Graph Embedding”. In: Proceedings of the 2021 International
Conference on Management of Data. SIGMOD/PODS ’21. Virtual Event, China: Association for
Computing Machinery, 2021, pp. 2628–2638. isbn: 9781450383431. doi: 10.1145/3448016.3457564.
[106] Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. “Inductive Representation
Learning in Temporal Networks via Causal Anonymous Walks”. In: International Conference on
Learning Representations. 2021. url: https://openreview.net/forum?id=KYPz4YsCPj.
[107] Yiwei Wang, Yujun Cai, Yuxuan Liang, Henghui Ding, Changhu Wang, and Bryan Hooi.
Time-Aware Neighbor Sampling for Temporal Graph Networks. 2021. arXiv: 2112.09845 [cs.SI].
[108] Yufeng Wang and Charith Mendis. “TGOpt: Redundancy-Aware Optimizations for Temporal Graph
Attention Networks”. In: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles
and Practice of Parallel Programming (2023).
[109] Jagath Weerasinghe, Francois Abel, Christoph Hagleitner, and Andreas Herkersdorf. “Enabling
FPGAs in hyperscale data centers”. In: Ubiquitous Intelligence and Computing and 2015 IEEE 12th
Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing
and Communications and Its Associated Workshops (UIC-ATC-ScalCom), 2015 IEEE 12th Intl Conf on.
IEEE. 2015, pp. 1078–1086.
[110] Douglas Brent West et al. Introduction to graph theory. Vol. 2. Prentice Hall, Upper Saddle River, NJ, 2001.
[111] “Whitepaper: NVIDIA NVLink High-Speed Interconnect: Application Performance”. url:
https://api.semanticscholar.org/CorpusID:18764353.
[112] Ronald J Williams. “Simple statistical gradient-following algorithms for connectionist
reinforcement learning”. In: Machine learning 8 (1992), pp. 229–256.
[113] Tianyu Wu, Shizhu He, Jingping Liu, Siqi Sun, Kang Liu, Qing-Long Han, and Yang Tang. “A brief
overview of ChatGPT: The history, status quo and potential future development”. In: IEEE/CAA
Journal of Automatica Sinica 10.5 (2023), pp. 1122–1136.
[114] Feng Xia, Zhen Chen, Wei Wang, Jing Li, and Laurence T Yang. “MVCWalker: Random walk-based
most valuable collaborators recommendation exploiting academic factors”. In: IEEE Transactions on
Emerging Topics in Computing 2.3 (2014), pp. 364–375.
[115] Feng Xia, Jiaying Liu, Hansong Nie, Yonghao Fu, Liangtian Wan, and Xiangjie Kong. “Random
Walks: A Review of Algorithms and Applications”. In: IEEE Transactions on Emerging Topics in
Computational Intelligence 4.2 (2019), pp. 95–107.
[116] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. “Inductive
representation learning on temporal graphs”. In: International Conference on Learning
Representations. 2020.
[117] Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and
Stefanie Jegelka. “Representation Learning on Graphs with Jumping Knowledge Networks”. In:
Proceedings of the 35th International Conference on Machine Learning. Ed. by Jennifer Dy and
Andreas Krause. Vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 5453–5462.
url: https://proceedings.mlr.press/v80/xu18c.html.
[118] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin. “Graph
embedding and extensions: A general framework for dimensionality reduction”. In: IEEE
Transactions on Pattern Analysis & Machine Intelligence 1 (2007), pp. 40–51.
[119] Liang Yang, Junhua Gu, Chuan Wang, Xiaochun Cao, Lu Zhai, Di Jin, and Yuanfang Guo. “Toward
unsupervised graph neural network: Interactive clustering and embedding via optimal transport”.
In: 2020 IEEE international conference on data mining (ICDM). IEEE. 2020, pp. 1358–1363.
[120] Minji Yoon, Théophile Gervet, Baoxu Shi, Sufeng Niu, Qi He, and Jaewon Yang.
“Performance-adaptive sampling strategy towards fast and accurate graph neural networks”. In:
Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021,
pp. 2046–2056.
[121] Jiaxuan You, Tianyu Du, and Jure Leskovec. “ROLAND: Graph Learning Framework for Dynamic
Graphs”. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining. 2022, pp. 2358–2366.
[122] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna.
“GraphSAINT: Graph Sampling Based Inductive Learning Method”. In: International Conference on
Learning Representations. 2020. url: https://openreview.net/forum?id=BJe8pkHFwS.
[123] Qingru Zhang, David Wipf, Quan Gan, and Le Song. “A biased graph neural network sampler with
near-optimal regret”. In: Advances in Neural Information Processing Systems 34 (2021),
pp. 8833–8844.
[124] Chenguang Zheng, Hongzhi Chen, Yuxuan Cheng, Zhezheng Song, Yifan Wu, Changji Li,
James Cheng, Hao Yang, and Shuai Zhang. “ByteGNN: Efficient Graph Neural Network Training at
Large Scale”. In: Proc. VLDB Endow. 15.6 (2022), pp. 1228–1242. issn: 2150-8097. doi:
10.14778/3514061.3514069.
[125] Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan, Zheng Zhang,
and George Karypis. “DistDGL: Distributed Graph Neural Network Training for Billion-Scale
Graphs”. In: 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures and Algorithms
(IA3) (2020). doi: 10.1109/ia351965.2020.00011.
[126] Da Zheng, Xiang Song, Chengru Yang, Dominique LaSalle, and George Karypis. “Distributed
Hybrid CPU and GPU Training for Graph Neural Networks on Billion-Scale Heterogeneous
Graphs”. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data
Mining. KDD ’22. Washington DC, USA: Association for Computing Machinery, 2022,
pp. 4582–4591. isbn: 9781450393850. doi: 10.1145/3534678.3539177.
[127] Yanping Zheng, Hanzhi Wang, Zhewei Wei, Jiajun Liu, and Sibo Wang. “Instant Graph Neural
Networks for Dynamic Graphs”. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining. KDD ’22. Washington DC, USA: Association for Computing Machinery,
2022, pp. 2605–2615. isbn: 9781450393850. doi: 10.1145/3534678.3539352.
[128] Yanping Zheng, Zhewei Wei, and Jiajun Liu. “Decoupled Graph Neural Networks for Large
Dynamic Graphs”. In: Proc. VLDB Endow. 16.9 (2023), pp. 2239–2247. issn: 2150-8097. doi:
10.14778/3598581.3598595.
[129] Hongkuan Zhou, Rajgopal Kannan, Ananthram Swami, and Viktor Prasanna. “HTNet: Dynamic
WLAN Performance Prediction using Heterogenous Temporal GNN”. In: IEEE INFOCOM 2023 -
IEEE Conference on Computer Communications (2023). doi: 10.1109/infocom53939.2023.10229047.
[130] Hongkuan Zhou, Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, and Carl Busart.
“Model-Architecture Co-Design for High Performance Temporal GNN Inference on FPGA”. In: 2022
IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2022, pp. 1108–1117. doi:
10.1109/IPDPS53621.2022.00111.
[131] Hongkuan Zhou, Da Zheng, Israt Nisa, Vasileios Ioannidis, Xiang Song, and George Karypis. “TGL:
A General Framework for Temporal GNN Training on Billion-Scale Graphs”. In: Proc. VLDB Endow.
15.8 (2022), pp. 1572–1580. issn: 2150-8097. doi: 10.14778/3529337.3529342.
[132] Rong Zhu. “Gradient-based sampling: An adaptive importance sampling for least-squares”. In:
Advances in neural information processing systems 29 (2016).
[133] Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. “Layer-dependent
importance sampling for training deep and large graph convolutional networks”. In: Advances in
neural information processing systems 32 (2019).