Hardware-software Codesign for Accelerating Graph Neural Networks on FPGA
by
Bingyi Zhang
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
August 2024
Copyright 2024 Bingyi Zhang
Dedication
To my parents for their sacrifices and support.
Acknowledgements
Pursuing a Ph.D. degree is a long and demanding journey. Throughout these five years, I have encountered
many individuals who have provided invaluable support. First and foremost, I would like to express my
deepest gratitude to my Ph.D. advisor, Professor Viktor Prasanna, who has been my guiding shepherd in
the field of high-performance computing. Professor Prasanna’s extensive knowledge and patient guidance,
especially in academic writing, have been instrumental in my development. His meticulous attention to
detail has been crucial in helping me produce numerous high-quality submissions. I would also like to
extend my sincere thanks to Professor Rajgopal Kannan, who has offered numerous insightful research
ideas and provided guidance during my summer internship. From Professor Kannan, I learned how to
tackle challenging research problems and structure compelling research papers. His mentorship has been
invaluable in shaping my approach to research and academic writing.
I would like to express my heartfelt gratitude to my qualifying exam committee members: Professor
Robin Jia, Professor Cauligi Raghavendra, and Professor Pierluigi Nuzzo. Their constructive feedback and
insightful suggestions have been invaluable to my ongoing research projects, significantly shaping their
direction and quality. I also wish to extend my sincere thanks to Professor Paul Bogdan and Professor
Weihang Wang for their interest in my Ph.D. thesis topic. Their encouragement and support have been a
great source of motivation for me, reinforcing my determination to complete my Ph.D. thesis. Their belief
in the importance of my research has been immensely uplifting and has driven me to strive for excellence
in my work.
I would like to extend my gratitude to Professor Jun Han for his invaluable mentorship during my
studies at Fudan University. Professor Han has been a torchbearer for me, guiding me along the path of
academic research. His valuable guidance and unwavering support played a pivotal role in my decision
to pursue a Ph.D. degree. Without his mentorship, I would not have had the confidence and determination to embark on this challenging journey. I would also like to thank Professor Xiaoyang Zeng for his
support during my time at Fudan University. Professor Zeng’s passion for research has always been a profound source of inspiration for me. His enthusiasm and dedication to advancing knowledge have deeply
influenced my own approach to research and have motivated me to strive for excellence in my academic
endeavors.
I would like to express my gratitude to Professor Victor Adamchik and Professor Shahriar Shamsian
for giving me the opportunity to teach the course CS570: Analysis of Algorithms. This teaching experience has significantly enhanced my teaching skills and deepened my understanding of algorithm analysis,
which has proven to be invaluable during my job search. I would also like to extend my thanks to Professor Marco Paolieri for allowing me to assist in teaching the course CS356: Introduction to Computer
Systems. Throughout this experience, I was deeply impressed by his exceptional programming skills and
his professional dedication to teaching. His commitment to student learning has inspired me and greatly
influenced my approach to education and mentorship.
Additionally, I would like to extend my heartfelt thanks to all my colleagues for their tremendous
support throughout my Ph.D. journey, including Sanmukh R. Kuppannagari, Ajitesh Srivastava, Hanqing Zeng, Chi Zhang, Kartik Lakhotia, Yuan Meng, Ta-Yang Wang, Sasindu Wijeratne, Tian Ye, Pengmiao
Zhang, Hongkuan Zhou, Yang Yang, Jason Lin, Ömer Faruk Akgül, Samuel Wiggins, Nikunj Gupta, Sachini
Wickramasinghe, Zhihan Xu, Yuhong Liu, Yuxin Yang, Dhruv Parikh, Gangda Deng, Xu Wang, Neelesh
Gupta, Jacob Fein-Ashley, Harshitha B R. First and foremost, I want to express my gratitude to all the members of Prasanna’s lab for fostering a joyful and collaborative environment. This supportive atmosphere
has been instrumental in my academic and personal growth. I am particularly grateful to Hanqing Zeng for
his mentorship on several research articles. His patient guidance has greatly improved my writing skills.
I also want to thank Sanmukh Rao Kuppannagari for his mentorship and invaluable discussions regarding
both research and career planning. His insights have been incredibly beneficial. My sincere thanks also go
to Hongkuan, Sasindu, Tian, Jason, Dhruv, and Sachini for their collaboration on various research articles.
Their professionalism and dedication have been crucial to the success of our submissions. I wish them all
the best in their future careers and look forward to seeing their continued success.
I would like to express my sincere gratitude to my friend Hanchen Ye. We have been best friends since
our undergraduate days at Fudan University, bonding over our shared interest in AI hardware acceleration. Our parallel journeys, from pursuing our bachelor’s degrees to applying for Ph.D. positions in the
United States, have strengthened our connection. Throughout our Ph.D. endeavors, we have consistently
supported and encouraged each other by sharing our experiences and challenges. Hanchen’s passion for
research and his rigorous approach have always been a source of inspiration for me. I wish him all the
best in his future career. I would like to thank my friends, Hanyuan Xiao and Wen Lin. My Ph.D. journey has become more colorful thanks to them. I would like to thank my friends Chen Li (Eight Hundred), Kewei Chen, and Chenglong Hu (HCL). Although we are not in the same place, their friendship has always been a source of support for me. I would like to express my gratitude to my friend Chenzhong Yin. When I first
arrived in Los Angeles, I knew nothing about living in the United States. It was Chenzhong who taught
me everything I needed to know.
Finally, I would like to extend my heartfelt gratitude to my parents, Yun Ou and Xinlang Zhang. Their
unwavering support has been the cornerstone of my five-year journey through my Ph.D., allowing me to
focus on my studies without the burden of financial worries. Beyond their material support, they have been
my rock during times of difficulty, offering steadfast emotional support and encouragement. Their selfless
love and constant belief in me have been invaluable. My parents’ unwavering faith in my abilities has
motivated me to overcome numerous challenges and strive for excellence. They have always been there
to celebrate my successes and console me in moments of disappointment, providing a sense of stability
and comfort that has been crucial to my well-being. I am profoundly grateful for their presence in my life.
Their sacrifices and enduring support have made this achievement possible, and I dedicate this milestone
to them with all my love and appreciation.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Graph Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 GNN-based Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Challenges and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Thesis Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.7 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Chapter 2: Background and Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Graph Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.1 Basics of Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 GNN Inference versus Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.1.3 GNN-based Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Field Programmable Gate Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Acceleration of Graph Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Computing Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Runtime System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.4 Hardware Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Chapter 3: GraphAGILE: Accelerator-compiler Codesign for High Performance GNN Inference . . 26
3.1 Computation Kernels and Primitives in GNN Inference . . . . . . . . . . . . . . . . . . . . 26
3.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.1 Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Compiler Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 IR Generation Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.4.2 Compilation Step 1: Computation Order Optimization . . . . . . . . . . . . . . . . 35
3.4.3 Compilation Step 2: Layer Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.4 Compilation Step 3: Data Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4.5 Compilation Step 4: Kernel Mapping and Task Scheduling . . . . . . . . . . . . . . 40
3.5 Hardware Accelerator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.1 Overview of Accelerator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.2 Data Format and Data Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.3 Instruction Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.3.1 High-level Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.5.3.2 Microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.5.4 Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.4.1 Adaptive Computation Kernel . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.4.2 Parallel On-chip Memory Access . . . . . . . . . . . . . . . . . . . . . . 52
3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6.1 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.6.2 Baselines, Benchmarks, Performance Metrics . . . . . . . . . . . . . . . . . . . . . 55
3.6.3 Execution Time and Size of Binary File . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6.4 Impact of Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6.5 Cross Platform Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6.6 Comparison with the State-of-the-art Accelerators . . . . . . . . . . . . . . . . . . 61
Chapter 4: Dynasparse: Accelerating GNN Inference through Dynamic Sparsity Exploitation . . . 64
4.1 Data Sparsity in GNN inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 GNN Acceleration Based on Data Sparsity . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.3.2 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.1 Intermediate Representation (IR) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.4.2 Compilation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.3 Data Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4.4 Execution Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Accelerator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.1 Data Format and Data Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.2 Microarchitecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.5.2.1 Agile Computation Module (ACM) . . . . . . . . . . . . . . . . . . . . . 74
4.5.2.2 Auxiliary Hardware Module (AHM) . . . . . . . . . . . . . . . . . . . . . 77
4.5.2.3 Double Buffering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Runtime System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6.1 Performance Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6.2 Dynamic Kernel-to-primitive Mapping . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.6.3 Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.8 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.8.1 Benchmarks and Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.8.2 Impact of Dynamic K2P Mapping Strategy . . . . . . . . . . . . . . . . . . . . . . . 86
4.8.3 Analysis of Compiler and Runtime System . . . . . . . . . . . . . . . . . . . . . . . 88
4.8.4 Comparison with the State-of-the-art . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Chapter 5: GCV-Turbo: End-to-end Acceleration of GNN-based Computer Vision Tasks . . . . . . 93
5.1 GNN-based Computer vision Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Profiling and Analysis of GNN-based CV . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3.2 Overview of GCV-Turbo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.4 Hardware Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.4.1 Computation Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.2 Instruction Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5.1 Intermediate Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.5.2 Compilation Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.5.3 Compiler Optimizations for GNN-based CV tasks . . . . . . . . . . . . . . . . . . . 105
5.5.3.1 Data Manipulation Layer Generation . . . . . . . . . . . . . . . . . . . . 106
5.5.3.2 Layer fusion for DM layer . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5.3.3 Uniform Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5.3.4 Data Layout Centric Mapping . . . . . . . . . . . . . . . . . . . . . . . . 107
5.5.3.5 Sparsity-aware Primitive Mapping . . . . . . . . . . . . . . . . . . . . . . 109
5.6 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7.1 Benchmarks, Baselines, and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.7.2 Comparison with CPU and GPU Implementations . . . . . . . . . . . . . . . . . . . 113
5.7.2.1 Evaluation on Scope 3 (GNN-based CV tasks) . . . . . . . . . . . . . . . 114
5.7.2.2 Evaluation on Scope 1 (CNNs) . . . . . . . . . . . . . . . . . . . . . . . . 115
5.7.2.3 Evaluation on Scope 2 (GNNs) . . . . . . . . . . . . . . . . . . . . . . . . 116
5.7.3 Impact of Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.7.4 Comparison with State-of-the-art Accelerators . . . . . . . . . . . . . . . . . . . . 117
5.7.4.1 Comparison with CNN Domain-specific Accelerators (DSAs) . . . . . . . 118
5.7.4.2 Comparison with GNN Accelerators . . . . . . . . . . . . . . . . . . . . . 119
Chapter 6: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.1 Broader Impacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
6.2.1 Accelerating Graph Neural Network on Heterogeneous Platforms . . . . . . . . . . 123
6.2.2 Accelerating Hybrid Machine Learning Models in Computer Vision Tasks . . . . . 124
6.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
List of Tables
2.1 Notations in graph neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Intermediate representation (IR) of a computation layer . . . . . . . . . . . . . . . . . . . . 30
3.2 Hardware resource utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Specifications of platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.4 Evaluated GNN models in the experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.5 Dataset Statistics (GraphAGILE) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.6 End-to-End latency, latency of compilation, latency of hardware execution . . . . . . . . . 58
3.7 Size (MB) of the generated binary files [Row 1-8], and the size (MB) of input graphs [Row 9] 59
3.8 Advantages of GraphAGILE over the state-of-the-art work . . . . . . . . . . . . . . . . . . 61
3.9 Comparison of TLoH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1 Meta data of a kernel in the IR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Buffer (data format) [data layout] requirement to store the input/output matrices for
executing Z = X × Y in the three execution modes . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Performance model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Specifications of platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5 Dataset Statistics (Dynasparse) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.6 Latency (ms) on the unpruned GNN models . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.7 Average speedup (geometric mean) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.8 Preprocessing time of the compiler (ms) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.9 Comparison of latency with the state-of-the-art GNN accelerators (using GCN model) . . . 91
5.1 Scope of various accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.2 Average speedup achieved by GCV-Turbo over various baselines within their specialized
scopes. Each entry represents the performance of GCV-Turbo divided by the performance
of the respective baseline. “Not supported" means that the scope is not supported by the
baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.3 Details of evaluated GNN-based CV tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.4 Statistics of the graphs in GNN-based CV tasks . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.5 Specifications of platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6 Model size in GNN-based CV tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.7 Speedup (batch-size-one latency) of GCV-Turbo over GPU on various portions of the
GNN-based CV tasks. For layout transformation, the speedup is ∞ because GCV-Turbo
completely eliminates its overhead. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.8 Speedup (batch-size-one latency) over CPU and GPU on various CNNs . . . . . . . . . . . 116
5.9 Speedup over CPU/GPU across various GNNs and graph datasets. [] denotes the speedup
of GCV-Turbo over CPU, while () denotes the speedup of GCV-Turbo over GPU. . . . . . . 117
5.10 Specifications of CNN/GNN accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.11 Comparison of inference throughput (images/second) with CNN DSAs on various CNN
models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.12 Comparison of hardware execution latency (ms) with state-of-the-art GNN accelerators . . 119
List of Figures
1.1 Examples of GNN-based applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Performance metric – latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1 GNN computation abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Modularized GNNs in GraphGym [117] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Diagram of Field Programmable Gate Array (FPGA) . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Diagram of CPU and GPU architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 GNN training acceleration on CPU and Multi-FPGA Heterogeneous Platform [77] . . . . . 22
3.1 System overview of GraphAGILE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2 GraphAGILE hardware system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Computation graph of the GNN in Listing 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Data partitioning and memory mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.5 Overview of hardware architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.6 GraphAGILE high-level instruction fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.7 GEMM between a block of feature matrix HB (stored in Feature Buffer) and a block of
weight matrix WB (stored in Weight Buffer) . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.8 Adaptive Computation Kernel (when psys = 8) with ISN and DSN. The interconnections
among ALUs are specified in Figure 3.9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.9 Datapath of GEMM mode, SpDMM mode, SDDMM mode . . . . . . . . . . . . . . . . . . . 50
3.10 Device map on Xilinx Alveo U250 FPGA board . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.11 Impact of computation order optimization on the latency of hardware execution (LoH) TLoH 57
3.12 Impact of layer fusion on the latency of hardware execution (LoH) TLoH . . . . . . . . . . . 57
3.13 Impact of computation and communication overlapping on the latency of hardware
execution (LoH) TLoH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.14 Comparison of end-to-end latency TE2E with DGL . . . . . . . . . . . . . . . . . . . . . . . 61
3.15 Comparison of end-to-end latency TE2E with PyG. Note that PyG-CPU cannot execute AP
due to out of memory. PyG-GPU cannot execute RE, YE, and AP due to out of memory.
Therefore, these results are not shown in the Figure. . . . . . . . . . . . . . . . . . . . . . . 61
4.1 Density and the visualization of graph adjacency matrix A of various graphs [49] . . . . . 65
4.2 Density of the feature matrices in the GCN model [66] . . . . . . . . . . . . . . . . . . . . 65
4.3 Overview of the proposed system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.4 Proposed workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Illustration of data and model partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.6 Diagram of a Computation Core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.7 Various execution modes of a Computation Core . . . . . . . . . . . . . . . . . . . . . . . . 73
4.8 Transforming dense format to sparse format . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.9 Layout (FPGA chip) and resource utilization of the proposed design on Xilinx Alveo U250.
The Computation Cores (CC0-CC6) are represented using different colors. . . . . . . . . . 84
4.10 IR of various GNN layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.11 Speedup of Dynamic over S1 when there are various sparsity (%) in the GNN weight
matrices (X-axis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.12 Speedup of Dynamic over S2 when there are various sparsity (%) in the GNN weight
matrices (X-axis) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.13 Overhead of runtime system on unpruned GNNs . . . . . . . . . . . . . . . . . . . . . . . . 89
4.14 Speedup over the CPU and GPU platforms (Some results are not shown due to out of
memory on CPU/GPU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.1 Examples of GNN-based CV tasks [43, 16, 142, 128] . . . . . . . . . . . . . . . . . . . . . . 94
5.2 Breakdown analysis of GNN-based CV tasks on state-of-the-art GPU (RTX A5000). The
details of the models and datasets are elaborated in Section 5.6. . . . . . . . . . . . . . . . . 96
5.3 Overview of GCV-Turbo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.4 Workflow of GCV-Turbo using the skeleton-based human action recognition [128] as an
example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.5 Architecture of hardware accelerator, and the basic computation primitives supported by
a PE. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 Data Manipulation between CNN and GNN layers . . . . . . . . . . . . . . . . . . . . . . . 106
5.7 Mapping a Conv layer to matrix operations . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.8 Device map on Alveo U250 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.9 Speedup (latency reduction) over CPU and GPU on GNN-based CV tasks . . . . . . . . . . 115
5.10 Proportion of hardware execution latency of various portions (CNN portion and GNN
portion) on GCV-Turbo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.1 Diagram of AMD-Xilinx ACAP platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2 Diagram of AMD Ryzen AI platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Abstract
Graph Neural Networks (GNNs) have revolutionized many real-world applications where data can be represented as graphs. These applications include recommendation systems, social networks, traffic prediction, computer vision tasks (e.g., the tasks in autonomous driving), etc. Many applications require
high-performance execution (e.g., low latency or high throughput) of GNNs. To this end, several state-of-the-art libraries (e.g., PyTorch Geometric and Deep Graph Library) have been developed for executing
GNNs on general-purpose processors, including CPUs and GPGPUs. However, current GNN libraries on
general-purpose processors achieve sub-optimal performance due to several challenges: 1. Irregular data
structures: graphs in real-world applications are highly unstructured, with uneven degree distribution.
Such irregularity leads to complex data access patterns. 2. Heterogeneous computation kernels: GNNs involve both sparse computation kernels (e.g., sparse-dense matrix multiplication) and dense computation
kernels (e.g., dense-dense matrix multiplication). While general-purpose processors are efficient for dense computations, their data path and memory hierarchy are inefficient for sparse computations. 3. Dynamic data
sparsity: In many applications, the graph connectivity and the data sparsity of vertex features are unknown before executing the GNN model. Therefore, a GNN system needs to deal with the data sparsity of
the graphs dynamically. Such dynamic data sparsity makes it difficult for the compiler and runtime system
to generate an optimal execution scheme for GNNs. 4. Mixture of models: Some GNN-based applications
use the combined strength of different machine learning models. For example, GNN-based computer vision
tasks utilize a mixture of convolutional neural networks (CNNs) and GNN models. Such a combination
leads to complex data flow.
In this dissertation, we address the above challenges through novel hardware-software codesigns on
Field Programmable Gate Array (FPGA). First, to address the challenges of irregular data structures and
heterogeneous computation kernels, we develop a hardware-software codesign on FPGA for GNN inference, named GraphAGILE. GraphAGILE incorporates a compiler and an accelerator design on FPGA. For
a given input GNN model and an input graph, the compiler translates them into the intermediate representation (IR). Then, the compiler performs several compiler optimizations and generates a sequence
of instructions for hardware execution. The hardware accelerator on FPGA executes various computation
kernels through flexible customized data paths and memory organization. Second, we propose Dynasparse,
an efficient codesign of runtime system and hardware to exploit the dynamic sparsity in GNN inference.
The hardware design of Dynasparse has a flexible data path to execute the computation kernels of various
data sparsity. The runtime system utilizes a theoretical performance model to dynamically map a GNN
computation kernel to the computation primitive based on data sparsity. Third, we propose GCV-Turbo,
a hardware-software codesign accelerating GNN-based computer vision (CV) tasks. GNN-based CV tasks
involve a mixture of GNN layers and CNN layers. To this end, GCV-Turbo incorporates novel compiler
optimizations to orchestrate the dataflow of two types of models. GCV-Turbo further incorporates novel
hardware mechanisms to facilitate efficient data layout transformation between different layers. Our implementations, based on this codesign methodology, achieve superior performance on various GNN-based
applications.
Chapter 1
Introduction
1.1 Graph Neural Network
Graphs are a fundamental data structure used to represent data in many real-world applications. A graph
G consists of a set of vertices V and a set of edges E. In recommendation systems, users and items can be
represented as vertices, while interactions between users and items form the edges. In social networks,
each person is a vertex, and relationships between individuals represent the edges. In traffic networks,
intersections are mapped to vertices, and roads are edges connecting these intersections. For proteins,
amino acids can be represented as nodes, with edges constructed based on the strength of interactions
between them. In point clouds, each sampled data point is a vertex, and each vertex connects to its k (k >
0) nearest neighbors to form a graph. These are just a few examples, but the applications are numerous
and varied.
Recently, graph neural networks (GNNs) [66, 53, 107, 126] have emerged as a revolutionary technique
for machine learning on graphs. GNNs have outperformed many traditional techniques, such as graph2vec
[88]. A GNN model $f(\cdot)$ operates on an input graph $G(V, E, X)$ to generate an embedding vector $h_v^{out}$ for each vertex $v \in V$.
and updating the embeddings of the vertices. The generated embedding vectors can then be used for
various downstream applications, such as node classification, link prediction, and graph classification.
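To make this concrete, the following minimal sketch (not taken from the dissertation) uses PyTorch Geometric, the library in which the user-defined GNN models of Chapter 3 are written, to build a two-layer GCN that maps a toy graph to one embedding vector per vertex. The graph, layer sizes, and random features are arbitrary assumptions chosen only for illustration.

```python
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv

# Toy graph: 3 vertices, 4 directed edges, 8-dimensional features per vertex (arbitrary sizes).
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 8)
graph = Data(x=x, edge_index=edge_index)

class TwoLayerGCN(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)   # first round of message passing
        self.conv2 = GCNConv(hidden_dim, out_dim)  # second round (2-hop neighborhood)

    def forward(self, data):
        h = self.conv1(data.x, data.edge_index).relu()
        return self.conv2(h, data.edge_index)      # one embedding vector per vertex

model = TwoLayerGCN(in_dim=8, hidden_dim=16, out_dim=4)
embeddings = model(graph)                          # shape: [|V|, 4]
```

The resulting per-vertex embeddings can then be fed to the downstream applications mentioned above, such as node classification or link prediction.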
1.2 GNN-based Applications
Graph neural networks have been applied to many real-world applications (as shown in Figure 1.1), including recommendation systems [5], scientific computing (e.g., physics [101], chemistry [30], biology [99]),
traffic networks [145], computer vision tasks [112], etc. In these applications, the data either have inherent
graph structure (e.g., recommendation systems) or can be manually constructed as graphs (e.g., GNN for
image segmentation [142]). In recommendation systems [41], GNNs generate embeddings that capture
the structural and feature information of each vertex and its neighbors. Then, the embedding vectors are
utilized for recommendation generation. In scientific computing, such as molecular property prediction
[9], molecules are represented as graphs. GNNs learn representations of molecules based on graph structure and atomic features, enabling the prediction of molecular properties. In traffic prediction [7], GNNs
learn the embeddings that capture both spatial and temporal information of the traffic networks. The
generated embeddings can be used to make predictions for future traffic conditions, such as traffic flow,
travel times, and incident probabilities. Notably, GNNs are also widely used in computer vision tasks, such
as few-shot learning image classification [42], multi-label image classification [16], image segmentation
[142], skeleton-based human action recognition [128].
Many GNN-based applications are data-intensive and computation-intensive and require high-performance processing. In recommendation systems, such as Pinterest [132], Amazon products [105], and Taobao [150], the graph consists of billions of vertices and edges. Such recommendation systems need to generate embeddings at high throughput periodically. For traffic prediction, such as Google Maps [29], GNNs are used
to predict the estimated time of arrival in real time. Low-latency inference is crucial to facilitate timely and
accurate prediction. In scientific computing, GNNs are utilized to discover new particles at the Large Hadron Collider [84]. The system uses GNNs to process billions of images of high-energy collisions every second and decides which images to keep. High-performance GNN inference is crucial for processing such large volumes of data. In computer vision (CV) systems, such as autonomous driving cars, GNNs enable many novel CV tasks. Low-latency GNN inference is crucial to ensure safety.

Figure 1.1: Examples of GNN-based applications (recommender systems, social networks, traffic prediction, circuit design, computer vision, protein networks)
1.3 Challenges and Motivation
To enable the execution of GNNs on general-purpose processors, several generic GNN libraries have been
developed, such as PyTorch Geometric (PyG) [91] and Deep Graph Library (DGL) [111]. However, these
libraries often achieve suboptimal performance. High-performance execution of GNNs faces several challenges:
Challenge 1 irregular data structures: Graphs in real-world applications are often unstructured
with uneven degree distributions. For example, many real-world graphs follow a power-law degree distribution, where a small number of vertices have a high degree. This irregularity poses challenges for hardware execution. Firstly, the uneven degree distribution can cause workload imbalance on general-purpose
processors, leading to underutilization of computational resources and reduced sustained performance.
Secondly, unlike the regular storage formats used for structured tensors (e.g., vectors, matrices), irregular graph data structures require complex storage formats (e.g., the coordinate (COO) format or compressed sparse row (CSR), also known as compressed row storage (CRS)). This results in irregular memory access patterns and low
memory bandwidth utilization.
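To make the storage formats concrete, the sketch below (an illustration of the standard COO and CSR formats using SciPy, on a hypothetical 4-vertex graph) shows how a sparse adjacency matrix is stored and how reading a vertex's neighbors becomes an indirect, data-dependent access into index arrays, which is precisely the irregular memory access pattern described above.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical 4-vertex graph with 5 directed edges (src -> dst).
src = np.array([0, 0, 1, 2, 3])
dst = np.array([1, 3, 2, 0, 2])
vals = np.ones(len(src), dtype=np.float32)

adj_coo = coo_matrix((vals, (src, dst)), shape=(4, 4))  # Coordinate (COO) format
adj_csr = adj_coo.tocsr()                               # Compressed Sparse Row (CSR)

# CSR stores three arrays; the neighbors of vertex v live in a variable-length slice.
print(adj_csr.indptr)   # row pointers: [0 2 3 4 5]
print(adj_csr.indices)  # column indices of the nonzeros: [1 3 2 0 2]
v = 0
neighbors = adj_csr.indices[adj_csr.indptr[v]:adj_csr.indptr[v + 1]]  # indirect access
print(neighbors)        # -> [1 3]
```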
Challenge 2 heterogeneous computation kernels: GNNs involve both sparse computation kernels
(e.g., sparse-dense matrix multiplication, sampled dense-dense matrix multiplication) and dense computation kernels (e.g., general matrix-matrix multiplication). These kernels exhibit heterogeneous computation
and memory access patterns. While general-purpose processors can efficiently handle dense computation
kernels through careful data reuse, their complex memory hierarchy and data path are inefficient for sparse
computations. For instance, a typical central processing unit (CPU) features a multi-level cache hierarchy,
where sparse computation kernels often lead to poor data reuse. General-purpose graphics processing
units (GPGPUs) are organized using a single instruction multiple data (SIMD) architecture. However, the
irregular computation patterns of sparse kernels are ill-suited to SIMD execution, resulting in inefficiencies.
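The contrast between the two kinds of kernels can be sketched in a few lines (a schematic illustration with hypothetical matrix sizes and a made-up 1% density, not a benchmark): the dense feature update XW is a regular general matrix-matrix multiplication, whereas the neighborhood aggregation AX with a sparse adjacency matrix A is a sparse-dense matrix multiplication whose memory accesses follow the nonzero pattern of A.

```python
import numpy as np
from scipy.sparse import random as sparse_random

n_vertices, feat_in, feat_out = 1000, 64, 32        # hypothetical sizes
A = sparse_random(n_vertices, n_vertices, density=0.01,
                  format="csr", dtype=np.float32)   # sparse adjacency (~1% nonzeros)
X = np.random.rand(n_vertices, feat_in).astype(np.float32)   # dense vertex features
W = np.random.rand(feat_in, feat_out).astype(np.float32)     # dense weight matrix

dense_update = X @ W       # GEMM: regular accesses, cache- and SIMD-friendly
aggregation  = A @ X       # SpDMM: gathers rows of X following A's nonzero pattern
```

Both products traverse the same memory hierarchy on a CPU or GPU, which is tuned for the first access pattern but not for the second; the codesigns in the later chapters instead provide customized data paths for each primitive.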
Challenge 3 dynamic data sparsity: For executing a machine learning model, such as a convolutional neural network (CNN), the machine learning system typically performs a compilation step first. This
process takes the computation graph of the model as input and generates an optimized execution scheme
for the target hardware platform (CPU, GPU). The compilation step relies on metadata of the input and
model, such as the shape of the input and the shape of the weight matrices. In the case of a CNN model,
the input is often an image with a fixed shape (e.g., for image classification tasks) or a predictable shape
(e.g., for image segmentation tasks). However, for a GNN model, the metadata of the input graph is unpredictable in many applications. Different graphs can have varying numbers of vertices and edges. Even
graphs with the same number of vertices and edges can have very different connectivity. This variability
makes it challenging for the compiler to generate an optimized execution scheme for GNNs at compile
time.
Challenge 4 mixture of models: Some GNN-based applications, such as those in computer vision,
utilize a combination of CNN and GNN models to leverage the strengths of both models. CNNs are adept
at capturing local patterns in data, while GNNs excel at learning relationships. However, this combination
presents several challenges for hardware acceleration: 1. Divergent Computation and Memory Access
Patterns: Different layers in CNNs and GNNs have very different computation and memory access patterns. The machine learning compiler must carefully schedule data flow to efficiently execute both types
of models. 2. Interleaved Layers: In GNN-based computer vision tasks, CNN and GNN layers can be interleaved. These different types of layers have distinct data layouts and formats, leading to significant
overhead for transforming the data layout and format. 3. Cache Organization: General-purpose processors have complex cache organizations, which can cause significant latency when executing GNN layers.
This high latency is unsuitable for latency-sensitive applications, such as autonomous driving. Efficiently
integrating and optimizing these mixed-model applications requires addressing these challenges to ensure
performance and responsiveness.
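As one hypothetical illustration of the layout mismatch between interleaved layers (the actual layouts and transformations handled by GCV-Turbo are described in Chapter 5), a CNN feature map is commonly stored as an (N, C, H, W) tensor, while a GNN layer expects a (number of vertices) x (feature dimension) matrix. Treating each spatial location as a vertex then requires a transpose-and-reshape step like the one below, which on general-purpose processors is pure data-movement overhead.

```python
import numpy as np

# Hypothetical CNN output: batch of 1, 256 channels, 32x32 spatial grid.
feature_map = np.random.rand(1, 256, 32, 32).astype(np.float32)   # (N, C, H, W)

# GNN view: one vertex per spatial location, channel vector as the vertex feature.
N, C, H, W = feature_map.shape
node_features = feature_map.transpose(0, 2, 3, 1).reshape(N * H * W, C)   # (|V|, C)
```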
The above challenges significantly hinder the practical deployment of GNN-based applications. Existing works focus on developing software libraries [91, 111, 57, 131] on general-purpose processors, or on developing dedicated hardware accelerators [45, 46, 127, 102, 2, 149, 144, 65] for GNN acceleration. However, prior works only partially address the challenges. The software libraries [91, 111, 57, 131] usually incorporate optimized implementations for the sparse computations of GNNs. While these software libraries [91, 111, 57, 131] lead to improved performance on general-purpose processors (GPPs), they can hardly overcome the inherent limitations of GPPs. The complex cache hierarchy of GPPs leads to poor data reuse, and the
fixed data path of GPPs is not flexible enough to execute the heterogeneous computation kernels of GNNs.
The prior hardware accelerators [45, 46, 127, 102, 2, 149, 144, 65] are designed either for specific GNN
models or specific input graphs. They are not flexible enough to support various GNN models or input
graphs. Some works [102, 73] develop design automation tools for generating optimized hardware accelerators for specific GNN models or input graphs. Regenerating hardware accelerators incurs significant
overhead due to hardware synthesis and place-and-route. Moreover, existing hardware accelerators ignore the mixture of models in many GNN-based tasks, limiting their real-world applicability. Given the limitations
of existing works, there is an urgent need for hardware-software co-design to comprehensively address
the challenges in accelerating GNN-based applications.
1.4 Thesis Scope
As graph neural networks (GNNs) gain increasing importance in various real-world applications, the demand for accelerating GNN-based applications becomes urgent. Previous efforts in GNN acceleration have
primarily focused on two fronts: refining software frameworks for general-purpose processors and designing specialized hardware architectures tailored to specific GNN models or input graphs. On the one hand,
optimizing software frameworks for general-purpose processors (GPPs) often yields only modest performance improvements due to the inherent limitations of GPPs, such as complex cache
hierarchies. On the other hand, existing hardware accelerators for GNNs are typically tailored to specific
models or input graphs and often lack corresponding software optimizations. Moreover, those hardware
accelerator designs cannot support a broad range of GNN models.
The scope of this dissertation concentrates on a hardware-software codesign approach to accelerate full
graph neural network inference, where the GNN model operates on the complete input graph. In hardware
design, we aim to develop a unified architecture capable of supporting a wide range of GNN models by
accommodating the heterogeneous computation kernels inherent in GNNs. Meanwhile, in software design,
our objective is to create an efficient compiler that optimizes the computation graph of GNN inference and
performs accelerator-specific hardware mapping for our developed hardware accelerator. Additionally,
we intend to devise an efficient runtime system that dynamically exploits data sparsity in GNN inference.
Through collaborative software and hardware codesign, our approach aims to support a broad range of
GNN models while achieving high performance.
1.5 Metrics
Corresponding to the above challenges, the objective of this dissertation is to achieve high performance for
GNN-based applications through the hardware-software codesign. We define the following performance
metrics for the codesign:
Latency: For evaluating the machine learning system of GNNs, the following kinds of latency (as
shown in Figure 1.2) are considered:
• End-to-end latency: End-to-end latency refers to the duration from the time when the input model
and input graph are given to the time when the results are obtained. For executing a GNN model, the
end-to-end latency usually involves the preprocessing latency, data movement latency, and hardware
execution latency.
• Preprocessing latency: For executing a GNN model, the preprocessing latency is included because
different input graphs can have different numbers of vertices and edges and different connectivity.
Preprocessing (e.g., compilation) is required to process the input graph and generate instructions for
hardware execution.
• Hardware execution latency: Hardware execution latency refers to the duration to execute the model
inference on the target hardware platform. It is the duration from the time when the input and
hardware instructions are stored in the memory to the time the results are obtained.
Note that there is also data communication latency because the preprocessing stage and the hardware execution stage may be executed on different hardware platforms; therefore, data must be communicated between these platforms.
Figure 1.2: Performance metric – latency. The timeline runs from when the input graph is given to when the embeddings are generated by the GNN, and comprises the preprocessing latency, data communication latency, and hardware execution latency.
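Expressed as a formula, the decomposition in Figure 1.2 is simply (here $T_{E2E}$ and $T_{LoH}$ follow the notation used later in Chapter 3, while $T_{pre}$ and $T_{comm}$ are shorthand introduced only for this summary):

$T_{E2E} = T_{pre} + T_{comm} + T_{LoH}$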
Energy Efficiency: In this dissertation, we target full graph inference where the GNN model operates on the complete input graph. Therefore, the energy efficiency refers to the energy consumption for
executing the GNN inference for the complete input graph. It is defined as energy consumption per input
graph (J/graph).
1.6 Contributions
In this dissertation, we propose hardware-software codesigns for high-performance graph neural network-based applications. The contributions are summarized as follows:
Contribution 1: We propose GraphAGILE, a comprehensive hardware-software codesign for high-performance GNN inference. GraphAGILE consists of a compiler and a hardware accelerator on FPGA. The compiler takes the user-defined GNN model (written in PyTorch Geometric [36]) and graph metadata as input and transforms them into a sequence of instructions. The compiler performs several optimizations during compilation to reduce the computation complexity and external memory traffic. The accelerator has a flexible data path to execute various computation primitives in GNNs. Our contributions include:
• We propose GraphAGILE, a comprehensive hardware-software codesign for high-performance GNN
inference. The techniques used in GraphAGILE address two main challenges: (1) irregular data
structures and (2) heterogeneous computation kernels, which have been elaborated in Section 1.3.
• The compiler design of GraphAGILE consists of (1) generic intermediate representations (IR) that can
represent the computation graphs of various GNN models and (2) several compiler optimizations,
including:
– computation order optimization that automatically reorders the computation graph to reduce the total computation complexity (a toy cost comparison follows this list).
– layer fusion that merges adjacent layers to communicate the inter-layer results through on-chip
memory, which reduces the total volume of external memory communication.
– graph partitioning that optimizes the intra-layer and inter-layer data communication under a
given on-chip memory constraint.
– kernel mapping and task scheduling that hide data communication latency and achieve dynamic load balance.
• The hardware design of GraphAGILE consists of (1) an instruction set that bridges the gap between
software design and the hardware accelerator, (2) a flexible data path and memory organization that
can execute various basic computation primitives using the same set of computation resources.
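To give a flavor of the computation order optimization listed above (a toy sketch with made-up sizes; GraphAGILE's actual cost model and reordering rules are presented in Chapter 3), consider a GCN-style layer that computes the product AXW of a sparse adjacency matrix, a dense feature matrix, and a dense weight matrix. The two parenthesizations (AX)W and A(XW) produce the same result but different operation counts, and the cheaper order depends on the number of edges and the feature dimensions.

```python
def matmul_flops(rows, inner, cols, nnz=None):
    """2*nnz*cols for a sparse-dense product, 2*rows*inner*cols for a dense one."""
    return 2 * (nnz if nnz is not None else rows * inner) * cols

# Hypothetical sizes: 10k vertices, 1M edges, 512 input features, 64 output features.
V, E, f_in, f_out = 10_000, 1_000_000, 512, 64

order1 = matmul_flops(V, V, f_in, nnz=E) + matmul_flops(V, f_in, f_out)    # (A X) W
order2 = matmul_flops(V, f_in, f_out) + matmul_flops(V, V, f_out, nnz=E)   # A (X W)
print(order1 / 1e9, order2 / 1e9)   # roughly 1.68 vs 0.78 GFLOPs: A (X W) wins here
```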
Contribution 2: We propose Dynasparse, a hardware-software codesign, which can efficiently exploit
the dynamic data sparsity in GNN inference. For the hardware design, we use Field Programmable Gate
Array (FPGA) as the target hardware platform. The programmability of FPGA allows us to (1) develop a
customized data path and memory organization to support various computation primitives, (2) develop
an efficient hardware mechanism for sparsity profiling and the transformation of data format and data layout,
and (3) implement a lightweight and customized soft processor to perform dynamic kernel-to-primitive
mapping at runtime. We summarize our main contributions as follows:
• We develop a complete system on FPGA with the following innovations in hardware design:
– a novel hardware architecture, named Agile Computation Module, consisting of multiple Computation Cores with flexible data path and memory organization that can execute computation primitives of various data sparsity, including GEMM (dense-dense matrix multiplication),
SpDMM (sparse-dense matrix multiplication) and SPMM (sparse-sparse matrix multiplication).
– an efficient hardware mechanism that supports fast sparsity profiling and data format/layout
transformation.
• We propose a soft processor and develop a runtime system on the soft processor to enable dynamic
sparsity exploitation, including:
– dynamic kernel-to-primitive (K2P) mapping strategy that automatically selects the optimal computation primitive for a given kernel based on an analytical performance model (a simplified illustration follows this list).
– task scheduling strategy that manages the execution of the computation primitives on the accelerator to achieve load balance across multiple Computation Cores in the FPGA accelerator.
• We implement the proposed codesign on a state-of-the-art FPGA, Xilinx Alveo U250. For various
GNN models and input graphs, the proposed accelerator and the dynamic kernel-to-primitive mapping reduce the inference latency by 3.73× on average compared with the static mapping strategies employed in the state-of-the-art GNN accelerators. Compared with state-of-the-art CPU (GPU)
implementations, Dynasparse achieves up to 56.9× (2.37×) speedup in end-to-end latency. Compared with state-of-the-art FPGA implementations, Dynasparse achieves 2.7× speedup in accelerator execution latency.
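The sketch below illustrates the idea behind the dynamic kernel-to-primitive mapping in a deliberately simplified form (the single density threshold is a made-up placeholder standing in for Dynasparse's analytical performance model, which is presented in Chapter 4): the operands are profiled after they arrive, and their measured density decides whether a matrix product is executed as GEMM, SpDMM, or SPMM.

```python
import numpy as np

def density(mat: np.ndarray) -> float:
    """Fraction of nonzero entries, profiled at runtime."""
    return float(np.count_nonzero(mat)) / mat.size

def map_kernel_to_primitive(x: np.ndarray, y: np.ndarray,
                            dense_threshold: float = 0.5) -> str:
    """Toy kernel-to-primitive (K2P) rule based on operand sparsity.

    The real runtime system ranks primitives with an analytical performance
    model; a single made-up density threshold stands in for that model here.
    """
    dx, dy = density(x), density(y)
    if dx >= dense_threshold and dy >= dense_threshold:
        return "GEMM"    # dense x dense
    if dx < dense_threshold and dy < dense_threshold:
        return "SPMM"    # sparse x sparse
    return "SpDMM"       # sparse x dense (or dense x sparse)

# Example: a sparse aggregation operand and a dense feature matrix.
A = (np.random.rand(100, 100) < 0.05).astype(np.float32)
X = np.random.rand(100, 16).astype(np.float32)
print(map_kernel_to_primitive(A, X))   # -> "SpDMM"
```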
Contribution 3: We propose GCV-Turbo, a hardware-software codesign for GNN-based computer vision
(CV) tasks. In this contribution, we identify a new and important application domain – GNN-based CV,
which utilizes both GNN models and CNN models. We identify the challenges (Section 1.3) for accelerating
GNN-based CV at both the hardware level and the software level. For the hardware design, we employ a resource-sharing technique in which the computation kernels of CNNs and GNNs share the same set of hardware resources. This maximizes hardware utilization and improves inference performance.
For the software design, we develop several compiler optimizations to optimize the computation graph of
the GNN-based CV task. Our results include:
• We propose GCV-Turbo, the first domain-specific accelerator for end-to-end acceleration of GNN-based CV tasks. This addresses the challenge that, in GNN-based CV tasks, a mixture of models can be used, leading to complex dataflow.
• We design a novel hardware architecture with a flexible data path and memory organization capable
of executing various computation kernels in CNN and GNN layers using the same set of hardware resources. This resource-sharing strategy maximizes the hardware utilization on FPGA and improves
the inference performance.
• We develop a customized compiler for end-to-end optimizations that reduce the inference latency of GNN-based CV tasks, including (1) optimizations for data manipulation between CNN layers and GNN layers, (2) data layout centric mapping, and (3) sparsity-aware computation primitive mapping.
• We deploy the proposed codesign on a state-of-the-art FPGA board and evaluate the codesign on six
representative GNN-based CV tasks. GCV-Turbo achieves significant latency reduction compared
with the state-of-the-art implementation on general-purpose processors.
1.7 Organization
The rest of the dissertation is organized as follows. In Chapter 2, we review the background of graph neural networks and field programmable gate arrays (FPGAs). In Chapter 3, we cover the proposed
accelerator-compiler codesign for high-performance GNN inference, named GraphAGILE. In Chapter 4, we
introduce Dynasparse, the codesign of a runtime system and hardware for exploiting dynamic data sparsity. In Chapter 5, we elaborate on the proposed codesign GCV-Turbo, which is developed for GNN-based
computer vision tasks. In Chapter 6, we conclude the dissertation and propose several future directions.
Chapter 2
Background and Related Works
2.1 Graph Neural Network
In this section, we first review the basics of graph neural networks (GNNs) in Section 2.1.1. We discuss the difference between GNN inference and training from an acceleration perspective in Section 2.1.2. We introduce GNN-based applications in Section 2.1.3. Finally, we cover the background of Field Programmable Gate Arrays (FPGAs) in Section 2.2.
2.1.1 Basics of Graph Neural Networks
Graph Neural Networks (GNNs) have revolutionized many real-world applications where the data can be represented as graphs. An input graph to a GNN is denoted as $G(V, E, X^0)$, where $V$ represents the set of vertices and $E$ represents the set of edges. $X^0 \in \mathbb{R}^{|V| \times f}$ denotes the vertex feature matrix, with each row being the feature vector of a vertex. A GNN model consists of a stack of GNN layers. Each GNN layer performs message passing on $G$, where each vertex aggregates information from its neighbors. Thus, a multi-layer GNN model recursively performs such message passing on multi-hop neighbors. According to [36, 111], a GNN layer can be abstracted as:
Edge-wise: $m_e^l = \phi\big(h_u^{l-1}, h_v^{l-1}, w_e^{l-1}\big), \quad \forall\, e(u, v) \in E$  (2.1)

Node-wise: $h_v^l = \psi\big(h_v^{l-1}, \rho(\{m_e^l : e(u, v) \in E\})\big)$  (2.2)
where $\phi(\cdot)$ is the message function. Each edge uses $\phi(\cdot)$ to generate a message by combining the edge weight $w_e^{l-1}$ with the features of its incident vertices. $\psi(\cdot)$ is the update function. Each vertex uses $\psi(\cdot)$ to update its features by aggregating the incoming messages using the reduction function $\rho(\cdot)$. In GNNs, the message/update functions are parameterized by neural network modules [53], such as multi-layer perceptrons.
Figure 2.1: GNN computation abstraction
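A direct, deliberately unoptimized rendering of Equations 2.1 and 2.2 in Python may help fix the abstraction; the toy graph, the additive message function, the mean reduction, and the averaging update below are arbitrary stand-ins for the learnable $\phi(\cdot)$, $\rho(\cdot)$, and $\psi(\cdot)$.

```python
import numpy as np

# Toy graph: edge list e(u, v) with unit edge weights, 4 vertices, 3-dim features.
edges = [(0, 1), (1, 2), (2, 1), (3, 1)]
h = np.random.rand(4, 3)                       # h^{l-1}: one row per vertex
w_e = {e: 1.0 for e in edges}                  # edge weights

def phi(h_u, h_v, w):                          # message function (Eq. 2.1)
    return w * (h_u + h_v)                     # a simple, hand-picked choice

def rho(messages):                             # reduction function
    return np.mean(messages, axis=0) if messages else np.zeros(h.shape[1])

def psi(h_v, aggregated):                      # update function (Eq. 2.2)
    return 0.5 * h_v + 0.5 * aggregated        # again a hand-picked choice

# Edge-wise stage: compute one message per edge.
m = {(u, v): phi(h[u], h[v], w_e[(u, v)]) for (u, v) in edges}

# Node-wise stage: each vertex aggregates its incoming messages and updates itself
# (vertices with no incoming edges simply receive a zero aggregate here).
h_next = np.stack([
    psi(h[v], rho([m[(src, dst)] for (src, dst) in edges if dst == v]))
    for v in range(h.shape[0])
])
```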
Table 2.1: Notations in graph neural networks
Notation | Description
$G(V, E, X^0)$ | input graph
$V$ | set of vertices
$E$ | set of edges
$h_i^l$ | feature vector of $v_i$ at layer $l$
$v_i$ | the $i$-th vertex
$e(i, j)$ | edge from $v_i$ to $v_j$
$L$ | number of GNN layers
$m_i^l$ | aggregated message by vertex $v_i$
We introduce some well-known GNN models as follows:
GCN [66]: Graph Convolutional Network (GCN) is proposed in [66]. Each GCN layer is defined as

$m_i^l = \mathrm{Sum}\big\{\alpha_{ji} \cdot h_j^{l-1} : j \in N(i) \cup \{i\}\big\}, \qquad h_i^l = \mathrm{ReLU}\big(m_i^l W^l\big)$  (2.3)

where $l$ denotes the $l$-th layer, $\alpha_{ji} = \frac{1}{\sqrt{D(j) \cdot D(i)}}$ ($D(j)$ is the degree of $v_j$), $W^l$ denotes the weight matrix of layer $l$, and $N(i)$ denotes the set of neighbors of $v_i$.
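In matrix form, Equation 2.3 for all vertices at once reads $H^l = \mathrm{ReLU}(\hat{A} H^{l-1} W^l)$, where $\hat{A}$ collects the coefficients $\alpha_{ji}$ (in the original GCN formulation, $\hat{A} = \tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2}$ with self-loops added). The sketch below (an arbitrary ring graph and random weights in place of trained ones) makes explicit that the aggregation is a sparse-dense product while the feature update is a dense GEMM, i.e., the two kernel types discussed in Chapter 1.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, identity

def gcn_layer(adj: csr_matrix, h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """One GCN layer as a toy reference: ReLU(normalized(A + I) @ H @ W)."""
    a_tilde = adj + identity(adj.shape[0], format="csr")      # add self-loops
    deg = np.asarray(a_tilde.sum(axis=1)).flatten()
    d_inv_sqrt = diags(1.0 / np.sqrt(deg))
    a_hat = d_inv_sqrt @ a_tilde @ d_inv_sqrt                 # normalized adjacency (sparse)
    return np.maximum(a_hat @ (h @ w), 0.0)                   # dense GEMM, then SpDMM

# Hypothetical input: 5 vertices on a ring, 4-dim features, 2 output features.
rows = np.array([0, 1, 2, 3, 4, 1, 2, 3, 4, 0])
cols = np.array([1, 2, 3, 4, 0, 0, 1, 2, 3, 4])
adj = csr_matrix((np.ones(10, dtype=np.float32), (rows, cols)), shape=(5, 5))
H = np.random.rand(5, 4)
W = np.random.rand(4, 2)
print(gcn_layer(adj, H, W).shape)   # -> (5, 2)
```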
GAT [107]: Graph Attention Network (GAT) is proposed in [107], inspired by the successful application of the attention mechanism in sequence-based tasks. Equipped with the attention mechanism, GAT is expressed as:

$m_i^l(k) = \mathrm{Sum}\big\{\alpha_{ij}^{k} \times h_j^{l-1} : j \in N(i) \cup \{i\}\big\}, \qquad h_i^l = \big\Vert_{k=1}^{K}\, \sigma\big(W^l(k)\, m_i^l(k)\big)$  (2.4)

In addition, GAT applies the attention mechanism to calculate the edge weight $\alpha_{ij}$ dynamically:

$\alpha_{ij} = \dfrac{\exp\big(\mathrm{LReLU}(\langle a_{att}, [W_{att} h_i \,\Vert\, W_{att} h_j]\rangle)\big)}{\sum_{k \in N(i)} \exp\big(\mathrm{LReLU}(\langle a_{att}, [W_{att} h_i \,\Vert\, W_{att} h_k]\rangle)\big)}$  (2.5)

where $a_{att}$ is an attention vector, $W_{att}$ is an attention matrix, and $\langle \cdot, \cdot \rangle$ is the vector inner product operator.
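The attention coefficients in Equation 2.5 are a softmax over each vertex's neighborhood. The sketch below computes them for a single vertex and a single head with randomly initialized $a_{att}$ and $W_{att}$ (all dimensions are arbitrary), simply to show that, unlike the fixed coefficients $\alpha_{ji}$ of GCN, these edge weights depend on the vertex features and must therefore be computed at inference time.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

rng = np.random.default_rng(0)
f_in, f_out = 8, 4
W_att = rng.normal(size=(f_out, f_in))       # attention weight matrix
a_att = rng.normal(size=2 * f_out)           # attention vector

h = rng.normal(size=(5, f_in))               # features of vertex i and its 4 neighbors
i, neighbors = 0, [1, 2, 3, 4]

# Unnormalized scores <a_att, [W_att h_i || W_att h_j]> passed through LeakyReLU (Eq. 2.5).
scores = np.array([
    leaky_relu(a_att @ np.concatenate([W_att @ h[i], W_att @ h[j]]))
    for j in neighbors
])
alpha = np.exp(scores) / np.exp(scores).sum()   # softmax over the neighborhood
print(alpha, alpha.sum())                       # attention weights; they sum to 1
```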
GIN: Graph Isomorphism Network (GIN) is proposed in [126], and the authors prove that GIN is as powerful as the Weisfeiler-Lehman graph isomorphism test [114]. GIN has the following expression:

$m_i^l = \mathrm{Sum}\big\{(1 + \epsilon \cdot \mathbb{1}[i = j]) \times h_j^{l-1} : j \in N(i) \cup \{i\}\big\}, \qquad h_i^l = \mathrm{ReLU}\big(m_i^l W^l + b^l\big)$  (2.6)
In addition, many other GNN models have been proposed following the recursive message-passing paradigm. Recently, [133] proposes the GraphGym library and defines the general design space of a GNN, as shown in Figure 2.2. The design space includes intra-layer design and inter-layer design: the intra-layer design follows the message-passing paradigm defined in Equations 2.1 and 2.2, while the inter-layer design adds residual connections across the GNN layers.
Figure 2.2: Modularized GNNs in GraphGym [117]
2.1.2 GNN Inference versus Training
The deployment of GNN usually involves two phases – training and inference:
GNN training: During training, the Graph Neural Network (GNN) learns the parameters (weights) of its layers through supervised, semi-supervised, or unsupervised learning. This process generally involves (1) sampling subgraphs from the graph of the application to form the input graph, (2) forward propagation of the input graph through the network, (3) computing the loss function, and (4) backpropagating the gradients to update the parameters using optimization algorithms such as gradient descent. The goal of training is to optimize the network's parameters so that the model generalizes effectively to unseen data and makes accurate predictions.
GNN inference: Inference, on the other hand, involves using the trained GNN to make predictions
on new, unseen data. During inference, the input graph is fed forward through the trained network, and
the output is obtained without updating the network’s parameters. The trained GNN applies the learned
patterns and relationships from the training data to make predictions or perform tasks on new data.
GNN training and inference have different computation characteristics and challenges; they usually target different performance metrics and require different acceleration techniques. In this dissertation, we focus on the acceleration of GNN inference.
2.1.3 GNN-based Applications
Graph neural networks (GNNs) have been widely used in various application domains [147, 119, 51, 143,
110], including but not limited to graph mining, physics, chemistry, biology, knowledge graphs, combinatorial optimization, traffic networks, recommendation systems, etc. GNNs have become popular due to their ability to effectively model data with complex relationships and dependencies. For social network analysis,
GNNs are used for tasks such as node classification, community detection, link prediction, and recommendation systems. For bioinformatics, GNNs have been applied for protein-protein interaction prediction,
drug discovery, protein function prediction, and analyzing molecular structures. In recommendation systems, GNNs are utilized to model user-item interactions on various platforms such as e-commerce, social
media, and content streaming services. In knowledge graphs, GNNs are used to reason over knowledge
graphs, which represent structured information about entities and their relationships. In natural language
processing (NLP), GNNs are increasingly being integrated into NLP tasks like document classification,
semantic parsing, named entity recognition, and sentiment analysis, especially when data can be represented as graphs such as dependency trees or co-occurrence graphs. For financial fraud detection, GNNs are employed to detect fraud in financial transactions by modeling the complex relationships between
different entities involved. In computer vision, GNNs are used for tasks like image segmentation, object
tracking, and scene understanding, where data can be represented as graphs of pixels or regions. In traffic
prediction, GNNs are applied in transportation networks to model traffic patterns, predict congestion, and
optimize traffic flow.
From a model architectural perspective, the models used in GNN-based applications can be categorized
into two kinds: GNN-only models and GNN-based hybrid models. GNN-only models involve only GNN
layers, such as the models used in recommendation systems and social networks. GNN-based hybrid
models not only have GNN layers but also consist of other types of layers, such as the convolutional layers
used in convolutional neural networks (CNNs). For example, GNN-based computer vision tasks utilize the
combination of CNN layers and GNN layers, which enables many novel computer vision tasks [8, 61, 92].
In this dissertation, contribution 1 (GraphAGILE, Chapter 3) and contribution 2 (Dynasparse, Chapter 4) target applications that use GNN-only models. Contribution 3 (GCV-Turbo, Chapter 5) targets GNN-based computer vision tasks that utilize GNN-based hybrid models.
2.2 Field Programmable Gate Array
Field Programmable Gate Array (FPGA) has been widely used for accelerating various applications [21, 28],
such as machine learning (ML) [130, 50], bioinformatics [58], data processing, graph processing [148], image processing [35], networking [60], scientific computing [89], security, video processing, etc. As shown
in Figure 2.3, an FPGA board has a set of programmable hardware resources, including Look-up Tables
(LUTs), digital signal processing units (DSPs), block random access memory (BRAM), programmable interconnections, etc. The programmability of FPGA allows the users to implement customized hardware
data paths for their target applications. [141, 50, 6] utilize the FPGA for accelerating deep convolutional
neural networks. [148, 14, 56] accelerate a set of graph analytics algorithms on FPGA, such as PageRank, Sparse-matrix vector multiplication (SPMV), single source shortest path (SSSP), etc. [20, 18] develop
customized hardware modules on FPGA for image processing. Compared with general-purpose processors (CPU and GPGPU), developing customized dataflow architecture on FPGA enables low latency and
energy-efficient processing. Moreover, designing an Application Specific Integrated Circuit (ASIC) for an
application is time-consuming, leading to a long time-to-market. In contrast, FPGA allows quick deployment of the customized design, and its programmability enables the reconfiguration of updated designs.
Figure 2.3: Diagram of Field Programmable Gate Array (FPGA)
2.3 Acceleration of Graph Neural Networks
Acceleration of graph neural networks (GNNs) is an ongoing research area that has gained significant attention. In Section 2.3.1, we introduce the computing platforms for GNN acceleration used by the research
community. In Section 2.3.2 and Section 2.3.3, we discuss the existing software optimizations for GNN
acceleration, including the optimizations for compilers and runtime systems. In Section 2.3.4, we cover
the existing hardware accelerator designs for GNNs.
2.3.1 Computing Platforms
Existing works for accelerating graph neural networks operate on various hardware computing platforms. These hardware platforms include (1) CPU (Figure 2.4), (2) general purpose graphic processing unit
(GPGPU) (Figure 2.4), (3) FPGA, (4) heterogeneous platforms, and (5) distributed platforms.
GNNs on CPU: There are several existing frameworks supporting GNNs on multi-core CPU, including
PyTorch Geometric (PyG) [36], Deep Graph Library (DGL) [39], Graphite [47], and Argo [75]. These
frameworks are built upon low-level math kernel libraries (e.g., Intel MKL [109]) on the CPU. Accelerating GNNs on multi-core CPUs brings several benefits: (1) the graphs of real-world applications are usually stored in the external memory of the multi-core CPU, so the CPU can access the input graph with low latency; (2) the multi-core CPU is suitable for the complex graph sampling processes used in GNN training. Nevertheless, the multi-core CPU has several limitations: (1) it has low peak performance and memory bandwidth, while GNN inference is computation intensive and requires large memory traffic; (2) it has a complex cache hierarchy, leading to poor data reuse and large memory access latency.
GNNs on GPGPU: General purpose GPUs (GPGPUs) are popular platforms for accelerating GNNs because GPUs provide massive computation parallelism and high external memory bandwidth. There are many state-of-the-art frameworks for accelerating GNNs on GPGPUs, including PyTorch Geometric (PyG) [36], Deep Graph Library (DGL) [39], GNNAdvisor [113], Gnnmark [4], Pagraph [80], fuseGNN [17], Ge-spmm [57], etc. These frameworks usually utilize optimized kernel libraries (e.g., cuBLAS [3]) or build customized kernel functions on GPUs using parallel programming languages (e.g., CUDA). While GPUs lead to improved performance compared with multi-core platforms, GPUs still have several limitations: (1) GNNs have irregular computation patterns, and the data parallelism (single instruction multiple data, SIMD) exploited by GPUs can lead to hardware underutilization; (2) GPUs have a complex cache hierarchy, leading to poor on-chip data reuse.
Figure 2.4: Diagram of CPU and GPU architecture
GNNs on FPGA: A series of studies [73, 102] have advanced Graph Neural Networks (GNNs) on FPGA
platforms. Leveraging FPGA’s programmability, these studies tailor data paths and memory configurations
to optimize GNN inference kernels. Notable examples, such as Deepburning-GL [73] and FlowGNN [102],
are emblematic of design automation frameworks. These frameworks, given an input graph and GNN
model, automatically generate optimized accelerators on FPGA. However, they necessitate regeneration
of accelerators for varied input graphs and models. This process involves FPGA synthesis, placement,
routing, and bitstream reloading, incurring significant overhead. There remains an urgent need for a
comprehensive software-hardware codesign that offers flexibility across a wide array of input graphs and
GNN models.
GNNs on heterogeneous platforms: Heterogeneous platforms, such as CPU+GPU and CPU+FPGA platforms, are widely used for GNN training. GNN training involves various computation kernels, including graph sampling, feature aggregation, and feature update. Graph sampling involves complex operations, such as random number generation, which are suitable for CPU execution. Feature aggregation and feature update are computationally intensive and are suitable for execution on GPU and FPGA. For example, GraphACT [135], HP-GNN [78], and HyScale-GNN [76] are state-of-the-art frameworks for GNN training acceleration on CPU+FPGA heterogeneous platforms. Additionally, state-of-the-art libraries, including PyTorch Geometric (PyG) [36] and Deep Graph Library (DGL) [39], provide support for GNN training on CPU+GPU heterogeneous platforms.
Figure 2.5: GNN training acceleration on CPU and Multi-FPGA Heterogeneous Platform [77]
GNNs on distributed platforms: Distributed platforms, integrating multiple computation nodes, are
well-suited for large-scale Graph Neural Network (GNN) training. Numerous frameworks facilitate GNN
training on distributed platforms, including DistDGL [146], DistGNN [85], P3 [40], Aligraph [150], among
others. To harness computation parallelism across multiple nodes, these frameworks typically employ
various graph partitioning strategies. These strategies aim to partition the input graph across the nodes efficiently while minimizing inter-node data communication.
2.3.2 Compiler Optimizations
For efficient execution of Graph Neural Networks (GNNs), various compiler optimizations are proposed
in existing software libraries. PyTorch Geometric (PyG) [36] integrates PyTorch’s compilation framework,
leveraging TorchDynamo to capture GNN computation graphs and synchronize computation kernels with
PyTorch operators defined in the PrimTorch library [120]. In the compilation process, PyG utilizes TorchInductor as the default deep learning compiler, generating swift code tailored for backend accelerators. For
training, PyG employs Ahead-of-Time (AOT) autograd to capture the backpropagation computation graph.
SparseTIR [131] introduces a novel intermediate representation (IR) with adaptable formats and transformation techniques to accelerate sparse operators in GNNs. By decomposing formats and specifying
schedules, SparseTIR achieves notable speedups ranging from 1.08 to 1.52× for GNN training on general
purpose processors. Additionally, SparseTIR develops a performance-tuning system that explores the parameter space of potential formats and transformations. FusedMM [97] consolidates the two computation
kernels of GNNs – sampled dense-dense matrix multiplication (SDDMM) and sparse dense matrix multiplication (SpMM) – resulting in accelerated GNN training and reduced GPU memory usage. Moreover,
Seastar [121] and Graphiler [124] compile user-defined message-passing functions into their intermediate representations (IR) and subsequently optimize the IR to emit target-specific, template-based code. In
summary, the compiler optimizations employed by GNN frameworks primarily target general-purpose
processors such as CPUs and GPGPUs. However, due to the inherent limitations of general-purpose processors (e.g., complex cache hierarchies), these compiler optimizations achieve limited performance improvement.
2.3.3 Runtime System Design
The input graphs for Graph Neural Networks (GNNs) can vary in terms of the number of vertices and
edges, presenting a challenge for compilers due to the unknown metadata at compile time. Consequently,
achieving data-dependent optimizations becomes difficult. Hence, many GNN frameworks resort to runtime system optimizations to enhance performance. GNNAdvisor [113] pioneers customized GNN kernels
and CUDA runtime for optimizing GNN workloads dynamically. It incorporates runtime optimizations
such as node renumbering and memory customization, specifically tailored to enhance GNN performance
on GPUs. Similarly, Aligraph [150] introduces several runtime optimizations, including caching critical
vertices, to boost memory performance dynamically. In summary, previous runtime optimizations have
primarily focused on optimizing memory access for GNNs, such as renumbering vertices to enhance data
locality and caching crucial vertices. However, none of the existing approaches address the challenge of
dynamic data sparsity in GNN inference, thereby limiting their potential for performance improvement.
2.3.4 Hardware Accelerator
The research community has proposed various hardware accelerators for GNN inference. Existing GNN
accelerators can be categorized following the taxonomy below:
Training accelerator vs inference accelerator: A GNN-based application comprises two main phases:
training and inference. Different accelerators are tailored to optimize performance in each phase. During GNN training, tasks such as graph sampling, forward propagation, and backpropagation are executed. Notable accelerators designed for this phase include GraphACT [135], HP-GNN [78], Rubik [13],
and FlashGNN [90]. These accelerators are particularly adept at enhancing sampling-based GNN training
methods, such as subgraph sampling [136] and node sampling [53]. Conversely, GNN inference accelerators are optimized solely for the inference phase, which exclusively involves forward propagation. Among
the notable works in this domain are HyGCN [127], AWB-GCN [45], I-GCN [46], BoostGCN [137], GRIP
[65], GCNAX [71], FlowGNN [102], Engn [74], G-CoS [144], Deepburning-GL [73], among others.
Hybrid accelerator and unified accelerator: Based on the architecture, we can classify the existing
GNN hardware accelerator into hybrid accelerator and unified accelerator. GNNs involve heterogeneous
computation kernels – sparse computation kernels (e.g., sparse dense matrix multiplication and sampled dense-dense matrix multiplication) and dense computation kernels. Hybrid accelerators utilize separate computation resources for different computation kernels. Representative hybrid accelerators include
HyGCN [127], BoostGCN [137], FlowGNN [102]. In contrast, a unified accelerator utilizes the same set of
computation resources for different computation kernels in GNN. Representative works include AWB-GCN
[45], I-GCN [46], GCNAX [71]. To execute different kernels on the same set of computation resources, [45,
46, 71] treat both feature aggregation and feature update of GNNs as sparse matrix multiplication (SPMM).
Therefore, their unified accelerator designs can support the execution of GNNs.
Chapter 3
GraphAGILE: Accelerator-compiler Codesign for High Performance
GNN Inference
This Chapter introduces our accelerator-compiler codesign, GraphAGILE, for accelerating GNN-based applications that utilize only GNN models. GraphAGILE comprises a compiler and a hardware accelerator
design. For the hardware design, we propose a novel hardware architecture that can execute various computation kernels of GNNs. For the compiler design, we propose the general intermediate representations
for GNNs and develop several compiler optimizations to reduce the inference latency. The compiler and
the hardware accelerator work collaboratively to support a broad range of GNN models and achieve high
performance.
3.1 Computation Kernels and Primitives in GNN Inference
The basics of graph neural networks (GNNs) have been introduced in Section 2.1.1, and GNN inference has been introduced in Section 2.1.2. Essentially, GNNs consist of several computation kernels: (1) feature aggregation, in which each vertex aggregates the feature vectors of its neighbors, (2) feature transformation, in which each vertex feature vector is updated by a multi-layer perceptron (MLP), and (3) the attention mechanism, in which the weight of each edge is calculated based on the dot product of the vertex feature vectors. These computation kernels can be mapped to the basic computation primitives: (1) general dense-dense matrix multiplication (GEMM), (2) sparse-dense matrix multiplication (SpDMM), (3) sparse-sparse matrix multiplication (SPMM), and (4) sampled dense-dense matrix multiplication (SDDMM). We introduce the computation primitives as follows:
• General dense-dense matrix multiplication (GEMM): Suppose the input matrices are $X \in R^{m \times n}$ and $Y \in R^{n \times f}$ and the output matrix is $Z \in R^{m \times f}$. GEMM is represented as $Z = X \times Y$. For GEMM, the input matrices X and Y are viewed as dense matrices, where most of the entries in the matrices are non-zero.
• Sparse-dense matrix multiplication (SpDMM): Similar to GEMM, SpDMM can also be represented as $Z = X \times Y$, where $X \in R^{m \times n}$, $Y \in R^{n \times f}$, and $Z \in R^{m \times f}$. Different from GEMM, X is viewed as a sparse matrix and Y is viewed as a dense matrix. For a sparse matrix, most of the entries are zero.
• Sparse-sparse matrix multiplication (SPMM): SPMM can be represented as $Z = X \times Y$, where $X \in R^{m \times n}$, $Y \in R^{n \times f}$, and $Z \in R^{m \times f}$. Both input matrices X and Y are viewed as sparse matrices.
• Sampled dense-dense matrix multiplication (SDDMM): According to [36], in edge-wise computation (Equation 2.1), many GNN models calculate edge weights using the dot product of the feature vectors of the source and destination vertices. The above computation process corresponds to the SDDMM operation $Z = S \odot (XY)$, where $\odot$ is the element-wise multiplication. Moreover, $X \in R^{m \times n}$, $Y \in R^{n \times f}$, $S \in R^{m \times f}$, and $Z \in R^{m \times f}$. Sampled means that the required results are sampled from $(XY)$ based on the non-zero elements in S. For each non-zero element $S_{i,j}$, we calculate $Z_{i,j} = \langle X_i, Y_j \rangle$, where $X_i$ is the i-th row of X and $Y_j$ is the j-th column of Y. Therefore, the basic operation in SDDMM is the vector inner product (see the illustrative sketch following this list).
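To make the SDDMM primitive concrete, the sketch below computes Z = S ⊙ (XY) only at the non-zero positions of S using SciPy's COO format, so each output entry is a single vector inner product (scaled by $S_{i,j}$; with a binary mask S this reduces to $\langle X_i, Y_j \rangle$). It is an illustration of the primitive, not GraphAGILE code.

import numpy as np
import scipy.sparse as sp

def sddmm(S, X, Y):
    # Sampled dense-dense matrix multiplication: Z = S (element-wise) (X @ Y).
    # Only the positions where S is non-zero are computed; each one is the
    # inner product <X_i, Y_j> scaled by the value S_ij.
    S = S.tocoo()
    vals = np.empty_like(S.data)
    for k, (i, j) in enumerate(zip(S.row, S.col)):
        vals[k] = S.data[k] * np.dot(X[i, :], Y[:, j])
    return sp.coo_matrix((vals, (S.row, S.col)), shape=S.shape)

# Example: X is m x n, Y is n x f, S is an m x f sparse sampling matrix
X, Y = np.random.rand(4, 3), np.random.rand(3, 4)
S = sp.random(4, 4, density=0.25, format="coo")
Z = sddmm(S, X, Y)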
3.2 System Overview
Figure 3.1 depicts the system overview of GraphAGILE, which consists of the compiler and the hardware accelerator. Figure 3.2 shows the overview of the hardware system.
Figure 3.1: System overview of GraphAGILE
Target application domain: GraphAGILE targets the inference process of various GNN-based applications, such as recommendation system [53], social media [34], citation networks [66], etc. In the target
applications, the graphs can be very large. For example, a graph in recommendation systems may contain
billions of vertices and edges. GraphAGILE supports a broad range of GNN models, including (1) widely
used GNN models (GCN [66], GraphSAGE [53], GAT [107], GIN [126], SGC [117]), (2) GNN models in the
design space of GraphGym [133]. In addition, GraphAGILE has the potential to be applied to other GNN
models. An instance of GraphAGILE is specified by (1) the specifications of a GNN model and (2) the metadata of the input graph.
Figure 3.2: GraphAGILE hardware system
Hardware platform: The hardware platform consists of an FPGA device, FPGA local DDR memory, and
a host processor. The proposed hardware accelerator is deployed on the FPGA device. FPGA local DDR
memory stores the input graph, the GNN model, and binary files generated by the compiler. The compiler
is executed on the host processor.
Compiler: Users define the GNN using the PyTorch Geometric (PyG) library. The inputs to the compiler are (1) the computation graph of the GNN model generated by PyG and (2) the input graph. The Input Parser (Figure 3.1) extracts the specifications of the GNN model and the information of the input graph to generate the Intermediate Representation (IR). After obtaining the IR, the compiler performs the four optimization steps on the GNN computation graph as shown in Figure 3.1. Then, the compiler generates a sequence of instructions to execute on the hardware accelerator.
3.3 Intermediate Representation
This Section introduces the proposed Intermediate Representation (IR). The IR is generated by the input parser from the user-defined GNN model (specified using the PyTorch Geometric library [36]) and the input graph. An example of a user-defined GNN model is shown in Listing 3.1. The IR represents the high-level abstraction of the computation graph of the input model. Then, the compiler performs several optimizations and transforms the IR into a sequence of instructions.
import torch
from torch import Tensor
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='.', name='Cora')

class GCN(torch.nn.Module):
    def __init__(self, in_ch, hidden_ch, out_ch):
        super().__init__()
        self.conv1 = GCNConv(in_ch, hidden_ch)
        self.conv2 = GCNConv(hidden_ch, out_ch)

    def forward(self, x: Tensor, edge_index: Tensor):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

model = GCN(dataset.num_features, 16, dataset.num_classes)

Listing 3.1: A user-defined GNN model using PyG [36]
3.3.1 Intermediate Representation
Table 3.1: Intermediate representation (IR) of a computation layer

Field                  Values
Layer Type             Aggregate(0), Linear(1), Vector-Inner(2), Vector-Add(3), Activation(4), BatchNorm(5)
Layer ID               1, 2, 3, ...
Parent Layer IDs       Parent1_ID, ...
Child Layer IDs        Child1_ID, ...
Input Dimension        f_in
Output Dimension       f_out
# of vertices          |V|
# of edges             |E|
Aggregation operator   Max, Sum, Min, Mean
Activation type        ReLU, PReLU, Swish, Exp
Activation enabled     True, False
We define a unified Intermediate Representation (IR) for each type of computation layer (Table 3.1).
A GNN layer can be decomposed into a sequence of computation layers. We identify six types of computation layers – Aggregate, Linear, Vector-Inner, Vector-Add, Activation and BatchNorm. The six types of
layers can represent a broad range of models because (1) the key computation kernels of GNNs (SpDMM,
GEMM, and SDDMM) can be represented as Aggregate, Linear, or Vector-Inner, (2) the auxiliary kernels
such as non-linear activation, residual connection, and batch normalization can be represented using other lightweight layers (e.g., Vector-Add, Activation, and BatchNorm). The compiler translates the GNN model
to a computation graph, with each node being the IR of a layer. For example, the GNN model [66] in Listing
3.1 is translated to the computation graph in Figure 3.3. The abstraction of each type of computation layer
is described in the following:
Figure 3.3: Computation graph of the GNN in Listing 3.1
Aggregate layer: The inputs are the vertex feature vectors $\{h_i^{l-1} \in R^{f_{in}} : v_i \in V\}$ and the edges $\{e : e \in E\}$. The output feature vector of each vertex is calculated by:

$h_i^l = \mathrm{AggOp}(A_{j,i} \times h_j^{l-1}, j \in N(i)), \quad h_i^l \in R^{f_{out}}$   (3.1)

where $f_{in} = f_{out}$ and AggOp() is the element-wise Aggregation Operator defined in Table 3.1 (e.g., Max, Sum).
Linear layer: The inputs are the vertex feature vectors $\{h_i^{l-1} \in R^{f_{in}} : v_i \in V\}$ and weight matrix $W \in R^{f_{in} \times f_{out}}$. The output feature vector of each vertex is calculated by:

$H_{out} = [h_1^l; h_2^l; ...; h_{|V|}^l] = [h_1^{l-1}W; h_2^{l-1}W; ...; h_{|V|}^{l-1}W] = [h_1^{l-1}; h_2^{l-1}; ...; h_{|V|}^{l-1}]W = H_{in}W$   (3.2)

where $[h_1^{l-1}; h_2^{l-1}; ...; h_{|V|}^{l-1}]$ is the input feature matrix $H_{in}$ and $[h_1^l; h_2^l; ...; h_{|V|}^l]$ is the output feature matrix $H_{out}$.
Vector-Inner layer: The inputs are the vertex feature vectors $\{h_i^{l-1} \in R^{f_{in}} : v_i \in V\}$ and the edges $e(i, j)$ without edge weight. The output is the weight of each edge calculated by:

$e(i,j).weight = \langle h_i^{l-1}, h_j^{l-1} \rangle, \quad e(i,j) \in E$   (3.3)
Vector-Add layer: The Vector-Add layer adds the feature vectors of two layers. This layer can be used to capture the residual connection design.
Activation layer: The Activation layer applies the element-wise activation function (e.g., ReLU, PReLU,
Swish, Exp) to vertex features or edge weights.
BatchNorm layer: The input is the feature vector of each vertex $\{h_i^{l-1} \in R^{f_{in}} : v_i \in V\}$. A batch normalization operation [59] is applied to each vertex feature.
3.4 Compiler Design
This Section introduces the compiler design of GraphAGILE. The compiler reads the user-defined GNN
model and the input graph, and generates a sequence of instructions. Users define the GNN model using the high-level API of the PyTorch Geometric (PyG) library [36], which is a general framework for GNNs. There are two phases for instruction generation – the translation phase and the optimization phase. In the translation phase, the Input Parser generates the Intermediate Representation (IR) from the inputs. In the optimization phase, the compiler performs four optimization steps and generates the output instruction sequence: (1) Step 1: the compiler reorders the computation graph based on the theoretical computation complexity. (2) Step 2: the compiler merges some adjacent layers so that intermediate data are communicated through on-chip memory. (3) Step 3: the compiler performs data partitioning based on the available on-chip memory to optimize off-chip data communication and enable dynamic task scheduling. (4) Step 4: the compiler maps the various kernels to the Adaptive Computation Kernel (ACK) and performs task scheduling to hide the data communication overhead and
achieve dynamic load balance.
3.4.1 IR Generation Workflow
The proposed intermediate representation consists of two components: LayerIR and ModelIR. LayerIR is
the IR of a computation layer that stores the parameters of a layer, as shown in Table 3.1. ModelIR stores
a list of LayerIRs and represents the computation graph corresponding to the target GNN model and the
input graph. The implementation of LayerIR and ModelIR is demonstrated in Listing 3.2.
During compilation, the compiler first translates each computation layer into a LayerIR. Then, all the
LayerIRs are connected to form a ModelIR, which represents the computation graph of the input GNN
model and the input graph. An example of the IR generation process for the GNN model in Listing 3.1 is
illustrated in Listing 3.3 (Lines 12-39). Note that for illustration, the example in Listing 3.3 is an unfolded
view of the IR generation process. In the actual implementation, the input parser automatically generates
the ModelIR using a for loop. After IR generation, the compiler performs compiler optimizations, as
shown in Listing 3.3 (Lines 42-46).
from collections import OrderedDict

## The IR of a computation layer
class LayerIR(object):
    def __init__(self):
        self._layertype = None     # Layer Type
        self._layerid = 0          # Layer ID
        self._parent_id = []       # Parent Layer IDs
        self._child_id = []        # Child Layer IDs
        self._fin = 0              # Input Dimension
        self._fout = 0             # Output Dimension
        self._nv = 0               # Number of vertices
        self._ne = 0               # Number of edges
        self._aggoperator = None   # AggOp()
        self._act = None           # Activation type

    def setparameter(self, **kwargs):
        # Set the parameters of the computation layer
        pass

    def complexity(self):
        # Return the theoretical computation complexity of the computation layer
        pass

## The IR of a GNN model
class ModelIR(object):
    def __init__(self):
        self._layers = OrderedDict()
        self._graphs = None
        self._numl = 0

    def addlayers(self, layer):
        self._layers[layer._layerid] = layer
        self._numl = self._numl + 1

    def orderoptize(self):
        # Step 1: computation order optimization
        pass

    def layerfusion(self):
        # Step 2: layer fusion
        pass

    def datapartition(self):
        # Step 3: data partitioning
        pass

    def kernelMapping(self):
        # Step 4: kernel mapping
        pass

    def taskScheduling(self):
        # Step 4: task scheduling
        pass

    def codeGeneration(self, filename):
        # Generate the instruction sequence and write it to filename
        pass

Listing 3.2: The implementation of LayerIR and ModelIR
1 dataset = 'Cora'
2 path = osp.join('.', 'data', dataset)
3 dataset = Planetoid(path, dataset, transform=T.NormalizeFeatures())
4 data = dataset[0]
5
6 nedges = data.edge_index.shape[1]
7 nvertices = data.x.shape[0]
8 nflen = data.x.shape[1]
9 edge_index = data.edge_index
10 edge_index = torch.transpose(edge_index, 0, 1)
11
12 ## IR generation
13 GNN1 = ModelIR()
14 GNN1._graphs = data
15
16 aggregate1 = LayerIR()
17 aggregate1.setparameter(
18     layertype='Aggregate', layerid=1, parent_id=[], child_id=[2],
      fin=nflen, fout=nflen, nv=nvertices, ne=nedges, aggoperator=2,
      act=None, actenable=False, batchenable=False)
19 GNN1.addlayers(aggregate1)
20
21 linear1 = LayerIR()
22 linear1.setparameter(
23     layertype='Linear', layerid=2, parent_id=[1], child_id=[3],
      fin=nflen, fout=16, nv=nvertices, ne=nedges, aggoperator=None,
      act=None, actenable=False, batchenable=False)
24 GNN1.addlayers(linear1)
25
26 activation1 = LayerIR()
27 activation1.setparameter(
28     layertype='Activation', layerid=3, parent_id=[2], child_id=[4],
      fin=16, fout=16, nv=nvertices, ne=nedges, aggoperator=None,
      act='ReLU', actenable=True, batchenable=False)
29 GNN1.addlayers(activation1)
30
31 aggregate2 = LayerIR()
32 aggregate2.setparameter(
33     layertype='Aggregate', layerid=4, parent_id=[3], child_id=[5],
      fin=16, fout=16, nv=nvertices, ne=nedges, aggoperator=2,
      act=None, actenable=False, batchenable=False)
34 GNN1.addlayers(aggregate2)
35
36 linear2 = LayerIR()
37 linear2.setparameter(
38     layertype='Linear', layerid=5, parent_id=[4], child_id=[], fin=16,
      fout=7, nv=nvertices, ne=nedges, aggoperator=None, act=None,
      actenable=False, batchenable=False)
39 GNN1.addlayers(linear2)
40
41 ## IR optimizations
42 GNN1.orderoptize()      # Step 1: computation order optimization
43 GNN1.layerfusion()      # Step 2: layer fusion
44 GNN1.datapartition()    # Step 3: data partitioning
45 GNN1.kernelMapping()    # Step 4: kernel mapping
46 GNN1.taskScheduling()   # Step 4: task scheduling
47 GNN1.codeGeneration('GNN1.ga')  # Generating instruction sequence

Listing 3.3: The example of IR generation
3.4.2 Compilation Step 1: Computation Order Optimization
We design the general rule for the computation order optimization. First, we define the linear operator in
the aggregate layer:
Definition 1 In an Aggregate layer, the aggregation operator AggOp() is a linear operator if AggOp() satisfies the following two properties:

• $\mathrm{AggOp}(h_x + h_y) = \mathrm{AggOp}(h_x) + \mathrm{AggOp}(h_y)$ for any $h_x \in R^f$ and $h_y \in R^f$.
• $\mathrm{AggOp}(c h_x) = c\,\mathrm{AggOp}(h_x)$ for any $h_x \in R^f$ and any constant $c$.

For example, Sum() is a linear operator while Max() is a non-linear operator.
Then, we identify the exchangeability of computation order in Theorem 1:
Theorem 1 For a pair of adjacent Aggregate layer and Linear Layer, if the Aggregation operator AggOp() of
the Aggregate layer is a linear operator, we can exchange the computation order of the Aggregate layer and
Linear Layer.
Proof 1 The computation process of the adjacent Aggregate layer and Linear layer can be expressed as:

$h_i^l = \mathrm{AggOp}(A_{j,i} \times h_j^{l-1} \times W, j \in N(i))$   (3.4)

where AggOp() is the aggregation operator of the Aggregate layer and W is the weight matrix of the Linear layer. Since the operator AggOp() is a linear operator, the above equation can be written as:

$h_i^l = \mathrm{AggOp}(A_{j,i} \times h_j^{l-1}, j \in N(i)) \times W$   (3.5)

Therefore, the computation order of this pair of Aggregate layer and Linear layer can be exchanged without affecting the final result.
The computation order can affect the total computation complexity. The computation complexity (CC) of an Aggregate layer is:

$CC_{\mathrm{Aggregate}}(f_{in}, f_{out}, |E|) = 2 \cdot f_{in} \cdot |E|, \quad (f_{in} = f_{out})$   (3.6)

The computation complexity (CC) of a Linear layer is:

$CC_{\mathrm{Linear}}(f_{in}, f_{out}, |V|) = 2 \cdot f_{in} \cdot f_{out} \cdot |V|$   (3.7)

Suppose the input feature vector to an Aggregate-Linear pair (an Aggregate layer followed by a Linear layer) has length $f_1$ and the output feature vector has length $f_2$. The computation complexity of this Aggregate-Linear pair is:

$CC_{\mathrm{Aggregate\text{-}Linear}} = 2 \cdot f_1 \cdot |E| + 2 \cdot f_1 \cdot f_2 \cdot |V|$   (3.8)

If the Aggregate layer and the Linear layer are exchangeable, the computation complexity after the exchange is:

$CC_{\mathrm{Linear\text{-}Aggregate}} = 2 \cdot f_1 \cdot f_2 \cdot |V| + 2 \cdot f_2 \cdot |E|$   (3.9)

Theorem 2 Based on Equations (3.8) and (3.9), if $f_1 > f_2$, the Linear-Aggregate execution order has lower complexity. If $f_2 > f_1$, the Aggregate-Linear execution order has lower complexity. If $f_1 = f_2$, the Aggregate-Linear and Linear-Aggregate execution orders have the same computation complexity.
Based on Theorem 1 and Theorem 2, we propose the computation order optimization as shown in Algorithm 1. We iteratively apply Algorithm 1 until no layers can be exchanged.
Algorithm 1 Computation Order Optimization
Input: IR (task dependency graph TQ) of the input GNN model; L: number of layers in the IR
Output: Optimized IR (task dependency graph TQ)
1: for l ← 1 to L do
2:     # Sequentially check the following conditions
3:     Check: If layer l has only one child layer: layer m
4:     Check: If layer m has only one parent layer: layer l
5:     Check: If layers l, m form an {Aggregate, Linear} pair
6:     Check: If the operator of the Aggregate layer is linear
7:     Check: If exchanging layers l, m reduces the computation complexity
8:     # Perform conditional computation order exchange
9:     if all the above conditions are met then
10:        Exchange layer l and layer m in the IR
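A minimal sketch of the decision rule in Theorem 2 is shown below; it simply evaluates Equations 3.6-3.9 and picks the cheaper order. The Python code is a standalone illustration (the function names and the example graph size are hypothetical), not the compiler source.

def cc_aggregate(f, num_edges):
    # Equation 3.6: 2 * f * |E|
    return 2 * f * num_edges

def cc_linear(f_in, f_out, num_vertices):
    # Equation 3.7: 2 * f_in * f_out * |V|
    return 2 * f_in * f_out * num_vertices

def best_order(f1, f2, num_vertices, num_edges):
    # Choose between Aggregate->Linear and Linear->Aggregate (Theorem 2).
    agg_lin = cc_aggregate(f1, num_edges) + cc_linear(f1, f2, num_vertices)   # Eq. 3.8
    lin_agg = cc_linear(f1, f2, num_vertices) + cc_aggregate(f2, num_edges)   # Eq. 3.9
    order = "Linear-Aggregate" if lin_agg < agg_lin else "Aggregate-Linear"
    return order, agg_lin, lin_agg

# Example: f1 = 128, f2 = 16, so exchanging (Linear first) shrinks the edge term
print(best_order(128, 16, num_vertices=2708, num_edges=10556))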
3.4.3 Compilation Step 2: Layer Fusion
After computation order optimization, the compiler performs layer fusion consisting of two types: Activation Fusion and BatchNorm Fusion.
Activation Fusion: An Activation layer can be merged into its adjacent layer, including Aggregate layer,
Linear layer, Vector-Inner layer, or Vector-Add layer. Through Activation Fusion, no independent Activation layer is required, which eliminates the external memory traffic between this Activation layer and its
adjacent layer.
BatchNorm Fusion: For inference, the coefficients (µ, σ, ϵ, γ, β) in the element-wise batch normalization operation are fixed:

$y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$

Moreover, the batch normalization operation is a linear operator. Therefore, the BatchNorm layer can be
merged with the adjacent Linear layer. The Linear layer incorporates the batch normalization operation
into its weights and bias. After BatchNorm Fusion, the BatchNorm layer is eliminated, which reduces
total computation complexity and external memory traffic. After layer fusion, the number of computation
layers and the computation order of the layers are determined.
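The arithmetic behind BatchNorm Fusion can be sketched in a few lines: the fixed coefficients (µ, σ, ϵ, γ, β) are folded into the weights and bias of the adjacent Linear layer. The NumPy sketch below assumes the BatchNorm follows the Linear layer and is an illustration of the folding, not the compiler implementation.

import numpy as np

def fold_batchnorm_into_linear(W, b, mu, sigma2, eps, gamma, beta):
    # Fold y = BN(x @ W + b) into an equivalent y = x @ W_f + b_f.
    # Assumes the BatchNorm layer follows the Linear layer and that all
    # BatchNorm coefficients are fixed at inference time.
    scale = gamma / np.sqrt(sigma2 + eps)   # per-output-channel scale
    W_f = W * scale                         # scale each output column
    b_f = (b - mu) * scale + beta
    return W_f, b_f

# Quick check on random data: the fused layer matches Linear followed by BatchNorm
f_in, f_out = 8, 4
x = np.random.rand(5, f_in)
W, b = np.random.rand(f_in, f_out), np.random.rand(f_out)
mu, sigma2 = np.random.rand(f_out), np.random.rand(f_out) + 0.5
gamma, beta, eps = np.random.rand(f_out), np.random.rand(f_out), 1e-5
W_f, b_f = fold_batchnorm_into_linear(W, b, mu, sigma2, eps, gamma, beta)
ref = ((x @ W + b) - mu) / np.sqrt(sigma2 + eps) * gamma + beta
assert np.allclose(x @ W_f + b_f, ref)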
3.4.4 Compilation Step 3: Data Partitioning
In real-world applications, input graphs can be very large. The compiler performs data partitioning for each layer, starting from the first layer to the last layer. We propose the Fiber-Shard data partitioning (Figure 3.4) to fit the available on-chip memory. In each layer, the graph has an adjacency matrix $A \in R^{|V| \times |V|}$ and a feature matrix $H \in R^{|V| \times f}$ that need to be partitioned. A contains all the edges and is partitioned into shards along the row dimension. Each shard has N1 rows and is partitioned into subshards, with each subshard having N1 columns. The edges in a subshard are stored sequentially in DDR memory, and the subshards in a shard are stored in a contiguous region of DDR memory, as shown in Figure 3.4. The feature matrix H is partitioned into fibers along the column dimension, and each fiber is assigned N2 columns. Each fiber is further partitioned into subfibers, and each subfiber has N1 rows. For simplicity, A(i, j) denotes subshard j of shard i, and H(i, j) denotes subfiber j of fiber i. The same partitioning configuration (N1, N2) is applied to each layer. The proposed partitioning strategy enables the proposed partition-centric execution scheme (Algorithms 2, 3, 4), which further ensures that the outputs of a layer maintain the same partitioning configuration (N1, N2) as the input. Therefore, the outputs of a layer can be directly used as the input for the next layer, and no data re-partitioning is required between layers.
Figure 3.4: Data partitioning and memory mapping
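The Fiber-Shard partitioning can be prototyped on the host as follows. The sketch groups a COO edge list into N1 × N1 subshards and slices the feature matrix into N2-column fibers of N1-row subfibers; it is a simplified illustration (Python dictionaries instead of the actual DDR memory mapping), and the subshard indexing convention shown is one plausible choice, not necessarily the exact hardware layout.

import numpy as np
from collections import defaultdict

def fiber_shard_partition(edges, H, N1, N2):
    # edges: iterable of (src, dst, weight); H: |V| x f feature matrix.
    # Subshards are keyed by (destination row block, source column block);
    # H(i, j) is columns [i*N2, (i+1)*N2) and rows [j*N1, (j+1)*N1) of H.
    subshards = defaultdict(list)
    for src, dst, w in edges:
        subshards[(dst // N1, src // N1)].append((src, dst, w))
    num_v, f = H.shape
    subfibers = {}
    for i in range(int(np.ceil(f / N2))):          # fiber index (columns)
        for j in range(int(np.ceil(num_v / N1))):  # subfiber index (rows)
            subfibers[(i, j)] = H[j * N1:(j + 1) * N1, i * N2:(i + 1) * N2]
    return subshards, subfibers

# Toy usage: 6 vertices, 4 features, N1 = 2, N2 = 2
shards, fibers = fiber_shard_partition(
    [(0, 1, 1.0), (2, 3, 0.5), (4, 5, 2.0)], np.random.rand(6, 4), N1=2, N2=2)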
Partition-Centric execution scheme: Based on the Fiber-Shard data partitioning, we propose the partition-centric execution scheme, in which the execution of a layer is decomposed into a sequence of operations that operate on the data tiles (subshards or subfibers). For example, the execution of an Aggregate layer is described in Algorithm 2. The proposed partition-centric execution scheme leads to reduced memory traffic and reduced random memory access. For a detailed theoretical and empirical analysis of executing the Aggregate layer, please see our previous work [137]. The proposed partition-centric execution scheme has the following benefits: (1) it enables our block-based kernel mapping, where each Tiling Block can be executed by a PE independently and there is no data dependency among Tiling Blocks within a layer, and (2) it enables unified dynamic task scheduling for each computation layer (Section 3.4.5).
Data partitioning of a Linear layer: A Linear layer performs matrix multiplication of the input feature matrix $H_{in} \in R^{|V| \times f_{in}}$ and the weight matrix $W \in R^{f_{in} \times f_{out}}$. The output feature matrix is $H_{out} = H_{in}W$.
Algorithm 2 Partition-Centric execution scheme of an Aggregate Layer
Input: A, Hin, partitioning configuration (N1, N2)
Output: Hout
1: Execution of an Aggregate layer
2: for i ← 1 to fin/N2 do
3:     for j ← 1 to |V|/N1 do
4:         if there is an idle PE: PEp then
5:             Assign Hout(i, j) to PEp
6:             Initialize Hout(i, j)
7:             for k ← 1 to |V|/N1 do
8:                 load A(j, k) to Edge Buffer
9:                 load Hin(k, i) to Feature Buffer
10:                Hout(i, j) ← SpDMM(A(j, k), Hin(k, i))
11:            Apply activation if required
12:            Store Hout(i, j)
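Algorithm 2 can likewise be prototyped in software as nested loops over subfibers that each accumulate over the column blocks of A. The NumPy sketch below uses dense blocks for clarity instead of the on-chip SpDMM primitive and is illustrative only.

import numpy as np

def aggregate_layer_partition_centric(A, H_in, N1, N2):
    # A: |V| x |V| adjacency (dense here for simplicity); H_in: |V| x f_in features.
    # Each output tile H_out(i, j) accumulates block products over column blocks k.
    num_v, f_in = H_in.shape
    H_out = np.zeros_like(H_in, dtype=float)
    for i in range(int(np.ceil(f_in / N2))):          # loop over fibers (columns)
        for j in range(int(np.ceil(num_v / N1))):     # loop over subfibers (rows)
            rows = slice(j * N1, (j + 1) * N1)
            cols = slice(i * N2, (i + 1) * N2)
            tile = np.zeros((min(N1, num_v - j * N1), min(N2, f_in - i * N2)))
            for k in range(int(np.ceil(num_v / N1))): # accumulate over A(j, k)
                a_block = A[rows, k * N1:(k + 1) * N1]
                h_block = H_in[k * N1:(k + 1) * N1, cols]
                tile += a_block @ h_block
            H_out[rows, cols] = tile
    return H_out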
For the Linear layer, we perform standard block matrix multiplication, and the data partitioning keeps the same partitioning configuration as described above for the input feature matrix $H_{in}$ and the output feature matrix $H_{out}$. The basic computation kernel of a Linear layer is GEMM.
Data partitioning of a Vector-Inn layer: A Vector-Inn layer samples results from $(H_{in}H_{in}^T)$ using the adjacency matrix $A_{in}$, which is denoted as $A_{in} \odot (H_{in}H_{in}^T)$. The output $A_{out}$ is the combination of $A_{in}$ and the weight value of each non-zero position in $A_{in}$. The Vector-Inn layer exploits the same partitioning strategy as the Aggregate layer (see the Fiber-Shard partitioning above). The execution scheme of a Vector-Inn layer is shown in Algorithm 3.
Data partitioning of a Vector-Add layer: The inputs to a Vector-Add layer are two input feature matrices of the same size – $H_{in}^{l_1}$ and $H_{in}^{l_2}$. The output feature matrix $H_{out}$ is the addition of the two matrices: $H_{out} = H_{in}^{l_1} + H_{in}^{l_2}$. The execution of the Vector-Add layer is shown in Algorithm 4.
3.4.5 Compilation Step 4: Kernel Mapping and Task Scheduling
Kernel Mapping: Through data partitioning, each layer in the IR is expressed as nested loops (e.g., Algorithm 2) according to the proposed partition-centric execution scheme. The compiler maps each layer to a
Algorithm 3 Partition-Centric execution scheme of a Vector-Inn Layer
Input: Ain, Hin
Output: Aout
1: Execution of a Vector-Inn layer
2: for i ← 1 to |V|/N1 do
3:     for j ← 1 to |V|/N1 do
4:         if there is an idle PE: PEp then
5:             Assign Aout(i, j) to PEp
6:             Initialize Aout(i, j)
7:             load Ain(i, j) to Edge Buffer
8:             for k ← 1 to fin/N2 do
9:                 load Hin(i, k) to Feature Buffer
10:                load Hin(j, k) to Feature Buffer
11:                Z ← SDDMM(Ain(i, j), Hin(i, k), Hin(j, k))
12:                Aout(i, j) ← Apply(Z)
13:            Apply activation if required
14:            Store Aout(i, j)
Algorithm 4 Partition-Centric execution scheme of a Vector-Add Layer
Input: H^{l1}_in, H^{l2}_in
Output: Hout
1: Execution of a Vector-Add layer
2: for i ← 1 to fin/N2 do
3:     for j ← 1 to |V|/N1 do
4:         if there is an idle PE: PEp then
5:             Assign Hout(i, j) to PEp
6:             load H^{l1}_in(i, j) to Feature Buffer
7:             load H^{l2}_in(i, j) to Feature Buffer
8:             Hout(i, j) ← VecAdd(H^{l1}_in(i, j), H^{l2}_in(i, j))
9:             Store Hout(i, j)
sequence of high-level instructions. The kernel mapping is performed hierarchically. Each layer is mapped to a block of instructions called a Layer Block (e.g., Algorithm 2). Each Layer Block contains a Control and Scheduling Instruction (CSI) and a set of Tiling Blocks. The Tiling Blocks are generated by unfolding the outer nested loops of a Layer Block. For example, for an Aggregate layer, the generated CSI contains the information of Lines 2-3 in Algorithm 2, and $\frac{f_{in}}{N_2} \times \frac{|V|}{N_1}$ Tiling Blocks are generated by unfolding the outer loops. A Tiling Block is an inseparable sequence of high-level instructions that will be executed by a PE.
Algorithm 5 Task Scheduling
Input: A and Hin of the input graph; weight matrices; L: number of Layer Blocks
Output: output embedding of each vertex
for l ← 1 to L do
    Load CSI of Layer Block l to Scheduler
    for each Tiling Block in Layer Block l parallel do
        if there is an idle PE: PEp then
            Assign this Tiling Block to PEp
            PEp executes this Tiling Block
    Wait until all the Tiling Blocks are executed
Task Scheduling: As shown in Algorithm 5, GraphAGILE executes the GNN inference layer by layer.
For each Layer Block, the Scheduler loads the heading Control and Scheduling Instruction (CSI). Then, the
Scheduler assigns the Tiling Blocks to the idle PEs, forming a dynamic load balancing strategy. Each PE
maintains a 1-bit output port to indicate its current status (Idle/Busy). When all the Tiling Blocks within a
layer are completely finished, GraphAGILE starts to execute the next layer. Within each Tiling Block, the
computation instructions and memory read/write instructions are interleaved. Therefore, we exploit the
double buffering technique to overlap the computation and data communication. Specifically, the Instruction Decoder & Control Signal Generator needs to issue new memory read instructions before the previous computation instruction has finished, which may incur a Write after Read (WAR) data hazard. Therefore, each buffer in a PE maintains a hardware mutex implemented as a one-bit register. After a memory read instruction loads data to a buffer, it locks the mutex of this buffer. After the computation instruction finishes using the data from this buffer, the mutex is unlocked. When a memory read instruction is stalled by a lock, the Instruction Decoder & Control Signal Generator stops issuing new instructions. Locking/unlocking the mutex is annotated in the high-level instructions by the compiler. The annotation is performed by scanning the data dependencies among the high-level instructions within each Tiling Block, which has negligible complexity. After kernel mapping and mutex annotation, the compiler generates the executable file.
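The dynamic load balancing of Algorithm 5 amounts to a shared work queue per Layer Block: each idle PE grabs the next Tiling Block, and a barrier separates consecutive layers. The thread-based Python sketch below is a software stand-in for the hardware Scheduler and PEs (the function names are hypothetical); it does not model the double-buffering or mutex logic.

import queue
import threading

def run_layer_block(tiling_blocks, num_pes, execute):
    # tiling_blocks: list of work items; execute(pe_id, block) models a PE
    # executing one Tiling Block. Returns after all blocks complete (barrier).
    work = queue.Queue()
    for block in tiling_blocks:
        work.put(block)

    def pe_worker(pe_id):
        while True:
            try:
                block = work.get_nowait()   # an idle PE grabs the next Tiling Block
            except queue.Empty:
                return                      # no more work in this layer
            execute(pe_id, block)
            work.task_done()

    threads = [threading.Thread(target=pe_worker, args=(p,)) for p in range(num_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                            # wait until all Tiling Blocks finish

# Example usage: two PEs, eight dummy Tiling Blocks
run_layer_block(list(range(8)), num_pes=2, execute=lambda pe, b: None)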
3.5 Hardware Accelerator Design
This Section introduces the accelerator design of GraphAGILE. The hardware accelerator of GraphAGILE is an instruction set architecture that supports software-like programmability. Section 3.5.1 presents an overview of the accelerator design. Section 3.5.2 covers the data format and data layout used by the accelerator. Section 3.5.3 describes the instruction set, and Section 3.5.4 elaborates on the details of the hardware architecture.
3.5.1 Overview of Accelerator
Figure 3.5 depicts the proposed hardware architecture. There are Npe Processing Elements (PEs) working
in parallel. At runtime, the Scheduler reads the executable/binary file from the FPGA DDR and assigns
the workload to PEs. Each PE has an Instruction Queue (IQ) to receive the incoming instructions assigned
by the Scheduler. The Instruction Decoder & Control Signal Generator reads the instructions from IQ
and generates the control signals for the hardware modules. Each PE has a Weight Buffer to store the
weight matrices, an Edge Buffer to store the edges, and a Feature Buffer to store the vertex feature vectors.
Each buffer has a data loader&writer that communicates with the FPGA DDR. Each PE has an Adaptive
Computation Kernel, which is the key novelty in our hardware design. The Adaptive Computation Kernel
can execute various computation kernels of GNNs. Moreover, to support dynamic sparsity exploitation,
the accelerator incorporates (1) a Sparsity Profiler to profile the data sparsity of the intermediate results, (2)
Data Layout Management (DLM) Unit and Data Format Management (DFM) Unit to dynamically transform
the data layout and data format for different computation primitives, respectively.
Figure 3.5: Overview of hardware architecture
Hardware parameters: The proposed architecture is defined by the following hardware parameters: (1)
the number of Processing Elements Npe, (2) the dimension of the Adaptive Computation Kernel (ACK)
Psys ×Psys, (3) dimensions of buffers, including the dimension of Weight Buffer NW ×Psys, the dimension
of Edge Buffer NE × 3, the dimension of Feature Buffer NF1 × NF2, (4) the set of arithmetic operations
supported by the ACK and the Activation Unit.
3.5.2 Data Format and Data Layout
Graph data format: We use $h_i^l$ to denote the feature vector of vertex $v_i$ at layer $l$ (Table 2.1). We use the Coordinate Format (COO) to capture all graph edges. Each edge is a 3-tuple (src, dst, weight) denoting the source vertex index, destination vertex index, and edge weight, respectively. We construct the feature matrix H by stacking the feature vectors; each row of H is the feature vector of a vertex. We denote by A the sparse adjacency matrix, where for an edge (u, v, w) we have $A_{u,v} = w$.
Data Layout: The data layout defines the order in which matrix elements are stored. For a sparse matrix in row-major order, elements within the same row are stored in contiguous memory locations; otherwise, it is in column-major order. Similarly, the row-major and column-major orders for a dense matrix can be derived.
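The sketch below illustrates the COO edge tuples and the row-/column-major flattening described above using NumPy; it is a host-side illustration, not the on-device data layout.

import numpy as np

# COO format: each edge is a 3-tuple (src, dst, weight); A[src, dst] = weight
edges = [(0, 1, 0.5), (1, 2, 1.0), (2, 0, 0.25)]

# Feature matrix H: row i is the feature vector h_i of vertex v_i
H = np.arange(12, dtype=float).reshape(3, 4)

# Row-major layout stores each row contiguously; column-major stores each column
row_major = H.flatten(order="C")   # [h_0 | h_1 | h_2]
col_major = H.flatten(order="F")   # column 0, then column 1, ...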
3.5.3 Instruction Set
3.5.3.1 High-level Instruction
The proposed instruction set consists of high-level instructions and microcode. All the high-level instructions have a uniform 128-bit length, and the instruction fields are depicted in Figure 3.6. The OPCODE field
indicates the type of instruction. Other fields contain instruction-specific information.
• Control and Scheduling Instruction (CSI): A CSI contains the meta data of a computation layer in the
intermediate representation (Section 3.3.1). Based on the CSI, the scheduler assigns the workloads
of a layer to the PEs.
• Memory Read/Write Instruction: A memory read/write instruction initiates data communication (model
weights, edges, vertex feature vectors) with FPGA DDR memory.
• GEMM Instruction: A GEMM instruction contains the information (e.g., matrix size, buffer ID that
stores the matrices) of the matrix multiplication between the weight matrix (in the Weight Buffer)
and feature matrix (in the Feature Buffer).
• SpDMM Instruction: A SpDMM instruction performs multiplication of A and H. The instruction
specifies the number of non-zero elements in A (which enables edge-centric computation of SpDMM)
and buffer ID that stores A.
• SDDMM Instruction: Similar to the SpDMM instruction, it specifies the number of non-zero elements in A and the buffer IDs that store A and H.
• Other instructions: There are other instructions including the Initialization Instruction, Activation
Instruction, etc.
Figure 3.6: GraphAGILE high-level instruction fields
3.5.3.2 Microcode
A high-level instruction defines a computation task at coarse granularity. To execute a high-level instruction, the Instruction Decoder & Control Signal Generator translates it into a sequence of fine-grained microcode that can be executed by the ACK. The translation is performed by looking up the Microcode Table. For example, a GEMM instruction defines the multiplication of a large feature matrix (stored in the Feature Buffer) and a large weight matrix (stored in the Weight Buffer). The GEMM instruction is decomposed into block matrix multiplications (BlockMM), where the block size corresponds to the dimension of the ACK. The microcode of GEMM uses a three-level nested loop to execute the BlockMM on the ACK. The microcode of GEMM, SpDMM, and SDDMM is described as follows:
Microcode of GEMM instructions: A high-level GEMM instruction is translated to a sequence of microcode to execute the GEMM between a block of the feature matrix $H_B \in R^{S_B \times Len}$ and a block of the weight matrix $W_B \in R^{Len \times G_B}$. The pseudocode of the sequence of microcode is described in Algorithm 6. In GEMM mode, the Adaptive Computation Kernel (ACK) works as a 2-D systolic array of size $p_{sys} \times p_{sys}$ using an output-stationary dataflow. In each clock cycle, the ACK receives $p_{sys}$ data from the Feature Buffer and $p_{sys}$ data from the Weight Buffer, respectively. In the Feature Buffer, $H_B$ is further partitioned into small data tiles along the row dimension, and each data tile $H_{T:i}$ has $p_{sys}$ rows. Similarly, in the Weight Buffer, $W_B$ is partitioned into small data tiles along the column dimension, and each data tile $W_{T:i}$ has $p_{sys}$ columns of $W_B$. $H_{out:ij}$ denotes the result of the multiplication between $H_{T:i}$ and $W_{T:j}$.

Figure 3.7: GEMM between a block of feature matrix $H_B$ (stored in Feature Buffer) and a block of weight matrix $W_B$ (stored in Weight Buffer)
Algorithm 6 Pseudocode of GEMM microcode
Input: HB; WB
Output: Hout
1: for i ← 1 to SB/psys do
2:     for j ← 1 to GB/psys do
3:         // Pipelined execution of Hout:ij = HT:i × WT:j
4:         for k ← 1 to Len Parallel do
5:             Load the psys data of the k-th column of HT:i and send them to ACK
6:             Load the psys data of the k-th row of WT:j and send them to ACK
Algorithm 7 Pseudocode of SpDMM microcode
Input: HB; AB; number of edges in AB: Ne
Output: Hout
1: for i ← 1 to 2Ne/psys do    ▷ Pipelined execution of SpDMM
2:     Load psys/2 unprocessed edges from AB in Edge Buffer
3:     Send the psys/2 edges to Index Shuffle Network (ISN)
Microcode of SpDMM instructions: A high-level SpDMM instruction is translated to a sequence of microcode to execute the SpDMM between a block of feature matrix $H_B$ (stored in the Feature Buffer) and a block of the sparse adjacency matrix $A_B$ (stored in the Edge Buffer). The execution of SpDMM is edge-centric (see Section 3.5.4.1). Therefore, in each clock cycle, $p_{sys}/2$ unprocessed edges in $A_B$ are fetched from the Edge Buffer. The $p_{sys}/2$ edges are sent to the Index Shuffle Network to execute feature aggregation.
Algorithm 8 Pseudocode of SDDMM microcode
Input: HB; AB; number of edges in AB: Ne
Output: weights of all the edges in AB
1: for i ← 1 to 2Ne/psys do    ▷ Pipelined execution of SDDMM
2:     Load psys/2 unprocessed edges from AB in Edge Buffer
3:     Extract the psys/2 src indices and psys/2 dst indices
4:     Send the psys indices to ISN
Microcode of SDDMM instructions: A high-level SDDMM instruction is translated to a sequence of microcode to execute the SDDMM using a block of feature matrix $H_B$ (stored in the Feature Buffer) and a block of the sparse adjacency matrix $A_B$ (stored in the Edge Buffer). Similar to SpDMM, the execution of SDDMM is edge-centric (see Section 3.5.4.1). In each clock cycle, $p_{sys}/2$ unprocessed edges in $A_B$ are fetched from the Edge Buffer. The $p_{sys}/2$ src indices and $p_{sys}/2$ dst indices are extracted from the $p_{sys}/2$ unprocessed edges. Then, the total of $p_{sys}$ indices is sent to the Index Shuffle Network (ISN) to execute the SDDMM of $A_B$ and $H_B$.
3.5.4 Hardware Architecture
3.5.4.1 Adaptive Computation Kernel
Figure 3.8: Adaptive Computation Kernel (when psys = 8) with ISN and DSN. The interconnections among ALUs are specified in Figure 3.9.
As shown in Figure 3.8, an ACK contains an array of Arithmetic Logic Units (ALUs) of size $p_{sys} \times p_{sys}$, where $p_{sys}$ is a power of 2. An ALU can execute various arithmetic operations, including Multiplication, Addition, Accumulation, Min, Max, etc. The interconnections among the ALUs are shown in Figure 3.9. The array of ALUs is divided into Update Units and Reduce Units. An Update Unit or a Reduce Unit has size $(p_{sys}/2) \times 2$. The Feature Buffer has $p_{sys}$ parallel memory banks; $h_i$ is stored in bank $(i \bmod p_{sys})$. There are two interconnection networks – the Index Shuffle Network (ISN) and the Data Shuffle Network (DSN). The ISN routes edges to the memory banks of the Feature Buffer for fetching the features of incident vertices. The DSN routes the input data (vertex features with the edge) to the Adaptive Computation Kernel. The routing is based on the least significant $\log(p_{sys})$ bits of the vertex index. The ACK has various execution modes, including GEMM mode, SpDMM mode, SDDMM mode, and Vector-Addition mode. Each ALU maintains multiplexers with control logic to select the input and output ports for an execution mode. The mode switching incurs an overhead of only one clock cycle.
Figure 3.9: Datapath of GEMM mode, SpDMM mode, SDDMM mode
GEMM mode: The array of ALUs is organized as a two-dimensional systolic array with fully localized interconnection. GEMM mode supports dense matrix multiplication of feature matrix H and weight matrix W. The Weight Buffer streams the weight matrix into the systolic array, and the Feature Buffer streams multiple vertex feature vectors into the systolic array. A systolic array of size psys × psys executes psys^2 multiply-accumulate operations per clock cycle.
Algorithm 9 SpDMM following Scatter-Gather paradigm
while not done do
for each edge e(src, dst, weight) do ▷ Scatter Phase
Fetch src.features from Feature Buffer
Form input pair (src.features, e)
for each input pair do ▷ Gather Phase
Produce u ←Update(src.features, e.weight)
Update vdst ← Reduce(u.features)
SpDMM mode: As shown in Algorithm 9, SpDMM is executed following the Scatter-Gather paradigm.
The array of ALUs in ACK is divided into multiple Update Units and Reduce Units. In each Update Unit, the
ALUs are organized as a vector multiplier that multiplies the vertex feature vector by the edge weight. In
each Reduce Unit, the ALUs execute the element-wise reduction operation ρ(). Suppose a vertex is defined by (src, features), where src denotes the source vertex index and features is the feature vector of the source vertex. The intermediate results generated by the Update Units are represented by (dst, features).
The intermediate results are applied to the destination vertex vdst by the Reduce Unit. An Update Unit
and a Reduce Unit form an “UR-pipeline”. The computation of SpDMM is driven by unprocessed edges
(i.e., edge-centric processing [148]). Unprocessed edges are fetched from Edge Buffer to ISN. In ISN, an edge
e is routed to the corresponding memory bank in Feature Buffer to fetch src.features, thus forming the
input pair (src.features, e). Then, the DSN routes the input pairs to the UR pipelines based on the dst
of the edge. The input pairs having e.dst = i × psys + k (0 ⩽ k < psys) will be routed to the ⌊k/2⌋-th UR pipeline. This is because the output port of the ⌊k/2⌋-th UR pipeline is connected to banks 2⌊k/2⌋ and 2⌊k/2⌋ + 1 of the Feature Buffer, where v_e.dst is stored. Then, the UR pipeline processes the input pair, and the intermediate result generated by the input pair is applied to the destination vertex v_e.dst. psys/2 input pairs can be processed by the psys/2 UR pipelines concurrently.
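As a software reference for this edge-centric Scatter-Gather flow, the following minimal Python sketch (NumPy-based; the bank and UR-pipeline indices are computed only to illustrate the routing rule and are not used in the arithmetic) produces the same aggregation result:

import numpy as np

def spdmm_edge_centric_sketch(edges, features, num_vertices, p_sys=8):
    """Software model of the SpDMM mode (Algorithm 9): edge-centric Scatter-Gather.
    'edges' is a list of (src, dst, weight); 'features' is a (num_vertices x f) array."""
    feat_len = features.shape[1]
    h_out = np.zeros((num_vertices, feat_len), dtype=features.dtype)
    for (src, dst, weight) in edges:
        bank = src % p_sys                      # ISN routes the edge to this Feature Buffer bank
        ur_pipeline = (dst % p_sys) // 2        # DSN routes the input pair to this UR pipeline
        assert 0 <= bank < p_sys and 0 <= ur_pipeline < p_sys // 2
        u = weight * features[src]              # Update Unit: scale source features by edge weight
        h_out[dst] += u                         # Reduce Unit: apply result to destination vertex
    return h_out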
SDDMM mode: The basic operation is the inner product of two feature vectors. For each edge (src, dst), the feature vectors hsrc and hdst are fetched from the Feature Buffer. The result of the inner product of hsrc and hdst becomes the weight of the edge (src, dst). To support the inner product, the ALUs in a UR pipeline form a multiply-adder tree. The topological structure of the multiply-adder tree is shown in Figure 3.9. Similar to SpDMM, the execution of SDDMM is edge-centric. For an edge (src, dst), src and dst are routed to the corresponding memory banks of Feature Buffer to fetch hsrc and hdst. The inner product of hsrc and hdst is executed by a UR pipeline. The ACK can execute psys/2 vector inner products of length psys during each clock cycle. The dot product of two feature vectors of length |hi| is executed in ⌈|hi|/psys⌉ cycles, and the intermediate result is stored at the root node of the adder tree for accumulation.
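A minimal Python sketch of this SDDMM computation (the chunked accumulation mirrors how a psys-wide multiply-adder tree would process the vectors; function and variable names are illustrative):

import numpy as np

def sddmm_edge_weights_sketch(edges, features, p_sys=8):
    """Software model of the SDDMM mode: for each edge (src, dst), the edge weight is the
    inner product of h_src and h_dst, accumulated in ceil(|h_i| / p_sys) chunks."""
    weights = []
    for (src, dst) in edges:
        h_src, h_dst = features[src], features[dst]
        acc = 0.0
        for k in range(0, len(h_src), p_sys):          # one p_sys-wide slice per clock cycle
            acc += float(np.dot(h_src[k:k + p_sys], h_dst[k:k + p_sys]))
        weights.append(acc)                            # accumulated at the adder-tree root
    return weights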
Vector Addition Mode: In Vector Addition Mode, the basic operation is the addition of two feature
vectors. An Update Unit (See Figure 3.8) works as a vector adder. To add hu and hv, the indices u and
v are routed through Index Shuffle Network (ISN) to Feature Buffer to fetch hu and hv. Then hu and hv
are routed to an Update Unit through Data Shuffle Network (DSN) to perform vector addition. The results
will bypass the Reduce Unit and are sent back to the Feature Buffer. The ACK can execute psys/2 vector additions of length psys at each clock cycle. Two feature vectors of length |hi| can be added in ⌈|hi|/psys⌉ cycles.
3.5.4.2 Parallel On-chip Memory Access
The Feature Buffer supports the parallel memory access patterns of the various computation modes, enabled by the ISN and DSN. The Feature Buffer has psys parallel memory banks, and the feature vector of vertex vi is stored in bank (i mod psys). The Edge Buffer can output psys edges at each clock cycle by having port width psys · de, where de is the bit width of an edge. The ISN performs all-to-all interconnection between the Edge Buffer and the Feature Buffer. The DSN performs all-to-all interconnection between the Feature Buffer and the ACK. The ISN and the DSN are implemented using the butterfly network [19].
Parallel memory accesses in GEMM mode: The ACK directly fetches psys features from psys memory
banks per clock cycle. No data shuffling is required for GEMM. The Weight Buffer also has psys memory
banks that can output psys data of the weight matrix each clock cycle.
Parallel memory accesses in SpDMM mode: psys/2 edges {e1, e2, ..., e_psys/2} are sent to the ISN simultaneously. The edges are routed to the corresponding memory banks of Feature Buffer based on their src. The psys/2 edges generate psys/2 input pairs (src.features, e) after fetching the feature vectors. Then the psys/2 input pairs are routed to the corresponding UR pipelines based on their e.dst.
Parallel memory accesses in SDDMM mode: psys/2 edges {e1, e2, ..., e_psys/2} are fetched from the Edge Buffer in each cycle. The psys/2 src indices and psys/2 dst indices {src1, dst1, src2, dst2, ..., src_psys/2, dst_psys/2} of the psys/2 edges are sent to the psys input ports of the ISN. The ISN routes the psys indices to the Feature Buffer to fetch the psys vertex feature vectors. Then, the psys feature vectors are routed to the psys/2 UR pipelines of the ACK. The i-th UR pipeline performs the inner product of h_srci and h_dsti.
3.6 Experiments
3.6.1 Implementation Details
Implementation of hardware accelerator: We implement the hardware design on a state-of-the-art FPGA
platform, Xilinx Alveo U250, consisting of four Super Logic Regions (SLRs). The FPGA DDR memory
has four channels with 77 GB/s memory bandwidth. On U250, we implement 8 PEs where each SLR
contains 2 PEs of psys = 16. We develop GraphAGILE using Verilog HDL. We synthesize the design and
perform Place&Route using Xilinx Vivado 2021.1 to obtain the frequency and resource utilization report.
GraphAGILE on Alveo U250 consumes 778K LUTs (45%), 10240 DSPs (83%), 1853 BRAMs (69%) and 1050
URAMs (82%). GraphAGILE runs at 300 MHz. Then, we build a cycle-accurate simulator for the hardware
accelerator to evaluate its performance. We use Ramulator [64] to simulate the performance of FPGA
DDR memory. We develop the compiler using Python. At runtime, the compiler reads the user-defined
GNN models (defined using Pytorch Geometric library (PyG) [91]) and input graphs. Then, the compiler
generates the binary file for the hardware accelerator and performs preprocessing for the input graph.
After that, the binary file, GNN model weights, and preprocessed input graph are sent to the FPGA DDR memory through PCIe. For performance simulation, we set the PCIe bandwidth to 31.5 GB/s, which is the same as the baseline CPU-GPU platform for a fair comparison. The Alveo U250 board has four DDR memories, each connected to an SLR, and each DDR memory has a capacity of 16 GB. The DDR memory on the Alveo U250 board is sufficient to store the input graphs used in our experiments (Table 3.5). For example, the largest graph used in our experiments, Amazon-Products, has a total size of 7.2 GB, including the vertex features and the edges.
Figure 3.10: Device map on Xilinx Alveo U250 FPGA board
Table 3.2: Hardware resource utilization
LUTs DSPs BRAMs URAMs
Soft Processor 5.5K 6 26 0
One PE 118K 1296 192 120
FPGA shell 182K 13 447 0
Total 1008K 9091 1819 840
Available 1728K 12288 2688 960
Utilization 58% 73% 67.6% 87.5%
Implementation of compiler: We develop the compiler using Python. The compiler takes as input the user-defined ML models, which are developed using PyTorch [91], PyTorch Geometric [36], and Hugging Face [116]. The compiler then generates an intermediate representation from the user-defined ML models. Each compilation step is wrapped as a function that is applied to the intermediate representation. Finally, the compiler generates the bytecode for the runtime system.
3.6.2 Baselines, Benchmarks, Performance Metrics
Baselines: As shown in Table 3.3, we compare our design with state-of-the-art baselines: CPU-only platform (AMD Ryzen 3990x), CPU-GPU (AMD Ryzen 3990x + Nvidia RTX3090), HyGCN [127], BoostGCN
[137], AWB-GCN [45].
Benchmarks: We use eight GNN models in Table 3.4 and seven graph datasets ∗
in Table 3.5 as benchmarks.
Table 3.3: Specifications of platforms
Platforms CPU GPU HyGCN [127] AWB-GCN [45] BoostGCN [137] GraphAGILE
Platform AMD Ryzen 3990x Nvidia RTX3090 ASIC Stratix 10 SX Stratix 10 GX Alveo U250
Platform Technology TSMC 7 nm TSMC 7 nm TSMC 12 nm Intel 14 nm Intel 14 nm TSMC 16 nm
Frequency 2.90 GHz 1.7 GHz 1 GHz 330 MHz 250 MHz 300 MHz
Peak Performance 3.7 TFLOPS 36 TFLOPS 4608 GFLOPS 1351 GFLOPS 640 GFLOPS 614 GFLOPS
On-chip Memory 256 MB L3 cache 6 MB L2 cache 35.8 MB 22MB 32 MB 45 MB
Memory Bandwidth 107 GB/s 936.2 GB/s 256 GB/s 57.3 GB/s 77 GB/s 77 GB/s
Table 3.4: Evaluated GNN models in the experiments
Notation Layer Type # of layers Hidden Dimension Ref.
b1 GCN layer 2 16 [66, 45, 127]
b2 GCN layer 2 128 [127, 45]
b3 GraphSAGE layer 2 128 [53, 136]
b4 GraphSAGE layer 2 256 [53, 136]
b5 GIN layer 5 128-128-128-128 [126]
b6 GAT layer 2 64 [107]
b7 SGC layer 1 (k=2) N/A [117]
b8 GraphGym layer 1 preprocessing layer + 3 GNN layers + 1 postprocessing layer 256 [133]
Performance Metrics: We evaluate the performance by:
• End-to-End (E2E) latency TE2E: The TE2E of GraphAGILE includes (1) the latency of software compilation TLoC on the host processor, (2) the latency of CPU-FPGA data movement Tcomm, and (3) the latency of executing GNN inference on the accelerator (latency of hardware execution TLoH). The latency of moving data (processed graph, GNN model, binary file) from the host platform to the FPGA DDR, Tcomm, is estimated as Tcomm = (total data volume) / (sustained PCIe bandwidth). The end-to-end latency of GraphAGILE is then calculated as TE2E = TLoC + Tcomm + TLoH.

∗ Reddit in Table 3.5 is from a pre-existing publicly available third-party dataset [53].

Table 3.5: Dataset Statistics (GraphAGILE)
Dataset Vertices Edges Features Classes
Citeseer (CI) [66] 3327 4732 3703 6
Cora (CO) [66] 2708 5429 1433 7
Pubmed (PU) [66] 19717 44338 500 3
Flickr (FL) [136] 89,250 899,756 500 7
Reddit (RE) [53] 232,965 116,069,919 602 41
Yelp (YE) [136] 716,847 6,977,410 300 100
Amazon-Products (AP) [55] 1,569,960 264,339,468 200 107
• Latency of compilation (LoC) TLoC: The latency of compilation is the overhead of software or hardware compilation. The measured TLoC of GraphAGILE is the time duration from the time the GNN
model (defined using PyG [36] API) and the input graph are provided, to the time the input graph
is processed and the instruction sequence is generated by the compiler. For the design automation
frameworks [137, 73], TLoC includes hardware meta compilation, hardware synthesis, Place&Route,
and FPGA reconfiguration.
• Latency of hardware execution (LoH) TLoH: Latency of hardware execution is the latency of executing
the binary code on the hardware accelerator. Before runtime, the GNN model, processed input graph,
and binary file are already stored in the FPGA DDR.
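For concreteness, the latency breakdown above can be written as a small Python helper (the function name and the example numbers are hypothetical, not measured results; the default PCIe bandwidth follows the 31.5 GB/s figure used in Section 3.6.1):

def end_to_end_latency_sketch(t_loc_ms, t_loh_ms, data_volume_gb, pcie_bw_gbps=31.5):
    """T_comm = total data volume / sustained PCIe bandwidth; T_E2E = T_LoC + T_comm + T_LoH."""
    t_comm_ms = data_volume_gb / pcie_bw_gbps * 1000.0
    return t_loc_ms + t_comm_ms + t_loh_ms

# Example with hypothetical numbers (milliseconds and gigabytes):
# end_to_end_latency_sketch(t_loc_ms=2.6, t_loh_ms=11.5, data_volume_gb=0.18)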
3.6.3 Execution Time and Size of Binary File
Execution time: Table 3.6 shows the measured latency of GraphAGILE. We observe that the software
compilation time ranges from 2 ms to 300 ms, which is proportional to the size of the input graph. The
reason is that data partitioning is the most time-consuming operation, with complexity O(|V| + |E|). In contrast, design automation frameworks (e.g., DeepBurning-GL [73]) incur hours of overhead to perform hardware synthesis and Place&Route. Thus, the proposed software compiler is fast and lightweight.
Size of binary file: Table 3.7 shows the size of the generated binary files. Compared with the sizes of input
graphs or the inter-layer intermediate results, the size of binary files is negligible. Therefore, loading the
binary files from the FPGA external DDR memory to the on-chip scheduler results in a small amount of
memory traffic. The size of the binary files is small because the high-level instructions are compact and expressive; for example, a single high-level instruction (128 bits) can define the computation task of a large data partition (up to 16384 vertices).
3.6.4 Impact of Compiler Optimizations
Figure 3.11: Impact of computation order optimization on the latency of hardware execution (LoH) TLoH

Figure 3.12: Impact of layer fusion on the latency of hardware execution (LoH) TLoH

Figure 3.13: Impact of computation and communication overlapping on the latency of hardware execution (LoH) TLoH
To show the effectiveness of the proposed optimizations, we compare TLoH with and without the compiler optimizations. Figure 3.11, Figure 3.12, and Figure 3.13 show the impact of (1) computation order optimization, (2) layer fusion, and (3) overlapping the computation and data communication (in task scheduling), respectively.
Table 3.6: End-to-End latency, latency of compilation, and latency of hardware execution
Model Latency (ms) CI CO PU FL RE YE AP
b1 TE2E 2.129 0.808 2.126 9.97 128.3 62.9 442.0
b1 TLoC 0.249 0.215 0.574 2.68 51.1 18.8 263.8
b1 TLoH 0.320 0.103 0.272 1.28 15.6 11.6 37.4
b2 TE2E 4.364 1.535 4.28 20.1 208.5 155.1 718.1
b2 TLoC 0.254 0.226 0.66 2.6 49.7 18.3 261.4
b2 TLoH 2.550 0.819 2.34 11.5 97.2 104.3 315.9
b3 TE2E 4.355 1.574 4.25 21.19 212.7 134.3 657.4
b3 TLoC 0.235 0.258 0.59 2.58 49.1 19.2 272.2
b3 TLoH 2.560 0.826 2.38 12.60 102.0 82.6 244.4
b4 TE2E 6.912 2.387 6.919 33.88 315.0 278.2 905.2
b4 TLoC 0.212 0.237 0.599 2.47 50.1 21.3 270.3
b4 TLoH 5.140 1.660 5.040 25.40 203.3 224.4 494.1
b5 TE2E 14.99 9.23 15.64 91.73 527.6 901.6 1415.5
b5 TLoC 0.24 0.23 0.56 2.52 50.9 30.1 300.3
b5 TLoH 13.10 8.51 13.80 83.20 415.1 839.0 974.4
b6 TE2E 3.139 1.201 3.24 17.69 219.2 123.1 680.9
b6 TLoC 0.249 0.258 0.58 2.69 50.0 18.7 270.9
b6 TLoH 1.330 0.453 1.38 8.99 107.6 71.9 269.2
b7 TE2E 2.252 0.826 2.285 11.32 368.8 72.1 601.8
b7 TLoC 0.223 0.235 0.594 2.63 53.8 17.5 261.4
b7 TLoH 0.469 0.101 0.411 2.68 253.4 22.1 199.6
b8 TE2E 7.98 3.25 13.79 67.65 537.8 548.2 1749.3
b8 TLoC 0.23 0.24 0.61 2.74 52.2 28.7 283.5
b8 TLoH 6.19 2.52 11.90 58.90 424.0 487.0 1325.0
Computation order optimization: Computation order optimization leads to 82%, 9.6%, 9.9%, 6.3%,
1.3%, 121%, 260%, 0% average speedup on b1-b8, respectively. The computation order optimization can
reduce both the computation complexity and external memory traffic of the involved Aggregate layers.
The computation order optimization has no effect on model b8, because model b8 uses a preprocessing
MLP layer to transform the feature vectors to a uniform length, which eliminates the opportunities for
computation order optimization. Note that the computation order optimization itself has a small overhead (≈ 0.5 µs average latency) during the software compilation.

Table 3.7: Size (MB) of the generated binary files [Rows 1-8] and the size (MB) of input graphs [Row 9]
Dataset CI CO PU FL RE YE AP
b1 0.136 0.053 0.193 0.194 0.228 0.161 0.246
b2 0.141 0.057 0.234 0.270 0.234 0.218 0.369
b3 0.210 0.084 0.340 0.393 0.340 0.310 0.518
b4 0.217 0.093 0.421 0.421 0.427 0.423 0.764
b5 0.297 0.131 0.632 0.633 0.703 0.661 1.231
b6 0.145 0.060 0.263 0.299 0.264 0.258 0.457
b7 0.204 0.079 0.281 0.281 0.334 0.230 0.342
b8 0.101 0.059 0.422 0.422 0.439 0.528 1.098
Input graph 47 12.6 38 181 1863 900 4223
Layer fusion: Layer fusion leads to 8.1%, 6.0%, 5.5%, 5.2%, 7.3%, 7.4%, 4.7%, 8.2% average speedup on
b1-b8, respectively. The performance improvement is because the individual Activation layers and BatchNorm layers are eliminated (See Section 3.4). Thus, extra memory traffic of the Activation and BatchNorm
layers is eliminated to reduce the latency of hardware execution. Note that layer fusion has complexity
O(L) and incurs small overhead (≈ 0.66µs average latency) during the software compilation.
Overlapping computation and communication: Overlapping the computation and communication
leads to 186%, 134%, 153%, 137%, 112%, 148%, 158%, 123% average speedup on b1-b8, respectively. It
demonstrates the effectiveness of proposed double/triple buffering techniques and the effectiveness of the
software compilation optimizations.
3.6.5 Cross Platform Comparison
We compare TE2E on three platforms: (1) the CPU-only platform, (2) CPU (Ryzen 3990x) + GPU, and (3) CPU (Ryzen 3990x) + GraphAGILE. On the CPU-only platform, we execute the CPU version of PyTorch Geometric (PyG) and Deep Graph Library (DGL), with Intel MKL as the backend. On the CPU-GPU platform, we execute the GPU version of PyG and DGL, with CUDA 11.3 as the backend. The E2E latency of the CPU-only and CPU-GPU platforms includes the preprocessing overhead of the runtime systems (e.g., GPU kernel launch). Figures 3.14 and 3.15 show the comparison. Compared with PyG-CPU, GraphAGILE achieves 10.3×–47.1× speedup on b1-b8. Compared with PyG-GPU, GraphAGILE achieves 1.27×–3.8× speedup on b1-b8. Compared with DGL-CPU, GraphAGILE achieves 9.1×–20.1× speedup on b1-b7. Compared with DGL-GPU, GraphAGILE achieves 1.7×–3.9× speedup on b1-b7.
The speedup over CPU-only and CPU-GPU platforms is due to: (1) The kernels in GNN (e.g., SpDMM,
SDDMM) have irregular computation&memory access patterns and low data reuse. GraphAGILE hardware
architecture optimizes the data path and memory organization for various GNN computation kernels. The
processors in CPU or GPU have limited cache sizes (e.g., 32KB L1 cache and 512KB L2 cache). The data
exchange (due to low data reuse) among the L1, L2, and L3 caches becomes the performance bottleneck and results in reduced sustained performance. On CPU platforms, loading data from the L3 cache incurs a latency of 32 ns, and loading data from the L2 cache incurs a latency of 5-12 ns. Compared with the CPU-only/CPU-GPU platforms, the ACK in GraphAGILE can access data in one clock cycle from the on-chip edge/weight/feature
buffers. Therefore, although the baseline CPU-only and CPU-GPU platforms have higher (6×) peak performance than the state-of-the-art FPGAs, GraphAGILE still outperforms the baselines. (2) The compiler
of GraphAGILE automatically performs various optimizations to minimize execution time. While the computation order optimization and layer fusion can potentially be applied to CPU-only and CPU-GPU platforms, other compiler optimizations (such as data partitioning for partition-centric execution schemes, task
scheduling for dynamic load balancing) are specific to the proposed overlay architecture. For example, data
partitioning relies on an effective and customized memory organization. The hardware architecture and
the compiler of GraphAGILE perform synergistically to achieve lower latency.
Figure 3.14: Comparison of end-to-end latency TE2E with DGL

Figure 3.15: Comparison of end-to-end latency TE2E with PyG. Note that PyG-CPU cannot execute AP due to out of memory. PyG-GPU cannot execute RE, YE, and AP due to out of memory. Therefore, these results are not shown in the figure.
3.6.6 Comparison with the State-of-the-art Accelerators
Table 3.8: Advantages of GraphAGILE over the state-of-the-art work
GAT | NHC ⋆ † | Preprocessing | UFH ‡ | GEMM | SDDMM
HyGCN [127] | No (✗) | No (✓) | graph partitioning, sparsity elimination | No (✗) | Yes (✓) | No (✗)
AWB-GCN [45] | No (✗) | No (✓) | graph partitioning, data layout transformation | Yes (✓) | No (✗) | No (✗)
DeepBurning-GL [73] | No (✗) | Yes (6-8 hours) (✗) | (Unknown) | No (✗) | Yes (✓) | No (✗)
BoostGCN [137] | No (✗) | Yes (6-8 hours) (✗) | graph partitioning | No (✗) | Yes (✓) | No (✗)
GraphAGILE | Yes (✓) | No (✓) | software compilation | Yes (✓) | Yes (✓) | Yes (✓)
We compare with state-of-the-art accelerators: HyGCN [127], AWB-GCN [45], DeepBurning-GL [73], and BoostGCN [137].
Advantages of GraphAGILE: Table 3.8 summarizes the performance comparison. HyGCN [127] and
AWB-GCN [45] use fixed hardware designs that only support limited GNN models. For example, they cannot execute GAT due to the lack of support for SDDMM. Moreover, they use additional data-dependent
optimizations, such as sparsity elimination (HyGCN). These optimizations can reduce the latency of hardware execution TLoH at the cost of increased end-to-end latency due to the expensive preprocessing. Design
automation frameworks such as DeepBurning-GL [73] and BoostGCN [137] need to pay hours of overhead
to regenerate the FPGA bitstream for every pair of GNN model and input graph. Therefore, they have very large end-to-end latency. HyGCN, DeepBurning-GL, and BoostGCN are hybrid architectures that initialize different hardware modules for various computation kernels. However, hybrid architectures suffer from load imbalance and thus hardware under-utilization. AWB-GCN uses the same set of processing elements to execute SpDMM under various data sparsity. It is not efficient for GEMM and does not support SDDMM. For dense input graphs (e.g., Amazon-Products) or GNN models with the PReLU or SWISH activation functions, GEMM support is essential.

Table 3.9: Comparison of TLoH
Model Dataset Approach TLoH (ms) Speedup
b2 FL BoostGCN [137] 20.1 1.75×
b2 FL GraphAGILE 11.5 1×
b2 RE BoostGCN [137] 98.1 1.01×
b2 RE HyGCN [127] 289 2.97×
b2 RE AWB-GCN [45] 49.7 0.51×
b2 RE GraphAGILE 97.2 1×
b2 YE BoostGCN [137] 193 1.85×
b2 YE GraphAGILE 104.3 1×
b2 AP BoostGCN [137] 793.5 2.51×
b2 AP GraphAGILE 315.9 1×
Comparison of latency of hardware execution TLoH: Since no previous work measures the end-to-end latency, the overhead of their graph preprocessing (Table 3.8) is unknown. Therefore, we are only able to compare the latency of hardware execution TLoH, as shown in Table 3.9. Table 3.3 shows the detailed specifications of the various platforms. Compared with BoostGCN, GraphAGILE achieves 1.01×–2.51× speedup on FL, RE, YE, and AP under comparable peak performance and memory bandwidth. Compared with HyGCN, GraphAGILE achieves 2.97× speedup on RE. GraphAGILE achieves higher performance because BoostGCN and HyGCN are hybrid accelerators that suffer from load imbalance.
AWB-GCN is 1.96× faster than GraphAGILE on RE because (1) the platform of AWB-GCN has 2.2× the peak performance of GraphAGILE, and (2) AWB-GCN exploits the sparsity of vertex features to reduce the
total computation complexity. However, the sparsity exploitation in AWB-GCN requires a runtime system
to obtain the sparsity of the intermediate results and dynamically perform data format transformation and
kernel remapping. Therefore, the runtime optimizations of AWB-GCN are orthogonal to our static compiler optimizations. For an overlay accelerator, it is challenging to exploit the data sparsity because both
data format and high-level instructions need to be generated/changed dynamically at runtime. We leave
the dynamic data sparsity optimizations in the runtime system as future work.
63
Chapter 4
Dynasparse: Accelerating GNN Inference through Dynamic Sparsity
Exploitation
This chapter introduces Dynasparse, the proposed comprehensive hardware-software codesign on FPGA
to accelerate GNN inference through dynamic sparsity exploitation. For this, we decouple the GNN computation kernels from the basic computation primitives, and explore hardware-software codesign as follows:
1) Hardware design: We propose a novel unified accelerator design on FPGA to efficiently execute various
computation primitives. We develop a customized soft processor that is tightly coupled with the accelerator to execute a runtime system. Moreover, we develop efficient hardware mechanisms to profile the data
sparsity and perform on-the-fly data format transformation to prepare the input data for various computation primitives; 2) Software design: We develop a runtime system that works synergistically with the
accelerator to perform dynamic kernel-to-primitive mapping based on data sparsity.
4.1 Data Sparsity in GNN inference
The density of a matrix is defined as the total number of non-zero elements divided by the total number
of elements. Note that the sparsity is given by (1 − density). The computation kernels in GNNs involve three types of matrices: the graph adjacency matrix A, the vertex feature matrix H, and the weight matrix W. The adjacency matrix A of different graph datasets [49] can have different densities. For a given adjacency matrix, different parts of the matrix have different densities. Figure 4.2 shows the densities of the feature matrices in GCN [66]. For different graphs, the input feature matrices have different densities. The feature matrices of different layers also have different densities. For the weight matrices, prior works ([96, 11]) have proposed various pruning techniques to reduce the density of the weight matrices.
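The density definition maps directly to a short computation; a minimal sketch (assuming NumPy/SciPy matrix representations, which the on-chip profiler of course does not rely on):

import numpy as np
from scipy import sparse

def density(matrix) -> float:
    """Density = (# of non-zero elements) / (total # of elements); sparsity = 1 - density."""
    rows, cols = matrix.shape
    nnz = matrix.nnz if sparse.issparse(matrix) else int(np.count_nonzero(matrix))
    return nnz / (rows * cols)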
Figure 4.1: Density and the visualization of graph adjacency matrix A of various graphs [49]

Figure 4.2: Density of the feature matrices in the GCN model [66]
4.2 GNN Acceleration Based on Data Sparsity
Although there is substantial data sparsity in GNNs, no prior work has systematically studied exploiting the data sparsity for GNN inference acceleration. HyGCN [127] and BoostGCN [137] map Aggregate() to SpDMM and map Update() to GEMM, ignoring the data sparsity in the feature matrices and weight matrices. AWB-GCN [45] maps both Aggregate() and Update() to SpDMM and proposes an accelerator to efficiently execute SpDMM. However, it does not exploit the data sparsity in the weight matrices. DeepBurning-GL [73] is a design automation framework that generates an optimized hardware accelerator given the information of the input graph and the GNN model. However, the framework needs to regenerate the optimized accelerator if the sparsity of the data changes. To summarize, prior GNN accelerators do not fully exploit the data sparsity in GNNs, or are not flexible enough to exploit data sparsity in GNN inference.
4.3 Overview
4.3.1 Problem Definition
The computation kernels in GNN inference are feature aggregation and feature transformation which
correspond to Aggregate() and Update() in the message-passing paradigm of GNN.
• Aggregate(): The input is graph adjacency matrix A and feature matrix Hin. The output is Hout =
A × Hin.
• Update(): The input is vertex feature matrix Hin and weight matrix W. The output is Hout =
Hin × W.
The computation primitives are GEMM, SpDMM, and SPMM. While all the primitives perform the multiplication of two input matrices to produce an output matrix, they differ in how they deal with zero elements: (1) GEMM views the two input matrices as dense matrices and performs multiply-accumulate for all the matrix elements, regardless of whether an element is non-zero. (2) SpDMM views one input matrix as a sparse matrix and skips the computation operations for all the zero elements in this input matrix. (3) SPMM takes two sparse input matrices and skips the computation operations for all the zero elements in both input matrices.
This work targets full-graph inference: given a GNN model and an input graph, we perform the
message-passing paradigm in the full input graph to obtain the embeddings of all the vertices. Full-graph
inference has been widely studied in the literature [127, 45, 137]. Our objective is to exploit the data sparsity of GNN kernels to further accelerate the inference process. We assume that the sparsity of the data is unknown before the accelerator design or hardware execution. Our intent is to develop a single hardware-software codesign on FPGA that is efficient and flexible enough to support various graphs and GNN models of
various data sparsity. Therefore, the proposed work does not require regenerating the FPGA accelerator
if data sparsity changes.
4.3.2 System Overview
Figure 4.3: Overview of the proposed system
Figure 4.4: Proposed workflow
Figure 4.3 depicts the proposed system design. The software comprises a compiler and a runtime system.
The hardware system has three components:
• Host processor: The compiler is executed on the host processor to perform compilation (preprocessing) for the input GNN model and the input graph to generate the intermediate representation
(IR). The IR is sent to the soft processor for execution.
• Soft Processor on FPGA: The runtime system is executed on the soft processor. It takes the IR
as input, and dynamically schedules the computation tasks on the accelerator by sending control
signals to the accelerator.
• Accelerator on FPGA: It executes the three computation primitives (GEMM, SpDMM, SPMM),
profiles data sparsity, and performs data layout/format transformation. It receives the control signals
from the soft processor to execute the computation tasks and also sends the data sparsity information
to the soft processor at runtime.
The workflow is illustrated in Figure 4.4. The execution of GNN inference consists of two steps:
Step 1. Compilation/Preprocessing: The compiler performs the following preprocessing. (1) Generating the intermediate representation (IR): it takes the specification of the user-defined GNN model and the graph metadata as input, and generates the IR of the GNN computation graph (see Figure 4.4). (2) Data partitioning: the compiler performs data partitioning for each kernel. Data partitioning is required since (a) in real-world applications, the input graph can be very large while the FPGA accelerator has limited on-chip memory, and (b) within a matrix, different parts of the matrix can have different data sparsity. Data partitioning enables fine-grained kernel-to-primitive mapping, leading to more efficient sparsity exploitation. (3) Preprocessing of data sparsity: while the compiler performs data partitioning, it uses counters to profile the sparsity information of the graph adjacency matrix A, the weight matrix W, and the input feature matrix H0. Note that the sparsity information of the feature matrices in the intermediate layers {H1, ..., HL} is unknown at compile time and is profiled by the accelerator at runtime.
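The user-facing input to this step is an ordinary PyTorch Geometric model; a lightly cleaned version of the example embedded in Figure 4.4 is reproduced below (the hidden sizes and the Cora dataset are just the example values from that figure):

import torch
from torch import Tensor
from torch_geometric.nn import GCNConv
from torch_geometric.datasets import Planetoid

class GCN(torch.nn.Module):
    def __init__(self, in_ch: int, hidden_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = GCNConv(in_ch, hidden_ch)
        self.conv2 = GCNConv(hidden_ch, out_ch)

    def forward(self, x: Tensor, edge_index: Tensor) -> Tensor:
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index)
        return x

# Example values taken from Figure 4.4
user_model = GCN(128, 16, 16)
dataset = Planetoid(root='.', name='Cora')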
Step 2. Runtime execution: At runtime, the soft processor and the accelerator collaborate to perform
GNN inference. The runtime system on the soft processor consists of an Analyzer and a Scheduler. The
accelerator contains multiple Computation Cores. The Analyzer takes the optimized IR from the compiler
and the data sparsity information from the compiler and the accelerator to dynamically map a kernel to a
primitive based on a performance model. Then, the Scheduler schedules the execution of the primitives on
the accelerator (Section 4.6.3). The runtime system performs dynamic kernel-to-primitive (K2P) mapping.
Note that the mapping must be performed dynamically at runtime: (1) the densities of the feature matrices in the intermediate layers {H1, ..., HL} are unknown before runtime; (2) the Computation Core has various execution modes (Section 4.5.2), with each mode executing a specific primitive. These execution modes
have different computation efficiency (See Section 4.6.1) with respect to the density of data. As a result,
for a computation kernel of high density, executing it using GEMM primitive on the Computation Core
will be more efficient. For a GNN kernel of low density, executing it using SpDMM or SPMM primitive on
the Computation Core will be more efficient. To handle this scenario, we build an analytical performance
model (Section 4.6.2) to estimate the execution latency of a given primitive on the Computation Core with
respect to the data sparsity.
4.4 Compiler
4.4.1 Intermediate Representation (IR)
We define the meta data in the IR in Table 4.1, including the meta data of the kernel and the meta data
of the execution scheme. The execution scheme of a kernel is the plan for executing the kernel. The IR
defines two types of kernels – Aggregate and Update, corresponding to Aggregate() and Update() in the
GNN abstraction.
Table 4.1: Meta data of a kernel in the IR
Layer Type Aggregate(0), Update(1)
Layer ID 1,2,3,...
Input Dimension fin
Output Dimension fout
# of vertices |V|
# of edges |E|
Aggregation operator Max, Sum, Min, Mean
Activation type ReLU, PReLU
Activation enabled True, False
Meta data of execution scheme {...} (See Algorithm 10 and 11)
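Since the IR of a kernel is implemented as a Python object (see Section 4.7), the fields of Table 4.1 can be sketched as a dataclass; the class and attribute names below are illustrative, not the exact ones used by the compiler:

from dataclasses import dataclass, field
from typing import List

@dataclass
class KernelIR:
    """Illustrative per-kernel IR; the fields mirror Table 4.1."""
    layer_type: int            # Aggregate(0) or Update(1)
    layer_id: int
    f_in: int                  # input feature dimension
    f_out: int                 # output feature dimension
    num_vertices: int
    num_edges: int
    agg_operator: str = "Sum"  # Max, Sum, Min, Mean
    activation: str = "ReLU"   # ReLU, PReLU
    activation_enabled: bool = True
    execution_scheme: dict = field(default_factory=dict)  # see Algorithms 10 and 11
    deps: List["KernelIR"] = field(default_factory=list)  # edges of the computation graph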
Figure 4.5: Illustration of data and model partitioning
4.4.2 Compilation Process
The compilation process has two steps (See Figure 4.3):
• Step 1 (parsing the input): The compiler takes the specification of the GNN model (Defined using
PyTorch Geometric Library [49]) and the graph metadata as input, and generates the computation
graph for GNN inference (see the example in Figure 4.4). The computation graph has Σ_{l=1}^{L} k_l nodes, where L denotes the number of GNN layers in the GNN model and k_l denotes the number of kernels in layer l (1 ⩽ l ⩽ L). In the computation graph, each node represents the IR of a kernel. An edge
denotes the data dependency between two kernels.
• Step 2 (data partitioning and execution scheme generation): The compiler performs data partitioning for each kernel and generates the execution scheme for the kernel. Then, the metadata of
the execution scheme is stored in the IR to produce the optimized IR (See Figure 4.4) that is sent to
the runtime system.
4.4.3 Data Partitioning
Figure 4.5 depicts the proposed data partition scheme. The graph adjacency matrix A has the dimension
|V| × |V|. A is partitioned into blocks with each block having dimension of N1 × N1. We use Aij to
denote a block where Aij = A[i ∗ N1 : (i + 1) ∗ N1][j ∗ N1 : (j + 1) ∗ N1]. The feature matrix H of
dimension |V| × f1 is partitioned into fibers. Each fiber has dimension N1 × N2 and Hij = H[i ∗ N1 :
(i + 1) ∗ N1][j ∗ N2 : (j + 1) ∗ N2]. We further partition each fiber into subfibers where each subfiber
has size N2 × N2. Hij−k denotes the k-th subfiber of Hij. We use Hi−k to denote the concatenation of {Hi1−k, Hi2−k, ..., Hi(N1/N2)−k}. The weight matrix W is partitioned into blocks with each block having size N2 × N2. Wij = W[i ∗ N2 : (i + 1) ∗ N2][j ∗ N2 : (j + 1) ∗ N2].
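A minimal Python sketch of these index computations (helper names are illustrative; only slice bounds are computed, no data is moved):

def partition_indices(num_vertices: int, f1: int, N1: int, N2: int):
    """Sketch of the partitioning in Figure 4.5: A is cut into N1 x N1 blocks A_ij,
    H into N1 x N2 fibers H_ij, and each fiber into N2 x N2 subfibers H_ij-k."""
    def a_block(i, j):
        return (slice(i * N1, (i + 1) * N1), slice(j * N1, (j + 1) * N1))

    def h_fiber(i, j):
        return (slice(i * N1, (i + 1) * N1), slice(j * N2, (j + 1) * N2))

    def h_subfiber(i, j, k):
        rows, cols = h_fiber(i, j)
        return (slice(rows.start + k * N2, rows.start + (k + 1) * N2), cols)

    num_a_blocks = (num_vertices + N1 - 1) // N1   # blocks per dimension of A
    num_fibers = (f1 + N2 - 1) // N2               # column fibers of H
    return a_block, h_fiber, h_subfiber, num_a_blocks, num_fibers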
4.4.4 Execution Scheme
Based on the data partition scheme, the compiler generates the execution plan for each computation kernel, shown in Algorithm 10 and 11. The execution of a computation kernel is decomposed into a set of
independent computation tasks. Each task computes one output data partition, and there is no data dependency among the tasks within a kernel. A task performs a series of multiplications of data partitions to obtain its output data partition, and the computation primitive used to execute each matrix multiplication Matmul() is determined by the Runtime System. We generalize the representation of a task in
Algorithm 12.
Algorithm 10 Execution scheme of an Aggregate kernel
Input: Graph adjacency matrix A; Input feature matrix Hin;
Output: Output feature matrix Hout;
1: Execute the Aggregate kernel
2: for i = 1 to |V|/N1 do
3:   for k = 1 to f1/N2 do
4:     Initialize Hout_ik in the Result Buffer
5:     for j = 1 to |V|/N1 do
6:       Load Aij and Hin_jk
7:       Hout_ik += Matmul(Aij, Hin_jk)
8:     Write Hout_ik back to DDR memory
Algorithm 11 Execution scheme of an Update kernel
Input: Input feature matrix Hin; Weight matrix W;
Output: Output feature matrix Hout;
1: Execute the Update kernel
2: for i = 1 to |V|/N2 do
3:   for k = 1 to f2/N2 do
4:     g = ⌊(i × N2)/N1⌋, f = i % (N1/N2)
5:     Initialize Hout_gk−f in the Result Buffer
6:     for j = 1 to f1/N1 do
7:       Load Hin_gj−f and Wjk
8:       Hout_gk−f += Matmul(Hin_gj−f, Wjk)
9:     Write Hout_gk−f back to DDR memory
Algorithm 12 A computation task
Input: {Xi1, Xi2, ..., XiK} and {Y1j, Y2j, ..., YKj};
Output: Output matrix: Zij;
1: Initialize Zij in the Result Buffer
2: for k = 1 to K do
3:   Load Xik and Ykj onto the on-chip buffer
4:   Zij += Matmul(Xik, Ykj)
5: Write Zij back to DDR memory
Figure 4.6: Diagram of a Computation Core
Figure 4.7: Various execution modes of a Computation Core
4.5 Accelerator Design
In Section 4.5.1, we introduce the data layout and data format that are used by Dynasparse. In Section
4.5.2.1, we introduce the Agile Computation Module which can execute three primitives (GEMM, SpDMM,
and SPMM). In Section 4.5.2.2, we describe the hardware mechanism for sparsity profiling, and data format/layout transformation.
4.5.1 Data Format and Data Layout
Data format: We store the matrices using sparse format or dense format. We use Coordinate (COO) format
to represent a sparse matrix, where a nonzero element is represented using a three-tuple (col, row, value)
denoting the column index, row index, and value, respectively. COO format is the standard data format
used in the state-of-the-art GNN libraries [49].
Data layout: It defines the order of storing the matrix elements. For a sparse matrix in the row-major
order, the elements within the same row are stored in contiguous locations. Otherwise, it is column-major
order. Similarly, row-major and column-major order for a dense matrix can be derived.
Notations: For a matrix B, we use B[i] to denote the i-th row of B and B[i : j] to denote the submatrix of B from the i-th row to the (j − 1)-th row. We use B[i][j] to denote the element of B at the i-th row and the j-th column. An element (j, i, value) in a sparse B will also be denoted as B[i][j] = value.
4.5.2 Microarchitecture
Each Computation Core (Figure 4.6) has an Agile Computation Module (ACM) and an Auxiliary Hardware
Module (AHM). The ACM has an ALU (Arithmetic Logic Unit) array of dimension psys × psys, and the interconnections among the ALUs are shown in Figure 4.7. The AHM performs sparsity profiling, data layout
and format transformation (Section 4.5.2.2).
4.5.2.1 Agile Computation Module (ACM)
It has four data buffers – BufferU, BufferO, BufferP and Result Buffer (RB). Buffer[U/O/P] store the input
matrices and RB stores the output matrix. Each Buffer has psys memory banks (denoted bank 0 to bank
psys −1) for parallel on-chip memory access. Each ALU can execute various arithmetic operations, including multiplication, max, addition, etc. There are two interconnection networks – Index Shuffle Network
(ISN) and Data Shuffle Network (DSN) – for data communication. The ACM has three execution modes –
GEMM mode, SpDMM mode and SPMM mode. The required data format and layout for various execution
modes are summarized in Table 4.2.
GEMM Mode: The ALU array is organized as a two-dimensional systolic array (see Figure 4.7) to execute GEMM using an output-stationary dataflow. The systolic array can execute psys^2 multiply-accumulate (MAC) operations per clock cycle.
Table 4.2: Buffer (data format) [data layout] requirement to store the input/output matrices for executing Z = X × Y in the three execution modes
Mode | X | Y | Z
GEMM | BufferO (dense) [row major] | BufferP (dense) [column major] | Result Buffer (dense) [row major]
SpDMM | BufferU (sparse) [row or column major] | BufferO (dense) [row major] | Result Buffer (dense) [row major]
SPMM | BufferU (sparse) [row major] | BufferO (sparse) [row major] | Result Buffer (dense) [row major]
Algorithm 13 SpDMM using Scatter-Gather Paradigm
Input: Sparse matrix (BufferU): X; Dense matrix (BufferO): Y;
Output: Output matrix (Result Buffer): Z (Z = X × Y);
1: while not done do
2: for each e(i, j, value) in X Parallel do ▷ Scatter Phase
3: Fetch Y[i] from BufferO ▷ ISN routes e to BufferO
4: Form input pair (Y[i], e)
5: # DSN routes input pair to Update Units
6: for each input pair Parallel do ▷ Gather Phase
7: u ←Update(Y[i], e.value) ▷ Update Unit
8: Fetch Z[j] from Result Buffer
9: Z[j] ← Reduce(u) ▷ Reduce Unit
SpDMM Mode: The ALU array is divided into psys/2 Update Units and psys/2 Reduce Units. Each Update
or Reduce Unit has an ALU array of size psys/2 × 2. Multiplication of a sparse matrix with a dense matrix
is executed using the Scatter-Gather Paradigm shown in Algorithm 13. The sparse matrix denoted as X (in
BufferU) is stored in row-major order using COO format. The dense matrix denoted as Y (in BufferO) is
stored in row-major order using dense format, and Y[i] is stored in bank (i mod psys) of BufferO. Each
non-zero element e(i, j, value) in X is fetched from BufferU (psys/2 elements can be fetched from
BufferU per cycle) and sent to the ISN. Then e is routed to bank (i mod psys) for fetching Y[i], which
forms the input data pair (Y[i], e). The input pair is routed to the (j mod psys/2)th Update Unit. The
Update Unit performs the multiplication of e.value and Y[i] to produce the intermediate result u. Then
the corresponding Reduce Unit adds u to Z[j]. SpDMM Mode can efficiently skip zero elements in the
sparse matrix X. The SpDMM Mode can execute psys^2/2 MAC operations per clock cycle.
Algorithm 14 SPMM using Row-wise Product with Scatter-Gather Paradigm
Input: Sparse matrix (BufferU): X; Sparse matrix (BufferO): Y;
Output: Output matrix (In Result Buffer): Z = X × Y;
1: for each row Z[j] in Z Parallel do
2: Assign the workload of Z[j] to SCP[j%psys]
3: load Z[j] to the Sparse Data Queue from Results Buffer
4: for each e(i, j, value) in X[j] do ▷ Scatter Phase
5: Fetch Y[i] from BufferO ▷ ISN routes e to BufferO
6: Form input pair (Y[i], e) ▷ DSN routes input to SCPs
7: for each input pair (Y[i], e) do ▷ Gather Phase
8: for each non-zero Y[i][k] in Y[i] do ▷ SCP
9: Produce u ← Update(e.value × Y[i][k])
10: Merge Z[j][k] ← Reduce(u)
11: Store Z[j] to the Result Buffer ▷ Obtain Z[j]
SPMM Mode: The ALU array is organized as psys parallel Sparse Computation Pipelines (SCP) as shown
in Figure 4.7. Each SCP has two ALUs to perform multiplication of two non-zero elements and the merging
of intermediate results. Each SCP also has a Sparse Data Queue (SQ) to store the intermediate results in
sparse format. The multiplication of two input sparse matrices is executed using the Row-wise Product with Scatter-Gather paradigm, as shown in Algorithm 14. For Row-wise Product, a row Z[j] of the output matrix Z is calculated through:

Z[j] = Σ_i X[j][i] ∗ Y[i]    (4.1)
For calculating the output matrix Z, a SCP is assigned the workload of a row of output matrix (Equation
4.1). psys SCPs can calculate psys output rows in parallel until all the rows of the output matrices are
calculated. To efficiently execute Row-wise Product, all input sparse matrices (X, Y) and output matrix
are stored using COO format in row-major order (See Section 4.5.1). Using SPMM Mode, we can skip
the zero elements in both the input matrices. SPMM Mode can execute psys multiply-accumulate (MAC)
operations per clock cycle.
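A software reference for this row-wise product (a minimal sketch assuming the COO data has already been grouped by row into Python dictionaries; names are illustrative):

from collections import defaultdict

def spmm_rowwise_sketch(X_rows, Y_rows):
    """Software model of the SPMM mode (Algorithm 14 / Equation 4.1): Z[j] = sum_i X[j][i] * Y[i].
    X_rows and Y_rows map a row index to a dict {column: value}; zero entries never appear."""
    Z_rows = {}
    for j, x_row in X_rows.items():                     # each output row assigned to one SCP
        acc = defaultdict(float)                        # sparse accumulator (Sparse Data Queue)
        for i, x_val in x_row.items():                  # non-zeros of X[j]
            for k, y_val in Y_rows.get(i, {}).items():  # non-zeros of Y[i]
                acc[k] += x_val * y_val                 # Update, then Reduce (merge)
        Z_rows[j] = dict(acc)
    return Z_rows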
Mode switching: The execution mode is set by the control bits of the hardware multiplexers in ACM.
The overhead of switching execution modes is just one clock cycle.
Trade-off: The three execution modes have different ways of dealing with non-zero elements in the two
input matrices (Section 4.3.1). Therefore, their execution time of multiplying two input matrices depends
on the data sparsity. We analyze the trade-off of the three execution modes w.r.t. data sparsity in Section
4.6.1.
4.5.2.2 Auxiliary Hardware Module (AHM)
While the ACM can execute various primitives, the data format and layout should meet the requirement
of the execution modes (Table 4.2). Moreover, the soft processor needs the data sparsity information at
runtime for dynamic K2P mapping. To this end, the AHM has the following hardware modules: (1) a
Layout Transformation Unit and a Layout Merger to transform the data layout, (2) a Sparsity Profiler (SP)
to obtain the density of the intermediate results, (3) Format Transformation Module (FTM), which contains
a Sparse-to-Dense Module and a Dense-to-Sparse Module.
Layout Transformation Unit (LTU): Transformation of the data layout between row-major order and
column-major order is transposing a matrix. LTU is implemented using a streaming permutation network
[10] (See [10] for details) for efficient layout transformation. Since most of the on-chip data are stored using
row-major order, we store all the data partitions of (A, H, W) in the external memory using row-major
order to minimize the effort for data layout transformation.
Layout Merger: When the accelerator executes a task (see Algorithm 12), the result Z can be in row-major or column-major order. Therefore, in the Result Buffer, we store two partial results of Z in row-major and column-major order, respectively. The two partial results of Z are merged by the Layout Merger into row-major order when Z is sent back to the external memory. Note that the LTU is also used by BufferO to transform the data layout for X2^⊺ (the column-major order of X2).
Sparsity Profiler: To profile the density of sparse matrix or dense matrix, we use the adder tree based
design for the Sparsity Profiler. At the output port of the Result Buffer, we implement a comparator array
with an adder tree to count the total number of non-zero elements. After obtaining the data sparsity of
the current output matrix, the sparsity information is sent to the soft processor.
Dense-to-Sparse (D2S) Module: It transforms an array from dense format to sparse format. Suppose the
D2S Module can read n elements per clock cycle. Then, the D2S Module has log(n) pipeline stages. For an
n-element array, we use the value of Prefix-Sum to indicate the number of zeros before an element in this
array. An example is shown in Figure 4.8. In Stage i (1 ⩽ i ⩽ log(n)), an array element is shifted left by 2^(i−1) positions if the (i − 1)-th bit of its Prefix-Sum value is equal to 1. The throughput of the D2S Module is n
elements per cycle. For example, a DDR4 channel of the FPGA board can output 16 32-bit data per cycle.
A D2S Module of n = 16 is sufficient to match the data rate of a DDR4 channel. The architecture of S2D
is similar to D2S, but in the reverse direction.
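A sequential software model of the D2S compaction (the hardware uses the prefix sum of zero counts and log(n) shift stages, but a simple scan produces the same output; names are illustrative):

def dense_to_sparse_sketch(values, base_col=0):
    """Compact an n-element dense slice into (column, value) pairs (COO-style)."""
    zeros_before = 0
    out = []
    for offset, v in enumerate(values):
        if v == 0:
            zeros_before += 1        # prefix sum of zeros = left-shift amount in hardware
        else:
            out.append((base_col + offset, v))
    return out

# Example from Figure 4.8 (columns indexed from 1):
# dense_to_sparse_sketch([7, 8, 0, 6, 0, 0, 1], base_col=1) -> [(1, 7), (2, 8), (4, 6), (7, 1)]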
Figure 4.8: Transforming dense format to sparse format
4.5.2.3 Double Buffering
We exploit double buffering technique for Buffer[U/O/P] and Results Buffer. Therefore, when the Computation Core is executing the current task, the Buffers can load the input data of the next task. The data
sparsity profiling, data layout and format transformation are streaming processes that can be executed
during the data loading/storing process. Double Buffering not only overlaps the computation and data
communication, but also hides the overhead of sparsity profiling and data layout/format transformation.
4.6 Runtime System
4.6.1 Performance Model
The performance model predicts the execution time of the primitives for a given data sparsity. For analysis,
we denote the two input matrices to a Computation Core as X ∈ R^(m×n) and Y ∈ R^(n×d), where X has density αX (0 ⩽ αX ⩽ 1) and Y has density αY (0 ⩽ αY ⩽ 1).
Table 4.3: Performance model
Metric | GEMM | SpDMM | SPMM
MACs per cycle | psys^2 | psys^2/2 | psys
Execution time (cycles) | mnd/psys^2 | 2·αmin·mnd/psys^2, where αmin = Min(αX, αY) | αX·αY·mnd/psys
In the GEMM mode, the two input matrices are viewed as dense matrices and the Computation Core can execute psys^2 MACs per cycle. Therefore, the total execution time is mnd/psys^2 cycles. In the SpDMM mode, the Computation Core can skip the zero elements in one input matrix and can execute psys^2/2 MACs per cycle. We view the input matrix with lower density as a sparse matrix and view the other input matrix as a dense matrix. Therefore, the total execution time is 2·αmin·mnd/psys^2 cycles, where αmin = Min(αX, αY). In the SPMM mode, the Computation Core can skip the zero elements in both input matrices and can execute psys MACs per cycle. Therefore, the total execution time is αX·αY·mnd/psys cycles. In state-of-the-art FPGAs such as Xilinx Alveo U250, the dimension of a Computation Core psys can be chosen to be ⩾ 8. We denote αmax = Max(αX, αY). To summarize, for executing Z = X × Y on a Computation Core, when αmin ⩾ 1/2, GEMM Mode has the least execution time; when αmin < 1/2 and αmax ⩾ 2/psys, SpDMM Mode has the least execution time; when αmin < 1/2 and αmax < 2/psys, SPMM Mode has the least execution time. The three cases are non-overlapping and cover all the points in the domain 0 ⩽ αmin ⩽ αmax ⩽ 1.
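The case analysis above can be summarized in a short Python sketch of the performance model (Table 4.3); function names and the default psys are illustrative:

def primitive_cycles(m, n, d, alpha_x, alpha_y, p_sys=16):
    """Cycle estimates from Table 4.3 for computing Z = X (m x n) times Y (n x d)."""
    a_min = min(alpha_x, alpha_y)
    return {
        "GEMM":  m * n * d / (p_sys ** 2),
        "SpDMM": 2 * a_min * m * n * d / (p_sys ** 2),
        "SPMM":  alpha_x * alpha_y * m * n * d / p_sys,
    }

def best_primitive(alpha_x, alpha_y, p_sys=16):
    """GEMM if a_min >= 1/2; else SpDMM if a_max >= 2/p_sys; else SPMM."""
    a_min, a_max = min(alpha_x, alpha_y), max(alpha_x, alpha_y)
    if a_min >= 0.5:
        return "GEMM"
    return "SpDMM" if a_max >= 2.0 / p_sys else "SPMM"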
4.6.2 Dynamic Kernel-to-primitive Mapping
The Analyzer performs dynamic kernel-to-primitive (K2P) mapping for each computation task shown
in Algorithm 15. For each pair of input matrices (Xit, Ytj ), the runtime system fetches their densities
αXit and αYtj . Then, the Analyzer determines the target primitive for multiplying Xit and Ytj , and also
determines which buffers store Xit and Ytj. The proposed dynamic K2P algorithm has computation complexity O(K) = O(|V|/N1 + f1/N2) for a computation task, which is a small overhead compared with the total computation complexity of a task, O(|V| ∗ N2 + f1 ∗ N2^2). See evaluation results in Section 4.8.3. There are several benefits: (1) the proposed dynamic K2P mapping is fine-grained, so that for different data partitions, we
can use different primitives to efficiently exploit the data sparsity in the input. (2) When the accelerator is
executing kernel l, the runtime system can perform K2P mapping for kernel l+ 1. Therefore, the overhead
of the runtime system can be hidden.
Algorithm 15 Dynamic kernel-to-primitive (K2P) mapping algorithm for a computation task
Input: {Xi1, Xi2, ..., XiK} and {Y1j, Y2j, ..., YKj};
1: for t = 1 to K do
2:   TargetPrimitive(Xit, Ytj) ← NULL
3:   The buffers to store Xit and Ytj: BXit, BYtj
4:   αmin = Min(αXit, αYtj) ▷ αXit: the density of Xit
5:   αmax = Max(αXit, αYtj) ▷ αYtj: the density of Ytj
6:   if αmin = 0 then ▷ Skip empty input matrix
7:     Skip the multiplication of Xit and Ytj
8:   if αmin ⩾ 1/2 then
9:     TargetPrimitive(Xit, Ytj) ← GEMM
10:    BXit ← BufferO and BYtj ← BufferP
11:  else
12:    if αmax ⩾ 2/psys then
13:      TargetPrimitive(Xit, Ytj) ← SpDMM
14:      B_argmin(αM) ← BufferU, (M ∈ {Xit, Ytj})
15:      B_argmax(αM) ← BufferO, (M ∈ {Xit, Ytj})
16:    else
17:      TargetPrimitive(Xit, Ytj) ← SPMM
18:      BXit ← BufferU and BYtj ← BufferO
4.6.3 Task Scheduling
The scheduler performs scheduling of computation tasks (See Section 4.4.1) on the parallel Computation
Cores as shown in Algorithm 16. The proposed task scheduling is a dynamic task scheduling strategy. Each
Computation Core maintains an interrupt interface to trigger the interrupt handling in the soft processor
when the Computation Core is idle. Then, the soft processor assigns a task to the Computation Core.
Algorithm 16 Task scheduling
Input: Intermediate Representation of the GNN model: IR; The number of computation kernels in the IR:
L;
Output: Output of the GNN model;
1: for l = 1 to L do
2: for each Task in kernel l of IR parallel do
3: if there is an idle CC: CCi then
4: Assign this Task to CCi
5: CCi executes this computation Task
6: Wait until all the Tasks in kernel l are executed
Partition size (N1, N2): The objectives of the data partitioning are to (1) enable fine-grained data sparsity
exploitation, (2) exploit data locality, and (3) maximize resource utilization during dynamic task scheduling
(Algorithm 16). Specifically, to maximize resource utilization, i.e., to keep all the Computation Cores busy,
the compiler selects the partition configuration (N1, N2) such that there will be at least η ∗ NCC (η ≥ 1)
tasks in each computation kernel assigned to NCC Computation Cores. η is a factor that is determined
empirically. Since different partitions can have different data sparsity leading to the different workloads of
the tasks, small η (e.g., η = 1) can potentially lead to long idle time for the Computation Cores with small
workloads. Therefore, we set η = 4 following state-of-the-art graph processing frameworks [69].
To meet the above three objectives, we use a heuristic approach to determine the partition size, as shown in Algorithm 17. As shown in Algorithm 10 (lines 2-3), the number of tasks of an Aggregate kernel is Ta = (|V| ∗ f1)/(N1 ∗ N2). Also, as shown in Algorithm 11 (lines 2-3), the number of tasks of an Update kernel is Tu = (|V| ∗ f2)/(N2 ∗ N2). For simplicity, we use Q to denote the workload of a kernel (e.g., Q = |V| ∗ f1 or Q = |V| ∗ f2), and use Q[k] to denote the workload of the k-th (1 ⩽ k ⩽ L) kernel. We use p() to denote the function that determines the number of tasks of a kernel based on Q, N1, and N2. For example, Ta = p(Q, N1, N2) = Q/(N1 ∗ N2) and Tu = p(Q, N2) = Q/(N2 ∗ N2). In line 9 and line 15 of Algorithm 17, the partition size of each kernel is constrained by Nit = min(N′, Nmax), where min(N′, Nmax) is the largest partition size such that Nit ⩽ N′ and Nit ⩽ Nmax. Nit ⩽ N′ ensures that there will be at least η ∗ NCC tasks per kernel for load balance, and Nit ⩽ Nmax ensures that a data partition does not exceed the size of the on-chip memory. Lines 10 and 16 then find partition sizes N1 and N2 that can be used for all the kernels.
4.7 Implementation Details
We implement the proposed accelerator on a state-of-the-art FPGA board – the Xilinx Alveo U250, which has four Super Logic Regions (SLRs) [125]. As shown in Figure 4.9, we implement two Computation Cores (CCs) in each SLR except for SLR1, because the FPGA shell (which handles the CPU-FPGA communication) and the soft processor are placed in SLR1. For each CC, psys = 16. We develop the CC using Verilog HDL and implement the soft processor using the Xilinx MicroBlaze soft IP core [86]. Each CC is connected to the soft processor through the AXI4-Stream interface [86], through which the soft processor sends control signals to the CC and the CC sends sparsity information to the soft processor.
Algorithm 17 Data partitioning algorithm
Input: On-chip memory size So; computation workload of each kernel: {Q[k] : 1 ⩽ k ⩽ L}; p(): function that determines the number of tasks of a kernel based on Q, N1, and N2; g(): function that determines the maximum partition size based on the on-chip memory size So; η: factor for load balance.
Output: Partition size N1, N2;
1: Nmax ← g(So) ▷ Maximum partition size
2: // Objective: maximize N1 and N2 to improve data locality
3: // Constraint 1 (maximize utilization): Ta, Tu ≥ η ∗ NCC
4: // Constraint 2 (memory capacity): N1, N2 ⩽ Nmax
5: ======= Step 1: determine N2 =======
6: N2 ← Nmax
7: for each Update kernel (the k-th kernel) do
8:     Choose the largest N′ such that Tu[k] = p(Q[k], N′) = η ∗ NCC
9:     Nit ← min(N′, Nmax)
10:    N2 ← min(Nit, N2)
11: ======= Step 2: determine N1 =======
12: N1 ← Nmax
13: for each Aggregate kernel (the k-th kernel) do
14:    Choose the largest N′ such that Ta[k] = p(Q[k], N′, N2) = η ∗ NCC
15:    Nit ← min(N′, Nmax)
16:    N1 ← min(Nit, N1)
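To make the heuristic concrete, the Python sketch below follows the two steps of Algorithm 17 under simplifying assumptions: p() is taken as Q/(N1 ∗ N2) for Aggregate kernels and Q/(N2 ∗ N2) for Update kernels, and the maximum partition size g(So) is passed in directly as n_max. The names and example workloads are illustrative only.

import math

ETA = 4       # load-balance factor η
NUM_CC = 7    # number of Computation Cores NCC

def choose_partition(update_workloads, aggregate_workloads, n_max):
    """Return (N1, N2) following the two-step heuristic of Algorithm 17."""
    # Step 1: determine N2 from the Update kernels (Tu = Q / N2^2 >= ETA * NUM_CC).
    n2 = n_max
    for q in update_workloads:
        n_prime = int(math.sqrt(q / (ETA * NUM_CC)))       # largest N' satisfying the constraint
        n2 = min(n2, max(1, min(n_prime, n_max)))
    # Step 2: determine N1 from the Aggregate kernels (Ta = Q / (N1 * N2) >= ETA * NUM_CC).
    n1 = n_max
    for q in aggregate_workloads:
        n_prime = q // (ETA * NUM_CC * n2)                 # largest N' satisfying the constraint
        n1 = min(n1, max(1, min(n_prime, n_max)))
    return n1, n2

# Example: one Aggregate kernel (Q = |V|*f1) and one Update kernel (Q = |V|*f2).
print(choose_partition(update_workloads=[19717 * 16], aggregate_workloads=[19717 * 500], n_max=2048))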
We develop the compiler using Python. The IR of a kernel is implemented as a Python object that stores the metadata of the kernel and its execution scheme. We develop the runtime system on the soft processor using C in the Xilinx Vitis Unified Software Platform (version 2020.1). The Index Shuffle Network and Data Shuffle Network are implemented using butterfly networks with buffering to handle routing congestion. We perform synthesis and place-and-route using Vivado 2020.1. The resource utilization is shown in Figure 4.9. The CCs run at 250 MHz.
Soft processor: Our implementation achieves 370 MHz and around 500 million instructions per second [86]. It has two caches – an Instruction Cache (I-Cache) and a Data Cache (D-Cache). The I-Cache has a size of 32 KB, which is sufficient to hold the binary code of the runtime system after a warm-up execution. The D-Cache has a size of 64 KB and stores the sparsity of the data partitions. For large graphs whose sparsity information does not fit in the D-Cache, we store the sparsity information in the external memory and prefetch it into the D-Cache.
[Figure: FPGA chip layout showing Computation Cores CC0-CC6 placed across SLR0-SLR3, with the FPGA shell and soft processor occupying SLR1]
                 LUTs     DSPs     BRAMs    URAMs
Soft Processor   5.5K     6        26       0
One CC           118K     1024     96       120
FPGA Shell       181K     13       447      0
Total            1011K    7187     1145     840
Available        1728K    12288    2688     960
Utilization      58.6%    58.4%    42.6%    87.5%
Figure 4.9: Layout (FPGA chip) and resource utilization of the proposed design on Xilinx Alveo U250. The Computation Cores (CC0-CC6) are represented using different colors.
The soft processor reads/writes data from/to the AXI4-Stream interface through the get and put instructions [86], which have a latency of one or two clock cycles.
4.8 Evaluation Results
This Section is organized as follows: In Section 4.8.1, we describe the benchmarks and baselines. In Section 4.8.2, we measure the impact of the dynamic K2P mapping strategy. In Section 4.8.3, we analyze the overhead of the compiler and the runtime system. In Section 4.8.4, we compare our work with state-of-the-art implementations.
4.8.1 Benchmarks and Baselines
Benchmarks: We evaluate Dynasparse on four widely used GNN models – GCN [66], GraphSAGE (SAGE)
[53], GIN [126], and SGC [117]. Figure 4.10 shows the IR of various GNN layers. We evaluate the design
on six widely used graph datasets – Cora (CO) [66], CiteSeer (CI) [66], PubMed (PU) [66], Flickr (FL) [136],
NELL (NE) [129], Reddit (RE) [53]. We evaluate the 2-layer GNN models used in [66, 45, 127, 137], where
the hidden dimension for CO, CI and PU is set as 16, and the hidden dimension for FL, NE and RE is set as
128.
Baselines: We compare our work with the state-of-the-art CPU (AMD Ryzen 3990x), GPU (Nvidia RTX3090), and the GNN accelerators HyGCN [127] and BoostGCN [137]. The details of the platforms are shown in Table 4.4.
[Figure: IR of a GCN layer, a GraphSAGE layer, a GIN layer, and an SGC layer, each expressed as a composition of Aggregate and Update kernels]
Figure 4.10: IR of various GNN layers
Table 4.4: Specifications of platforms
                            CPU            GPU              [127]         [137]            Dynasparse
Platform                    Ryzen 3990x    Nvidia RTX3090   ASIC          Stratix 10 GX    Alveo U250
Technology                  TSMC 7 nm      TSMC 7 nm        TSMC 12 nm    Intel 14 nm      TSMC 16 nm
Frequency                   2.90 GHz       1.7 GHz          1 GHz         250 MHz          250 MHz
Peak Performance (TFLOPS)   3.7            36               4.608         0.64             0.512
On-chip Memory              256 MB         6 MB             35.8 MB       32 MB            45 MB
Memory Bandwidth            107 GB/s       936.2 GB/s       256 GB/s      77 GB/s          77 GB/s
Table 4.5: Dataset Statistics (Dynasparse)
Dataset   Vertices   Edges       Features   Classes   Density of A   Density of H0
CI        3327       4732        3703       6         0.08%          0.85%
CO        2708       5429        1433       7         0.14%          1.27%
PU        19717      44338       500        3         0.02%          10.0%
FL        89,250     899,756     500        7         0.01%          46.4%
NE        65,755     251,550     61,278     186       0.0058%        0.01%
RE        232,965    11 × 10^7   602        41        0.21%          100.0%
Performance metric: Following the convention in [45, 127, 137], we use latency (accelerator execution latency) as the metric, which is the duration from the time when the accelerator starts to execute the optimized IR to the time when all the inference results are obtained. The preprocessing time of the compiler is not included in the latency because (1) the overhead of generating the optimized IR is usually small (see Section 4.8.3) and the optimized IR can be stored and reused if the sparsity of the input graph and GNN model changes, and (2) we follow the same convention as [45, 127, 137] for a fair comparison.
Table 4.6: Latency (ms) on the unpruned GNN models
CI CO PU FL NE RE
GCN [66]
S1 31E-1 9.6E-1 2.7E-1 10E0 83E2 9.3E1
S2 8.9E-3 5.6E-3 7.1E-3 9.9E0 5.4E0 12E1
Dynamic 7.7E-3 4.7E-3 6.3E-2 8.8E0 2.9E0 8.4E1
SO-S1 41.3× 21.5× 4.29× 1.13× 278× 1.10×
SO-S2 1.15× 1.19× 1.12× 1.11× 1.82× 1.42×
SAGE [53]
S1 74E-2 25E-2 65E-2 20E0 17E2 334E0
S2 75E-2 25E-2 69E-2 28E0 17E2 389E0
Dynamic 33E-2 11E-2 42E-2 19E0 83E1 331E0
SO-S1 1.93× 1.72× 1.56× 1.02× 2.05× 1.01×
SO-S2 1.94× 1.73× 1.65× 1.41× 2.05× 1.17×
GIN [126]
S1 4.3E-1 1.5E-1 4.1E-1 1.3E1 8.8E2 3.1E2
S2 7.4E-1 2.4E-1 6.5E-1 2.0E1 1.7E3 3.4E2
Dynamic 3.3E-1 1.1E-1 3.7E-1 1.2E1 8.3E2 2.7E2
SO-S1 1.30× 1.40× 1.11× 1.13× 1.06× 1.15×
SO-S2 2.26× 2.31× 1.76× 1.73× 2.05× 1.25×
SGC [117]
S1 5.3E-1 2.0E-1 5.5E-1 1.29E-1 9.33E2 5.7E2
S2 8.5E-1 3.0E-1 7.9E-1 2.18E-1 1.77E3 6.0E2
Dynamic 4.3E-1 1.5E-1 5.1E-1 1.27E-1 8.83E2 5.0E2
SO-S1 1.23× 1.27× 1.08× 1.02× 1.06× 1.13×
SO-S2 1.95× 1.91× 1.55× 1.72× 1.99× 1.19×
4.8.2 Impact of Dynamic K2P Mapping Strategy
To demonstrate the impact of the proposed dynamic K2P mapping strategy, we execute the following three
K2P mapping strategies on our proposed accelerator:
• Static-1 (S1): The strategy used in [127, 137], where Aggregate() is mapped to SpMM and Update() is mapped to GEMM.
• Static-2 (S2): The strategy used in [45], where both Aggregate() and Update() are mapped to SpDMM. For Aggregate(A, H), it views A as the sparse matrix and H as the dense matrix. For Update(H, W), it views H as the sparse matrix and W as the dense matrix.
• Dynamic: It is our proposed dynamic K2P mapping strategy (Algorithm 15).
We use SO-S1 to denote the speedup of Dynamic over S1. We use SO-S2 to denote the speedup of
Dynamic over S2.
Evaluation on unpruned GNN models: We evaluate the above three strategies using unpruned GNN models, where all the weight matrices have density 100%. The results are shown in Table 4.6. Compared with S1 and S2, Dynamic achieves 2.13× and 1.59× average speedup (geometric mean), respectively. Dynamic achieves limited speedup over S2 on GCN because (1) for the first Update(H0, W1) kernel of GCN, there is high data sparsity in H0 for CI, CO, PU, and NE (see Table 4.5), and (2) both Dynamic and S2 can exploit the sparsity of the feature matrix H0, while S1 does not. As the first Update(H0, W1) kernel of GCN consumes the majority of the execution time, Dynamic achieves a very large speedup over S1 on GCN. Since the weight matrices have density 100%, both Dynamic and S2 map Update(H0, W1) to SpDMM (for CI, CO, PU, and NE), leading to similar performance of Dynamic and S2 on GCN.
Evaluation on pruned GNN models: We evaluate the three strategies using pruned GNN models [96] whose weight matrices are pruned to various sparsity levels. Figures 4.11 and 4.12 show the speedup of Dynamic over S1 and S2. For the evaluation, all the weight matrices in a GNN model are pruned to the same sparsity, and the sparsity of weights in Figures 4.11 and 4.12 refers to the average sparsity of all the weight matrices in a GNN model. Table 4.7 summarizes the average (geometric mean) speedup under various sparsity levels of the weight matrices. The speedup over S1 is achieved because S1 cannot exploit the data sparsity in the feature matrices and weight matrices. The speedup over S2 is achieved because (1) when there is limited data sparsity (density < 50%) in Update(), executing Update() using the SpDMM primitive is not efficient, and (2) in Aggregate(), S2 does not exploit the data sparsity in the feature matrix H, since S2 views H as a dense matrix.
[Figure: four panels (GCN, GraphSAGE, GIN, SGC) plotting speedup versus weight sparsity (%) for CI, CO, PU, FL, NE, and RE]
Figure 4.11: Speedup of Dynamic over S1 when there are various sparsity (%) in the GNN weight matrices (X-axis)
[Figure: four panels (GCN, GraphSAGE, GIN, SGC) plotting speedup versus weight sparsity (%) for CI, CO, PU, FL, NE, and RE]
Figure 4.12: Speedup of Dynamic over S2 when there are various sparsity (%) in the GNN weight matrices (X-axis)
In conclusion, the proposed dynamic K2P mapping strategy leads to lower accelerator execution latency than the static mapping strategies. With the dynamic K2P mapping strategy, the execution latency decreases as the data sparsity increases.
4.8.3 Analysis of Compiler and Runtime System
Overhead of the compilation/preprocessing: Table 4.8 shows the overhead of the compiler on the host processor (Intel Xeon 5120). The processing time includes the overheads of generating the IR, data partitioning, and preprocessing of the data sparsity.
Table 4.7: Average speedup (geometric mean)
Sparsity of weight matrices < 50% 50% − 70% 70% − 90% > 90%
SO-S1 2.16× 4.36× 10.77× 15.96×
SO-S2 1.38× 1.64× 2.11× 5.03×
Table 4.8: Preprocessing time of the compiler (ms)
CI CO PU FL NE RE
GCN 2.5E-1 2.2E-2 5.7E-1 2.68E0 1.70E0 5.1E1
GraphSAGE 2.3E-1 2.6E-1 5.9E-1 2.58E0 1.65E0 4.9E1
GIN 2.4E-1 2.6E-3 5.8E-1 2.69E0 1.71E0 5.0E1
SGC 2.3E-1 2.4E-3 6.1E-1 2.74E0 1.73E0 5.2E1
Compared with the design automation framework [73], which needs to regenerate the FPGA accelerator if the graph or GNN model changes, the overhead of the compiler in our design is small.
Overhead of the Runtime System: We measure the overhead of the runtime system, which is the execution time of the dynamic K2P mapping on the soft processor (see Figure 4.13). On average, the runtime system takes 6.8% of the total execution time and is hidden by the task scheduling (Section 4.6.3). For the pruned GNN models, as the densities of the weight matrices decrease, the overhead of the runtime system decreases, since more empty data partitions are skipped by the runtime system (Algorithm 15).
[Figure: bar chart of the runtime system overhead as a fraction of the total execution time (up to about 0.2) for GCN, GraphSAGE, GIN, and SGC on CI, CO, PU, FL, NE, and RE]
Figure 4.13: Overhead of the runtime system on unpruned GNNs
Figure 4.14: Speedup over the CPU and GPU platforms (Some results are not shown due to out of memory
on CPU/GPU)
4.8.4 Comparison with the State-of-the-art
Comparison with CPU/GPU: We execute the state-of-the-art GNN frameworks – PyTorch Geometric (PyG, version 1.11.0) and Deep Graph Library (DGL, version 0.8.0post2) – on the CPU and GPU platforms (Table 4.4). The evaluation results are shown in Figure 4.14. We execute the same unpruned GNN models on the CPU, GPU, and Dynasparse for a fair comparison. Dynasparse achieves 306×, 16.4×, 141.9×, and 35× speedup compared with PyG-CPU, PyG-GPU, DGL-CPU, and DGL-GPU, respectively. Note that the CPU and GPU have 7.2× and 70× higher peak performance than Dynasparse. The achieved speedup is because Dynasparse can efficiently exploit the data sparsity in the graph structure, vertex features, and weight matrices. In contrast, PyG and DGL on CPU and GPU only exploit the sparsity in the graph structure. Moreover, Dynasparse exploits FPGA-specific optimizations: (1) a customized datapath with Index/Data Shuffle Networks to handle the irregular memory access patterns of GNNs, (2) customized on-chip memory management for exploiting data locality, (3) dedicated hardware modules for sparsity profiling and layout/format transformation, whose overheads are hidden by the proposed double buffering, and (4) a lightweight soft processor that interacts with the Computation Cores at extremely low latency for dynamic kernel-to-primitive mapping.
Table 4.9: Comparison of latency with the state-of-the-art GNN accelerators (using GCN model)
                 CI       CO       PU       FL      NE      RE      Peak Perf. (TFLOPS)
BoostGCN [137] 1.9E-2 2.5E-2 1.6E-1 4.0E1 N/A⋆ 1.9E2 1.35
HyGCN [127] 2.1E-2 3E-1 6.4E1 N/A N/A 2.9E2 4.6
Dynasparse 7.7E-3 4.7E-3 6.3E-2 8.8E0 2.9E0 1.0E2 0.512
⋆ N/A: not available.
Comparison with GNN accelerators: Table 4.9 compares the latency of our work with state-of-the-art GNN accelerators that do not require regenerating the accelerator when the data sparsity changes. All the accelerators execute the same unpruned GCN models and graph datasets. Dynasparse achieves 2.7× and 171× average speedup over BoostGCN and HyGCN, respectively. The platforms used in BoostGCN and HyGCN have 1.25× and 9× higher peak performance than Dynasparse. The achieved speedup is because Dynasparse can efficiently exploit the data sparsity in the vertex features. We expect to achieve higher speedup when executing the same pruned GNN models, since [137, 127] do not exploit the sparsity in weights.
Discussion of preprocessing and data communication overheads: We define the end-to-end latency as the sum of (1) the overhead of compilation/preprocessing (Section 4.8.3), (2) the overhead of CPU-FPGA data movement (moving the processed input graph, processed GNN model, and optimized IR from the host memory to the FPGA external memory), and (3) the execution latency of the accelerator. With respect to end-to-end latency, Dynasparse still achieves 56.9×, 2.37×, 16.3×, and 1.37× speedup on the unpruned GNN models compared with PyG-CPU, PyG-GPU, DGL-CPU, and DGL-GPU, respectively. The preprocessing overhead, data movement overhead, and execution latency contribute 43.1%, 27.2%, and 27.6% of the total end-to-end latency on average. The major overhead in preprocessing is data partitioning, which reorganizes the input data into data partitions. It can be reduced by multi-threading and by increasing the host memory bandwidth.
Note that the CPU-FPGA data movement overhead depends on the PCIe bandwidth. The sustained
PCIe bandwidth of the Alveo U250 FPGA board is around 11.2 GB/s, while the baseline GPU (Nvidia RTX3090) has a PCIe bandwidth of 31.5 GB/s. The overhead of CPU-FPGA data movement can be reduced
by exploiting state-of-the-art CPU-FPGA interconnection techniques (offered by FPGA vendors), such as
PCIe 5.0. Since prior GNN accelerators [137, 127] do not include their preprocessing overheads (data partitioning and CPU-FPGA data movement), in Table 4.9, we only compare the accelerator execution latency
with [137, 127] for a fair comparison.
Chapter 5
GCV-Turbo: End-to-end Acceleration of GNN-based Computer Vision
Tasks
This Chapter introduces the proposed software-hardware codesign, GCV-Turbo, for GNN-based computer vision (CV) tasks. The development of GCV-Turbo addresses challenge 4 (Section 1.3), where some GNN-based applications utilize a mixture of models. In Section 5.1, we introduce GNN-based computer vision tasks. In Section 5.2, we present an experimental study to understand the performance bottlenecks in GNN-based CV tasks. In Section 5.3, we introduce the overview of the GCV-Turbo system. In Section 5.4, we elaborate on the hardware architecture design of GCV-Turbo. In Section 5.5, we cover the compiler design of GCV-Turbo. In Section 5.6, we describe the implementation details, and in Section 5.7, we demonstrate the experimental results.
5.1 GNN-based Computer vision Tasks
Graph Neural Networks (GNNs) are playing an increasingly important role in various computer vision
(CV) tasks [8]. Figure 5.1 demonstrates several examples. These applications utilize the combined power of
convolution in CNN layers and message passing in GNN layers. This has given rise to a new domain called
GNN-based CV: CV tasks that utilize a combination of CNN and GNN layers (e.g., iteratively interleaving CNN layers and GNN layers) or rely solely on GNN layers.
[Figure: four example pipelines – few-shot image classification (CNN embedding vectors of a support set and query image fed to a GNN), multi-label image classification (CNN features combined with a knowledge database via a GNN), image segmentation (CNN layers combined with GNNs in the spatial and channel dimensions), and skeleton-based human action recognition (interleaved CNN and GNN layers)]
Figure 5.1: Examples of GNN-based CV tasks [43, 16, 142, 128]
GNN layers have gained widespread adoption in CV tasks for three reasons. (1) Firstly, GNN layers facilitate label-efficient image classification. Training a standalone CNN [54] or vision transformer (ViT) [32] typically requires a substantial number of labeled images; for instance, achieving high accuracy with ViTs requires over 300 million labeled images. In contrast, researchers have devised label-efficient few-shot learning techniques [43] that combine GNN layers and CNN layers and require only a small number of labeled images. (2) Secondly, GNN layers can naturally handle the non-Euclidean data structures in diverse CV tasks, such as point clouds [103, 93, 94] and 3D meshes [115, 106]. In contrast, the convolution of the CNN layer and the multi-head self-attention (MSA) of ViTs are designed for regular grids and cannot be directly employed on non-Euclidean data structures. For example, convolution operates on 2D grids, and MSA [32] relies on positional encodings on 2D grids. (3) Thirdly, the message passing of GNN layers excels in relation learning for various CV tasks, allowing models to understand complex object relationships. In video action
recognition, a CNN [98] detects multiple objects, while GNN layers [118] are employed to capture object
relationships.
5.2 Profiling and Analysis of GNN-based CV
Figure 5.1 shows several representative GNN-based CV tasks. We conduct an experimental study to understand the challenges of accelerating GNN-based CV: (1) In GNN-based CV tasks, both CNN and GNN layers can be computationally intensive. Moreover, the computation workloads of the CNN/GNN layers vary across tasks, ranging from 2% to 100% (Figure 5.2). Directly combining a CNN accelerator and a GNN accelerator can lead to severe hardware underutilization; for example, the GNN accelerator is idle when a CNN layer is being executed. This underutilization can increase the inference latency. (2) In GNN-based CV tasks, the CNN layer and the GNN layer have very different data layouts for input and output data. Moreover, CNN and GNN layers can be interleaved (for better feature fusion in GNN-based CV; see image segmentation and skeleton-based human action recognition in Figure 5.1). Switching the data layout (including permute(), transpose(), and other indexing functions) between the CNN layer and the GNN layer can lead to significant overhead, taking 1% − 15% of the execution time on a state-of-the-art GPU platform (Figure 5.2). This can be more severe on embedded platforms with limited memory bandwidth, since layout transformation is memory-bound. (3) General-purpose processors (CPU, GPU) struggle to achieve low-latency inference for GNNs [127, 102, 137], because GNNs have irregular data access and memory access patterns. Due to their complex cache hierarchies, CPUs and GPUs have low efficiency [127, 102, 137] when executing GNNs.
Figure 5.2: Breakdown analysis of GNN-based CV tasks on state-of-the-art GPU (RTX A5000). The details
of the models and datasets are elaborated in Section 5.6.
5.3 Overview
5.3.1 Problem Definition
Our objective is to perform end-to-end inference acceleration of GNN-based CV tasks. End-to-end acceleration refers to reducing the inference latency of a GNN-based CV task, which is the duration from when the input data is given to the time when the inference result is obtained. This includes loading the data from external memory, executing all the layers of the model on the accelerator, and storing the results back in the external memory. To this end, we propose a compiler-hardware codesign. The compilation is an offline process. The GCV-Turbo compiler takes a user-defined model (written in PyTorch [70] and PyTorch Geometric [36]) as input and generates optimized code for hardware execution. The GCV-Turbo hardware design has a fixed architecture that executes various models without reconfiguring the FPGA. This is important for many real-world systems, such as autonomous driving, which execute various models for various data modalities. We target latency-sensitive applications such as autonomous driving, where inference latency should be low to ensure safety.
5.3.2 Overview of GCV-Turbo
Figure 5.3 illustrates the overview of GCV-Turbo: (1) Compiler: It is executed on the host processor. The input parser generates the intermediate representation (computation graph) from the given input model. The computation graph, or intermediate representation, is the high-level representation of the input model, with each node representing a layer and each arrow representing a data dependency. Then, the compiler performs a five-step compilation to map the input model onto the hardware accelerator. We apply several compiler optimizations (Section 5.5.3) for GNN-based CV. Finally, the compiler generates an instruction sequence for hardware execution. (2) Application processing unit (APU): The APU of the FPGA [86] takes the instruction sequence as input and launches the inference workload on the hardware accelerator. (3) Hardware accelerator: The hardware accelerator executes the computation tasks scheduled by the APU.
Hardware design: Existing CNN or GNN accelerators suffer from inefficiency when handling GNN-based CV tasks. To tackle this challenge, we identify fundamental computation primitives (Section 5.4.1) capable of representing the computation kernels in both GNNs and CNNs. Subsequently, we design a flexible data path and memory organization for efficient execution of these computation primitives within our hardware design. This enables our accelerator to support both CNNs and GNNs. Meanwhile, our proposed accelerator incorporates an instruction set (Section 5.4.2) providing software-like programmability. Note that our hardware design employs a resource sharing strategy (Section 5.4) such that the computation kernels of CNNs and GNNs share the same set of computation resources.
Compiler design: Designing a compiler to support GNN-based CV is not merely a matter of merging separate compiler optimizations for CNNs and GNNs. Instead, it requires end-to-end optimization of the computation graph of a GNN-based CV model, because (1) a GNN-based CV task often comprises both CNN and GNN layers, and these layers can be interwoven (e.g., [128]), and (2) these two layer types exhibit different data layouts and memory access patterns. Without careful dataflow optimization, switching data layouts can lead to substantial overhead and increased memory access latency. To address this challenge, we devise a five-step compilation workflow (Section 5.5) with various compiler optimizations for GNN-based CV (Section 5.5.3).
[Figure: the compiler on the host processor takes the configurations of the input and model, builds an intermediate representation (IR) via the input parser, applies the five-step compilation (with optimizations for GNN-based CV), and emits an instruction sequence (executable file); on the hardware platform, the APU and accelerator execute the workload using external memory]
Figure 5.3: Overview of GCV-Turbo
Workflow: The workflow is illustrated in Figure 5.4. At compile time, the compiler takes the user-defined model as input and produces the intermediate representation (i.e., the computation graph). The compiler then performs a five-step compilation to generate an instruction sequence stored in a binary file. During hardware execution, the APU reads the binary file and schedules the computation tasks on the hardware accelerator.
Experimental study: We conduct a comprehensive experimental study on six representative GNN-based CV tasks (Section 5.7.1), because these tasks (1) cover various use cases and data modalities (see Table 5.3) in real-world applications, such as autonomous driving, and (2) cover various computational characteristics of GNN-based CV (see Figure 5.2), such as varying portions of CNN/GNN layers and varying patterns of layout transformation between CNN and GNN layers. Based on the evaluation of these tasks, we expect GCV-Turbo to perform similarly on a broad range of GNN-based CV tasks.
[Figure: the input model (st_gcn) is parsed into an intermediate representation (computation graph) of Conv, BatchNorm, Activation, DM, and MP layers for st_gcn blocks 1-10; the compiler emits an instruction sequence (binary file) of memory read/write, DDMM, and SpDMM instructions, which the APU and accelerator execute with the model parameters and input stored in external memory]
Figure 5.4: Workflow of GCV-Turbo using skeleton-based human action recognition [128] as an example.
5.4 Hardware Architecture
As illustrated in Figure 5.5, GCV-Turbo has a unified hardware architecture that efficiently executes the various computation primitives (Section 5.4.1) in CNNs and GNNs. The accelerator has multiple parallel processing elements (PEs), each with an Instruction Queue (IQ) and an Instruction Decoder (ID). Each PE has a computation array (CA) with pca^2 computation units, where each computation unit executes basic arithmetic operations. Each PE has a Scalar Buffer (SB), a Vector Buffer (VB), a Weight Buffer (WB), and a Result Buffer (RB). Each buffer (SB/VB/WB/RB) has pca memory banks, and each bank can output pca data per cycle. There are two all-to-all data routing networks – the Buffer-to-Buffer (B2B) and Buffer-to-Pipeline (B2P) routing networks. The Data Manipulation Module performs transformations of the data layout between different layers (e.g., a CNN layer and a GNN layer).
[Figure: the accelerator comprises processing elements PE1-PEP connected to external FPGA DDR memory and the APU; each PE contains an Instruction Queue (IQ), Instruction Decoder (ID), Scalar/Vector/Weight/Result Buffers, Buffer-to-Buffer and Buffer-to-Pipeline routing networks, a Computation Array, Activation Units, and a Data Manipulation Module (DMM); the computation array can be organized as a systolic array (DDMM), Scatter/Gather pipelines (SpDMM), adder trees (SDDMM), vector multipliers (PSVM), or vector adders (PVVA)]
Figure 5.5: Architecture of the hardware accelerator, and the basic computation primitives supported by a PE.
Resource sharing: Note that in a PE, different computation primitives share the same set of computation
units, data buffers, and routing networks. See more details in Section 5.4.1. This increases the resource
utilization for executing a GNN-based CV task. Resource sharing only requires extra wire connections and hardware multiplexers for selecting the data path for different computation primitives, which incur a small hardware cost (see Section 5.6).
5.4.1 Computation Primitives
In GNN-based CV tasks, we identify five basic computation primitives (Figure 5.5), including dense-dense
matrix multiplication (DDMM), sparse-dense matrix multiplication (SpDMM), sampled dense-dense matrix
multiplication (SDDMM), parallel scalar-vector multiplication (PSVM), and parallel vector-vector addition
(PVVA). Each layer can be mapped to these basic computation primitives. The PE has a flexible architecture
to support these computation primitives. Each PE maintains hardware multiplexers to select the data
path for executing various primitives. Switching among primitives incurs one clock cycle of overhead. For simplicity, the inputs to a computation primitive are two matrices denoted as X ∈ R^(s1×s2) and Y ∈ R^(s2×s3); the output matrix is denoted as Z ∈ R^(s1×s3).
DDMM: DDMM executes X×Y, and views X and Y as dense matrices. To this end, the computation array
is organized as a 2-D systolic array (See Figure 5.5) with localized interconnection. X and Y are stored in
VB and WB, respectively. Different from traditional 2-D systolic arrays, DDMM incorporates a B2P routing
network for shuffling the position of input vectors (rows of X), which supports data layout transformation
between CNN layer and GNN layer (See Section 5.5.3). DDMM can execute pca × pca multiply-accumulate
(MAC) operations in each clock cycle.
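For reference, the Python sketch below is a functional (cycle-agnostic) model of DDMM that reflects the tiling of the operands into pca × pca blocks of multiply-accumulate work; the systolic timing and the B2P row shuffling are not modeled, and the names are illustrative.

import numpy as np

P_CA = 16  # computation-array dimension pca

def ddmm(x, y):
    """Functional model of DDMM: dense X (s1 x s2) times dense Y (s2 x s3), processed in pca-sized tiles."""
    s1, s2 = x.shape
    s3 = y.shape[1]
    z = np.zeros((s1, s3))
    for i in range(0, s1, P_CA):
        for j in range(0, s3, P_CA):
            for k in range(0, s2, P_CA):
                # One pca x pca block of MAC operations per pass of the systolic array.
                z[i:i+P_CA, j:j+P_CA] += x[i:i+P_CA, k:k+P_CA] @ y[k:k+P_CA, j:j+P_CA]
    return z

x = np.random.rand(32, 48)
y = np.random.rand(48, 16)
print(np.allclose(ddmm(x, y), x @ y))   # True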
SpDMM: SpDMM executes X × Y where X is a sparse matrix. The computation array is organized
as multiple pipelines with each having a Scatter Unit (SCU) and a Gather Unit (GAU). Each non-zero
element in X is represented using a three-tuple (src, dst, val), denoting row index, column index, and
value, respectively. The execution follows the scatter-gather paradigm [148, 14], as shown in Algorithm 18. Executing X × Y takes l_SpDMM clock cycles: l_SpDMM(X, Y) = ⌈Nonz(X)/(pca/2)⌉ × ⌈s3/pca⌉, where Nonz(X) denotes the number of non-zeros in X.
Algorithm 18 SpDMM using Scatter-Gather paradigm
while not done do ▷ Pipelined Execution
for each (src, dst, val) ∈ X in SB do ▷ Data Fetching
Route (src, dst, val) from SB to VB ▷ B2B
Fetch row src of Y: Y[src] from VB
Form input pair {Y[src], (src, dst, val)}
Route the input pair to pipeline dst%(pca/2) ▷ B2P
for each input pair {Y[src], (src, dst, val)} do
Produce u ← val × Y[src] ▷ Scatter Unit (SCU)
Update Z[dst]+ = u ▷ Gather Unit (GAU)
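The Python sketch below is a functional model of the scatter-gather dataflow in Algorithm 18; the routing networks and pipeline parallelism are abstracted away, and here src is used as the index of the row of Y to fetch while dst indexes the row of Z to update.

import numpy as np

def spdmm(coo_x, y, s1):
    """Functional model of SpDMM: sparse X (given as (src, dst, val) tuples) times dense Y."""
    s3 = y.shape[1]
    z = np.zeros((s1, s3), dtype=y.dtype)
    for (src, dst, val) in coo_x:       # one tuple per non-zero element of X
        u = val * y[src]                # Scatter Unit: scale row src of Y
        z[dst] += u                     # Gather Unit: accumulate into row dst of Z
    return z

# X = [[0, 2], [1, 0]] stored as tuples; the result equals X @ Y.
coo_x = [(1, 0, 2.0), (0, 1, 1.0)]
y = np.arange(6, dtype=float).reshape(2, 3)
print(spdmm(coo_x, y, s1=2))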
SDDMM: SDDMM executes Z = A ⊙ (XY) (A ∈ R^(s1×s3)), where ⊙ is the element-wise multiplication. A is a sampling matrix in which each element is either 1 or 0 and samples results from XY. For example, if A[i][j] = 1, then Z[i][j] = ⟨X[i], Y[j]⟩, where ⟨,⟩ denotes the vector inner product operator. If A[i][j] = 0, then Z[i][j] = 0. Each computation pipeline is organized as an adder tree (ADT). The execution of SDDMM is shown in Algorithm 19. Executing A ⊙ (XY) takes l_SDDMM clock cycles, where l_SDDMM(X, Y) = ⌈Nonz(A)/(pca/2)⌉ × ⌈s2/pca⌉.
Algorithm 19 Sampled dense-dense matrix multiplication
while not done do ▷ Pipelined Execution
for each (src, dst) ∈ A in SB do ▷ Data Fetching
Route (src, dst) from SB to VB ▷ B2B
Fetch X[src] and Y[dst] from VB
Form input pair {X[src], Y[dst]}
Route the input pair to a pipeline ▷ B2P
for each input pair do ▷ Computation
Update Z[src][dst]+ = ⟨X[src], Y[dst]⟩ ▷ ADT
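Similarly, a minimal functional sketch of Algorithm 19 is given below, assuming the sampling matrix A is supplied as a list of (src, dst) index pairs.

import numpy as np

def sddmm(sample_coo, x, y):
    """Functional model of SDDMM: compute only the entries of X*Y sampled by A."""
    s1, s3 = x.shape[0], y.shape[1]
    z = np.zeros((s1, s3), dtype=x.dtype)
    for (src, dst) in sample_coo:            # one pair per non-zero of the sampling matrix A
        z[src, dst] = x[src, :] @ y[:, dst]  # adder-tree inner product <X[src], Y[dst]>
    return z

x = np.ones((3, 4))
y = np.arange(8, dtype=float).reshape(4, 2)
print(sddmm([(0, 1), (2, 0)], x, y))         # only Z[0][1] and Z[2][0] are produced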
PSVM: To execute PSVM, the computation array is organized as pca/2 independent pipelines. Each pipeline has a vector multiplier (VM) to execute the multiplication between a scalar and a vector of length pca. A PE can execute pca^2/2 multiply operations per clock cycle. PSVM can be used to perform matrix-vector multiplication.
PVVA: To execute PVVA, the computation array is organized as pca/2 independent pipelines, each with a vector adder (VA) that executes the addition of two vectors of length pca. A PE can execute pca^2/2 addition operations per cycle. PVVA can be used to execute matrix addition.
5.4.2 Instruction Set
We develop a customized instruction set, including computation instructions and memory read/write instructions. (1) Computation instructions include one instruction for each computation primitive (e.g., a DDMM instruction). Each instruction contains the metadata (e.g., matrix sizes) of the corresponding computation primitive. The Instruction Decoder decodes the instruction and generates the control signals for the PE to execute the computation primitive in a pipelined manner. (2) Memory read/write instructions launch the data transactions between the on-chip buffers and the external memory.
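As a concrete, hypothetical illustration, the sketch below models the two instruction classes as Python records; the field names and the example program are assumptions for illustration and do not reflect the actual binary encoding.

from dataclasses import dataclass

@dataclass
class ComputeInstr:
    primitive: str       # "DDMM", "SpDMM", "SDDMM", "PSVM", or "PVVA"
    s1: int              # matrix dimensions (metadata decoded by the Instruction Decoder)
    s2: int
    s3: int
    src_buffers: tuple   # on-chip buffers holding the operands
    dst_buffer: str      # buffer receiving the result

@dataclass
class MemInstr:
    op: str              # "READ" or "WRITE" between external memory and on-chip buffers
    ddr_addr: int
    buffer: str
    num_bytes: int

program = [
    MemInstr("READ", 0x0000, "VB", 64 * 1024),
    MemInstr("READ", 0x8000, "WB", 16 * 1024),
    ComputeInstr("DDMM", 128, 128, 16, ("VB", "WB"), "RB"),
    MemInstr("WRITE", 0x10000, "RB", 8 * 1024),
]
for instr in program:
    print(instr)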
5.5 Compiler
Existing compilers for CNN or GNN accelerators [33, 95, 62, 1, 134, 15, 83, 140] support only one type of model (CNN or GNN). In contrast, GCV-Turbo offers an end-to-end compilation/optimization workflow for GNN-based CV. For a given input model developed using PyTorch, the Input Parser converts it into an intermediate representation (Section 5.5.1), which serves as the computation graph underlying the inference process. The compiler then performs a five-step compilation (Section 5.5.2) to generate an instruction sequence. In particular, we perform a number of optimizations (Section 5.5.3) specific to GNN-based CV tasks, including (1) data manipulation (DM) layer generation, (2) layer fusion for the DM layer, (3) uniform mapping, (4) data layout centric mapping, and (5) sparsity-aware primitive mapping. Our compiler utilizes the infrastructure of the TVM framework [12]; on top of it, we develop our own input parser, intermediate representation, compilation workflow, and compiler optimizations.
5.5.1 Intermediate Representation
We develop the intermediate representation (IR) for the following set of computation layers in GNN-based
CV tasks:
Convolutional (Conv) Layer: The input Fin has cin feature maps (channels), each having a size of hin ×
win. The output Fout has cout feature maps (channels) with each having the size of hout × wout. The
convolution kernel W has the size of cout×cin×k1×k2. The output Fout is obtained through 2D convolution
between input Fin and kernel W.
Message Passing (MP) Layer: It is used in GNNs for message passing within a graph G(V, E). The inputs are the vertex feature vectors {hin[v] ∈ R^f : v ∈ V} and the edges {evu ∈ R^1 : evu ∈ E}. The output vertex feature vectors {hout[v] ∈ R^f : v ∈ V} are obtained through message passing: hout[v] = ρ({euv · hin[u] : u ∈ N(v)}), where N(v) denotes the set of neighbors of v, and ρ() is an element-wise reduction function, such as Max() or Sum().
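A minimal Python sketch of the MP-layer semantics defined above is given below, assuming a Sum() or Max() reduction for ρ() and toy edge inputs.

import numpy as np

def message_passing(h_in, edges, num_vertices, reduce="sum"):
    """h_out[v] = rho({e_uv * h_in[u] : u in N(v)}); edges are (u, v, e_uv) triples."""
    f = h_in.shape[1]
    h_out = np.zeros((num_vertices, f))
    for (u, v, e_uv) in edges:
        msg = e_uv * h_in[u]
        if reduce == "sum":
            h_out[v] += msg
        else:                            # element-wise Max() reduction
            h_out[v] = np.maximum(h_out[v], msg)
    return h_out

h_in = np.eye(3)                          # 3 vertices, feature length f = 3
edges = [(0, 1, 1.0), (2, 1, 0.5)]        # two messages flowing into vertex 1
print(message_passing(h_in, edges, num_vertices=3))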
Linear Layer: In a Linear layer, an input matrix Hin is multiplied by a weight matrix W to obtain the output matrix Hout.
Vector Inner Product (VIP) Layer: The inputs are the vertex feature vectors {hin[v] ∈ R^f : v ∈ V} and a predefined edge connectivity {evu ∈ R^1 : evu ∈ E} whose edge values are to be calculated. evu is calculated by euv = ⟨hin[u], hin[v]⟩, where ⟨,⟩ denotes the vector inner product.
Data Manipulation (DM) Layers: The DM layer is our proposed new layer that represents the necessary
data manipulation operation between the CNN layer and the GNN layer. See details in Section 5.5.3.1.
Other Layers: Include other types of layers, such as Pooling layers, Normalization (Norm) layers, and
Activation layers.
Following the convention of TVM [12], we implement the IR of each layer as a tensor IR function
(T.prim_func) using TVMScript. The input parser of the compiler generates the computation graph from
the input model and represents each layer using the IR.
5.5.2 Compilation Workflow
We introduce the basic compilation workflow of GCV-Turbo, which has five steps:
• Step 1 - layer fusion: For the computation graph of an input model, the layer fusion step merges some layers (e.g., activation layers, normalization layers) into the adjacent layers to facilitate task-level parallelism, reduce memory traffic, and reduce the overall computation complexity.
• Step 2 - layer-to-matrix operation mapping: For each layer in the computation graph, the compiler
maps it into a set of matrix operations (e.g., matrix multiplication).
• Step 3 - data tiling and task partitioning: Because the accelerator has limited on-chip memory, this
step performs data tiling for each matrix operation. Therefore, a large matrix operation can be
decomposed into a set of matrix operations on small data tiles.
• Step 4 - mapping matrix operation to Computation Primitive: This step maps each matrix operation
into the basic computation primitives (Section 5.4.1) that are supported by the accelerator.
• Step 5 - task scheduling: This step plans the execution of the computation graph on the accelerator. The proposed accelerator processes the model layer by layer. For each layer, the APU schedules its computation using a centralized load balancing scheme to balance the workload among the PEs, according to the status (idle or busy) of each PE.
In our design, each step is implemented as a compilation pass. Finally, the compiler generates an instruction
sequence for hardware execution.
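As an illustration of this workflow, the sketch below strings the five steps together as Python passes over a toy IR; the lowering rules, tile size, and primitive choice are placeholders, not the actual compiler logic.

def layer_fusion(ir):
    """Step 1: fold activation/normalization layers into adjacent layers (placeholder rule)."""
    return [l for l in ir if l["type"] not in ("Activation", "Norm")]

def layer_to_matrix_ops(ir):
    """Step 2: lower each layer to matrix operations (placeholder lowering table)."""
    lowering = {"Conv": ["MatMul", "MatAdd"], "MP": ["MatMul"], "Linear": ["MatMul"]}
    return [{"layer": l["name"], "ops": lowering.get(l["type"], [])} for l in ir]

def tile_and_partition(ops, tile=128):
    """Step 3: attach a data-tile size to each matrix operation."""
    return [dict(o, tile=tile) for o in ops]

def map_to_primitives(ops):
    """Step 4: pick a computation primitive per operation (density-driven in the real compiler)."""
    return [dict(o, primitive="DDMM") for o in ops]

def schedule(ops):
    """Step 5: emit a simplified instruction sequence."""
    return [f"{o['primitive']} for {o['layer']} (tile={o['tile']})" for o in ops]

ir = [{"name": "conv1", "type": "Conv"}, {"name": "relu1", "type": "Activation"},
      {"name": "mp1", "type": "MP"}]
for compile_pass in (layer_fusion, layer_to_matrix_ops, tile_and_partition, map_to_primitives, schedule):
    ir = compile_pass(ir)
print(ir)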
5.5.3 Compiler Optimizations for GNN-based CV tasks
In this Section, we introduce the following set of compiler optimizations for GNN-based CV.
5.5.3.1 Data Manipulation Layer Generation
CNN layer and GNN layer can have very different data layouts. For example, the output data layout of
a CNN layer may not be compatible with the input data layout requirement of a GNN layer, and vice
versa. The input parser generates the data manipulation (DM) layer between the CNN and GNN layers.
In GNN-based CV tasks, the data manipulation process between the CNN and GNN layers is illustrated in
Figure 5.6. For example, for the output feature maps of a CNN layer, a GNN is used to perform reasoning in the channel or spatial dimension. For reasoning in the channel dimension, each channel is viewed as a graph node (channel-to-node transformation), while for reasoning in the spatial dimension, each patch of pixels is viewed as a graph node (patch-to-node transformation). This data manipulation process can
lead to significant overhead and requires careful compiler-hardware co-optimization. The DM layer will
be optimized during the compilation process.
[Figure: a CNN-layer-to-GNN-layer transition generates graph nodes from either the channel dimension or the spatial dimension of the output feature maps of a CNN layer; a GNN-layer-to-CNN-layer transition maps the output graph of a GNN layer back to input feature maps for a CNN layer]
Figure 5.6: Data manipulation between CNN and GNN layers
5.5.3.2 Layer fusion for DM layer
To reduce the overhead of the DM layer, the compiler merges the DM layer with the following computation
layer (Conv layer or MP layer). This overlaps the data manipulation and the computation. In our hardware
design (Figure 5.5), each PE maintains a Data Manipulation Module (DMM), which pipelines the data
manipulation operation and computation.
5.5.3.3 Uniform Mapping
CNN layer (Conv layer) and GNN layer (MP layer) have very different computation patterns. To leverage
our flexible hardware design, the compiler performs uniform mapping for the CNN layer and GNN layer in
step 2. Both CNN layer and GNN layer are mapped to matrix operations, including matrix multiplication
and matrix addition. Since our hardware architecture is optimized for various matrix operations, both the
CNN layer and GNN layer can be efficiently executed using our unified architecture design.
5.5.3.4 Data Layout Centric Mapping
CNN layers and GNN layers have very different data layouts. To reduce the data manipulation overhead (Figure 5.6) between the two types of layers, we propose data layout centric mapping, which involves the mapping of Conv layers and the mapping of MP layers:
Mapping of a Conv layer: As shown in Figure 5.7, for a Conv layer, the convolution kernel matrix W is rearranged into k1 × k2 submatrices, denoted as {KMi : 0 ⩽ i ⩽ k1k2 − 1}, where each KMi has dimensions cin × cout. The input feature maps Fin are organized into a matrix, denoted IFM, of size cin × hinwin, with each input feature map represented as a row of this matrix. Each KMi is multiplied by IFM to obtain k1k2 output matrices, denoted as {OFMi : 0 ⩽ i ⩽ k1k2 − 1}, each having dimensions cout × houtwout. Through shift and add (shift-add) operations, the k1k2 output matrices are merged into a single output matrix OFM of size cout × houtwout. OFM can be further reorganized back into cout output feature maps. Consequently, a Conv layer is mapped to matrix multiplication and matrix addition operations.
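The NumPy sketch below illustrates this mapping for a stride-1 convolution with zero 'same' padding (a simplifying assumption); the kernel is split into k1·k2 matrices KMi, each is multiplied with IFM, and the partial results are merged by shift-add, with the shift realized here by slicing a zero-padded copy of the input.

import numpy as np

def conv_as_matmuls(f_in, w):
    """f_in: (c_in, h, w_dim); w: (c_out, c_in, k1, k2). Stride 1, zero 'same' padding."""
    c_in, h, w_dim = f_in.shape
    c_out, _, k1, k2 = w.shape
    # Zero-padded copy of the input used to realize the shift part of shift-add.
    pad = np.zeros((c_in, h + k1 - 1, w_dim + k2 - 1))
    pad[:, k1 // 2: k1 // 2 + h, k2 // 2: k2 // 2 + w_dim] = f_in
    ofm = np.zeros((c_out, h * w_dim))                     # OFM accumulator: c_out x (h*w)
    for i in range(k1):
        for j in range(k2):
            km = w[:, :, i, j]                             # KMi: c_out x c_in
            shifted_ifm = pad[:, i:i + h, j:j + w_dim].reshape(c_in, h * w_dim)
            ofm += km @ shifted_ifm                        # OFMi accumulated by shift-add
    return ofm.reshape(c_out, h, w_dim)

f_in = np.random.rand(3, 8, 8)
w = np.random.rand(4, 3, 3, 3)
print(conv_as_matmuls(f_in, w).shape)                      # (4, 8, 8)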
Data layout of the Conv layer: The proposed mapping strategy brings several benefits: (1) The computation of a Conv layer is mapped to the matrix operations associated with the computation primitives. (2) The reorganization of the data layout of the kernel matrix (W) occurs at compile time, incurring a one-time cost. (3) The data layout of both IFM and OFM remains consistent, without the need for data layout transformations between consecutive Conv layers. (4) Most importantly, the data layout of IFM/OFM simplifies the data layout manipulation between a CNN layer and a GNN layer. For example, if the data manipulation layer performs a channel-to-node transformation, each row of IFM/OFM corresponds to a channel of the feature maps of a CNN layer, and IFM/OFM can serve as the input feature matrix for the following MP layer. If the data manipulation layer performs a patch-to-node transformation, each column or group of columns of IFM/OFM corresponds to an image patch in the feature maps of a CNN layer, and the following MP layer can load node features through a matrix transpose, which can be efficiently executed by the Data Manipulation Module.
Mapping of an MP layer: An MP layer is mapped to the multiplication of the graph adjacency matrix A and the feature matrix H. This matrix multiplication is mapped to either the dense computation primitive (DDMM) or the sparse computation primitive (SpDMM), which is introduced in detail later (Section 5.5.3.5). To reduce the overhead of data manipulation from an MP layer to a Conv layer (Figure 5.6), the compiler utilizes the Buffer-to-Pipeline (B2P) routing network for channel shuffling in DDMM and SpDMM: for data passed from a GNN layer to a CNN layer (Figure 5.6), each node feature vector (or piece of a node feature vector) needs to be routed to the corresponding channel of the feature maps of the CNN layer. During compilation, the compiler assigns a channel index to each node feature vector or piece of a feature vector. During hardware execution, when performing DDMM or SpDMM, the B2P routing network routes each feature vector to the corresponding channel stored in the Result Buffer. Through this on-the-fly channel shuffling, we eliminate the overhead of data manipulation from the GNN layer to the CNN layer.
[Figure: the kernel matrix (cout × cin × k1 × k2) is split into k1k2 submatrices KM0 ... KMk1k2−1 of size cin × cout; the input feature maps Fin form IFM (cin × hinwin); each KMi × IFM produces an output matrix, and the k1k2 outputs are merged by shift-add into OFM (cout × houtwout), i.e., the output feature maps]
Figure 5.7: Mapping a Conv layer to matrix operations
5.5.3.5 Sparsity-aware Primitive Mapping
In step 2, both CNN layers and GNN layers are mapped to matrix operations. Nevertheless, the weight matrix of a CNN/GNN layer or the adjacency matrix of a GNN layer can have different data sparsity. To exploit the data sparsity, the compiler performs sparsity-aware primitive mapping in step 4: for each matrix multiplication operation, the compiler maps it to either the dense computation primitive (DDMM) or the sparse computation primitive (SpDMM), based on the data sparsity and the performance models of the computation primitives (Section 5.4.1).
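One plausible form of this decision, using the SpDMM cycle model of Section 5.4.1 together with an assumed DDMM cycle model, is sketched below; the threshold logic, names, and example sizes are illustrative.

import math

P_CA = 16   # computation-array dimension pca of a PE

def ddmm_cycles(s1, s2, s3):
    return math.ceil(s1 * s2 * s3 / (P_CA * P_CA))      # pca^2 MACs per cycle (assumed model)

def spdmm_cycles(nnz_x, s3):
    return math.ceil(nnz_x / (P_CA / 2)) * math.ceil(s3 / P_CA)   # model from Section 5.4.1

def choose_primitive(s1, s2, s3, density_x):
    """Pick DDMM or SpDMM for X (s1 x s2, with the given density) times Y (s2 x s3)."""
    nnz_x = int(density_x * s1 * s2)
    return "SpDMM" if spdmm_cycles(nnz_x, s3) < ddmm_cycles(s1, s2, s3) else "DDMM"

print(choose_primitive(512, 512, 128, density_x=0.1))   # a 10%-dense tile favors SpDMM
print(choose_primitive(512, 512, 128, density_x=1.0))   # a fully dense tile favors DDMM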
5.6 Implementation Details
Hardware: We implement the accelerator and the APU (Figure 5.3) on an Alveo U250 FPGA [125]. We empirically set pca = 16 for each PE and use the half-precision floating-point data format (fp16). The Alveo U250 board consists of four Super Logic Regions (SLRs). Each SLR is deployed with 2 PEs, except for SLR1, where half of the region is occupied by the FPGA shell and the APU. We use Verilog HDL to develop the PE and the MicroBlaze [86] IP core from AMD Xilinx to implement the APU. FPGA synthesis and place-and-route are carried out using Vivado 2022.2. The generated device map and resource utilization are reported in Figure 5.8. We also perform frequency optimization following the methodology of the Xilinx DPU [37], setting the frequency of the computation units (fcu = 600 MHz) to double that of the data buffers (fbuffer = 300 MHz), which enhances the peak performance of the accelerator.
Impact of resource sharing: As discussed in Section 5.4, different computation primitives share the same set of computation units, on-chip buffers, and routing networks. The wires of the different primitives and the multiplexers for selecting data paths incur extra area cost. In each PE, these wires and multiplexers consume 37K LUTs (Figure 5.8), which is 31% of the LUT consumption of a PE (a PE consumes 118K LUTs). Thus, through resource sharing, our PE design costs only 31% extra LUTs to support the various computation primitives.
Compiler: We develop the compiler in Python on top of the TVM infrastructure [12]; based on it, we develop our own intermediate representation, compilation workflow, and compiler optimizations. The compiler takes the computation graph generated by PyTorch [91] and the metadata of the input data as input. We develop a customized intermediate representation (IR) as TVM prim functions. The five-step compilation is implemented as IR transformation passes that process the generated IR step by step. The output of the compiler is a sequence of instructions stored in a binary file.
[Figure: device map showing PE1-PE7 placed across SLR0-SLR3, with the FPGA shell and APU occupying part of SLR1]
Resource utilization: LUT 1129K/1726K (65.4%), BRAM 1539/2688 (57.3%), URAM 960/1280 (75%), DSP 960/1280 (59.3%)
Figure 5.8: Device map on Alveo U250 FPGA
5.7 Experimental Results
Overview: We conduct experiments to demonstrate two key aspects: (1) Scope: GCV-Turbo's versatility to handle a wide range of GNN-based CV tasks as well as traditional CNNs and GNNs; (2) Performance: GCV-Turbo's ability to achieve high performance, especially for the end-to-end acceleration of GNN-based CV tasks. Comparisons of scope and performance are summarized in Table 5.1 and Table 5.2, respectively. Table 5.1 clearly illustrates that, unlike existing CNN DSAs [33, 95, 62, 1, 134, 15, 83] and GNN accelerators [127, 45, 137, 102, 73, 139, 46], which each target only one specific scope (either CNNs or GNNs), GCV-Turbo can handle all three scopes – CNNs, GNNs, as well as GNN-based CV. We note that current state-of-the-art implementations of GNN-based CV tasks run these ML models on CPUs or GPUs [23, 25, 24, 27, 26] (given the limited scope of CNN DSAs and GNN accelerators). Thus, a natural performance comparison of GCV-Turbo is with standalone CPUs or GPUs for such tasks. Table 5.2 shows a comprehensive comparison of GCV-Turbo versus all alternative baselines: standalone CPU and GPU as well as all the DSAs. Note that GCV-Turbo not only offers comparable performance to CNN DSAs and GNN accelerators within their specialized scopes, it also outperforms CPU and GPU platforms in all three scopes.
The rest of this section is organized as follows: (1) Section 5.7.1 introduces the benchmarks, baselines, metrics, and datasets. (2) Section 5.7.2 presents the comparison results with state-of-the-art CPU and GPU on six GNN-based CV tasks, and on standalone CNNs and GNNs. (3) Section 5.7.3 shows the impact of the compiler optimizations. (4) Section 5.7.4 compares GCV-Turbo's performance with that of state-of-the-art CNN and GNN accelerators within their respective scopes.
Table 5.1: Scope of various accelerators
Accelerator                                          Scope 1 (CNNs)   Scope 2 (GNNs)   Scope 3 (GNN-based CV)   Performance Comparison
CPU and GPU                                          ✓                ✓                ✓                        See Section 5.7.2
CNN DSAs [33, 95, 62, 1, 134, 15, 83]                ✓                ✗                ✗                        See Section 5.7.4.1
GNN Accelerators [127, 45, 137, 102, 73, 139, 46]    ✗                ✓                ✗                        See Section 5.7.4.2
GCV-Turbo                                            ✓                ✓                ✓
Table 5.2: Average speedup achieved by GCV-Turbo over various baselines within their specialized scopes. Each entry represents the performance of GCV-Turbo divided by the performance of the respective baseline. "Not supported" means that the scope is not supported by the baseline.
Baseline            CNNs              GNNs              GNN-based CV tasks
CPU (GPU)           418.8× (1.8×)     499.5× (3.2×)     68.4× (4.1×)
CNN DSAs            0.88 − 0.93×      Not supported     Not supported
GNN Accelerators    Not supported     1.03× − 1.25×     Not supported
5.7.1 Benchmarks, Baselines, and Metrics
Benchmarks: We collect benchmarks from three scopes: (1) scope 3 (GNN-based CV): representative GNN-based CV tasks, as elaborated in Table 5.3, which cover diverse data modalities and model types; (2) scope 1 (CNNs): popular CNN models for CV tasks, including c1: AlexNet, c2: ResNet-50 [54], c3: ResNet-101 [54], c4: VGG16 [104], and c5: VGG19 [104]; (3) scope 2 (GNNs): widely used GNN models (g1: GCN [66], g2: GraphSAGE [53], g3: GAT [107]).
Table 5.3: Details of evaluated GNN-based CV tasks
Notation Task Input Modality Model Type Dataset
b1 [43] Few-shot image classification image CNN + GNN Omniglot [68]
b2 [16] Multi-label image classification image CNN + GNN MS-COCO [79]
b3 [142] Image segmentation image CNN + GNN Cityscapes [22]
b4 [128] Skeleton-based action recognition human skeleton CNN + GNN NTU RGB+D [81]
b5 [138] SAR automatic target classification radar signal CNN + GNN MSTAR [87]
b6 [93] Point cloud classification point cloud GNN ModelNet40 [122]
Table 5.4: Statistics of the graphs in GNN-based CV tasks
Model # of vertices # of edges Feature length
b1 25-100 300-5000 300-400
b2 80 6400 300-2048
b3 100-300 10000-30000 561-33153
b4 25 75-125 9600-19200
b5 16384 131072 48
b6 1024 10000-30000 64-1024
Baselines: We compare the performance with the implementations on CPU and GPU as shown in Table
5.5.
Table 5.5: Specifications of platforms
Platforms CPU GPU GCV-Turbo
Platform AMD Ryzen 3990x Nvidia RTX A5000 Alveo U250
Platform Technology TSMC 7 nm Samsung 8 nm TSMC 16 nm
Frequency 2.90 GHz 1170 MHz 600/300 MHz
Peak Performance 3.7 TFLOPS 27.7 TFLOPS 1.08 TFLOPS
On-chip Memory 256 MB L3 cache 6 MB L2 cache 45 MB
Memory Bandwidth 107 GB/s 768 GB/s 77 GB/s
Performance Metrics: We consider two performance metrics: (1) batch-size-one latency: this measures the accelerator's latency when the batch size is equal to one. In applications like autonomous driving [100], low latency is critical for ensuring safety. (2) throughput: when comparing with state-of-the-art CNN accelerators for standalone CNNs (Section 5.7.4.1), we use throughput as the performance metric, since CNN accelerator performance is typically reported as throughput [33, 134].
5.7.2 Comparison with CPU and GPU Implementations
In this section, we provide a comprehensive comparison between GCV-Turbo and CPU/GPU across the
three scopes (Table 5.1). The summarized results can be found in Table 5.2.
5.7.2.1 Evaluation on Scope 3 (GNN-based CV tasks)
Figure 5.9 displays the comparison results with CPU and GPU performance on six representative GNN-based CV tasks. The CPU and GPU implementations of these six tasks are well-optimized open-source implementations [23, 25, 24, 27, 26], which utilize optimized CUDA libraries for the CNN and GNN layers. Note that b3 employs two different CNN models, ResNet-50 and ResNet-101, in combination with its proposed GNN layers, resulting in two combinations denoted as b3-r50 and b3-r101, respectively. On average, GCV-Turbo achieves 68.4× and 4.1× latency reduction compared with the CPU and GPU, respectively. This speedup is attributed to two factors: (1) the proposed accelerator utilizes a unified architecture to accelerate both CNN and GNN layers, improving resource utilization, and (2) our compiler optimizations hide and eliminate the overhead of data layout transformation between CNN and GNN layers. As illustrated in Figure 5.9, GCV-Turbo achieves a higher speedup on b1 and b4-b6 because (1) as shown in the breakdown analysis below, GNN layers constitute a larger portion of the workload in b1 and b4-b6, and GCV-Turbo achieves a higher speedup for GNN layers due to its architecture optimized for the irregular computation in GNNs, and (2) as shown in Table 5.6, the models of b1 and b4-b6 fit in the on-chip memory of our accelerator. Due to the customized on-chip memory organization, the Computation Array can access the parameters of the model in one clock cycle. In contrast, GPUs have a complex cache hierarchy and a small L1 cache (128 KB per SM); accessing the parameters requires navigating the cache hierarchy, resulting in higher latency.
Discussion on the throughput of GPU: While GCV-Turbo achieves lower latency when the batch size is 1, the GPU can achieve higher throughput by increasing the batch size (e.g., 8/16/32), as the GPU has higher peak performance and memory bandwidth. Nevertheless, this work targets latency-sensitive applications (e.g., autonomous driving). Achieving higher throughput would require FPGA vendors to develop more powerful FPGA boards with more hardware resources.
Figure 5.9: Speedup (latency reduction) over CPU and GPU on GNN-based CV tasks
Table 5.6: Model size in GNN-based CV tasks
Task b1 b2 b3-r50 b3-r101 b4 b5 b6
Model size (MB) 9.6 115 66 114 5.2 0.76 1.67
Breakdown Analysis: We conduct a breakdown analysis to understand the speedup of GCV-Turbo compared with the GPU on b1-b6. The results, depicted in Figures 5.2 and 5.10, demonstrate that different GNN-based CV tasks consist of varying proportions of CNN layers, GNN layers, and data layout transformation. Table 5.7 presents the breakdown analysis of the speedup over the baseline GPU. GCV-Turbo achieves a speedup of 1.2 − 2.4× on the CNN portion and 1.3 − 15.2× on the GNN portion of the various GNN-based CV tasks. Moreover, through our compiler optimizations, the overhead of data layout transformation is completely hidden or eliminated, leading to a higher latency reduction.
Table 5.7: Speedup (batch-size-one latency) of GCV-Turbo over GPU on various portions of the GNN-based CV tasks. For layout transformation, the speedup is ∞ because GCV-Turbo completely eliminates its overhead.
b1 b2 b3-r50 b3-r101 b4 b5 b6
CNN layers 2.4× 1.2× 1.2× 1.2× 1.8× 2.3× N/A
GNN layers 7.6× 6.8× 1.3× 1.3× 8.4× 6.5× 15.2×
Layout transformation ∞ ∞ ∞ ∞ ∞ ∞ 0
Total 5.1× 1.3× 1.2× 1.2× 3.6× 4.6× 15.2×
5.7.2.2 Evaluation on Scope 1 (CNNs)
Table 5.8 illustrates the comparison between GCV-Turbo and highly-optimized CPU and GPU implementations across various widely used CNNs. On average, GCV-Turbo achieves 418.8× (1.8×) latency reduction
compared with CPU (GPU) implementations.
Figure 5.10: Proportion of hardware execution latency of various portions (CNN portion and GNN portion)
on GCV-Turbo.
Table 5.8: Speedup (batch-size-one latency) over CPU and GPU on various CNNs
Model c1: AlexNet c2: ResNet50 c3: ResNet101 c4: VGG16 c5: VGG19
Speedup over CPU 182× 43× 42× 971× 855×
Speedup over GPU 3.9× 1.2× 1.2× 1.4× 1.5×
5.7.2.3 Evaluation on Scope 2 (GNNs)
We evaluate GCV-Turbo using various GNN models and graph datasets. Table 5.9 displays the speedup
achieved by GCV-Turbo over the CPU and GPU platforms. The implementations on CPU and GPU use
the state-of-the-art GNN library, PyTorch Geometric [36]. On average, GCV-Turbo achieves a speedup of
499.5× compared with CPU and 3.2× compared with GPU.
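For reference, the snippet below sketches a two-layer GCN of the kind measured here, expressed with PyTorch Geometric [36] (the baseline library on CPU/GPU). The hidden dimension and the dataset handling are illustrative and do not necessarily match the exact baseline configuration.

    # A minimal sketch of a 2-layer GCN baseline in PyTorch Geometric.
    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv
    from torch_geometric.datasets import Planetoid

    dataset = Planetoid(root="/tmp/Cora", name="Cora")
    data = dataset[0]

    class GCN(torch.nn.Module):
        def __init__(self, in_dim, hidden_dim, out_dim):
            super().__init__()
            self.conv1 = GCNConv(in_dim, hidden_dim)
            self.conv2 = GCNConv(hidden_dim, out_dim)

        def forward(self, x, edge_index):
            x = F.relu(self.conv1(x, edge_index))
            return self.conv2(x, edge_index)

    model = GCN(dataset.num_node_features, 16, dataset.num_classes).eval()
    with torch.no_grad():
        out = model(data.x, data.edge_index)    # full-graph inference, batch size one
    print(out.shape)                            # [num_nodes, num_classes]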
Table 5.9: Speedup over CPU/GPU across various GNNs and graph datasets. [] denotes the speedup of
GCV-Turbo over CPU, while () denotes the speedup of GCV-Turbo over GPU.
Cora [66] CiteSeer [66] PubMed [66] Flickr [53]
g1: GCN [76.2×] (6.7×) [28.8×] (2.7×) [1009×] (2.4×) [312×] (2.4×)
g2: SAGE [131.4×] (2.5×) [119.7×] (1.9×) [178.9×] (2.1×) [421.9×] (3.6×)
g3: GAT [2250×] (6.8×) [1016×] (2.9×) [178.9×] (2.1×) [278.8×] (2.0×)
5.7.3 Impact of Compiler Optimizations
We evaluate the impact of two compiler optimizations:
Layer fusion: Layer fusion yields a speedup ranging from 11.8% to 48.9% across the six GNN-based CV
tasks. This speedup can be attributed to layer fusion’s capacity to enhance task-level parallelism, reduce
external memory traffic, and decrease the overall computational complexity.
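The following NumPy sketch illustrates the general idea behind layer fusion: adjacent layers are executed back-to-back on a tile of data so that the intermediate result stays in on-chip buffers instead of traveling through external memory. It is a conceptual illustration only, not the compiler's actual fusion pass.

    # Layer fusion, conceptually: process row tiles end-to-end so the intermediate
    # activation never needs to be written to and read back from DRAM.
    import numpy as np

    def unfused(x, w1, w2):
        h = np.maximum(x @ w1, 0.0)     # intermediate h would be spilled to external memory
        return np.maximum(h @ w2, 0.0)  # ... and read back for the next layer

    def fused(x, w1, w2, tile=64):
        out = np.empty((x.shape[0], w2.shape[1]), dtype=x.dtype)
        for r in range(0, x.shape[0], tile):
            h = np.maximum(x[r:r + tile] @ w1, 0.0)   # stays in on-chip buffers
            out[r:r + tile] = np.maximum(h @ w2, 0.0)
        return out

    x = np.random.rand(256, 128).astype(np.float32)
    w1 = np.random.rand(128, 64).astype(np.float32)
    w2 = np.random.rand(64, 32).astype(np.float32)
    assert np.allclose(unfused(x, w1, w2), fused(x, w1, w2), atol=1e-4)

Both versions compute the same result; the fused schedule simply removes one round trip of intermediate data to external memory, which is where the reported speedup comes from.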
Sparsity-aware mapping: As the weight matrices of the CNN portions in b1-b6 remain unpruned,
sparsity-aware mapping does not accelerate the CNN portion. As a result, our speedup measurements
exclusively focus on the GNN portion within b1-b6. The sparsity-aware mapping results in speedup percentages of 5.2%, 330%, 356%, 356%, 2.3%, 2.3%, 20.5%, and 0% for the GNN portions within b1 to b6,
respectively. The GNN within b6 does not experience any speedup because, in b6, the GNN consists of
Linear layers, activation layers, and batch normalization layers. The weight matrices within the Linear
layers of b6 do not have data sparsity.
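The snippet below sketches the general idea of sparsity-aware kernel selection: the sparsity of an operand is measured and the computation is dispatched to either a dense or a sparse primitive. The 50% threshold and the SciPy kernels are illustrative stand-ins; the actual mapping rule of Section 5.5.3.5 is driven by the accelerator's performance model.

    # Sparsity-aware kernel selection, conceptually: measure operand sparsity and
    # dispatch to a dense GEMM (DDMM-style) or a sparse kernel (SpDMM-style).
    import numpy as np
    from scipy.sparse import csr_matrix

    def matmul_sparsity_aware(a: np.ndarray, b: np.ndarray, threshold: float = 0.5):
        sparsity = 1.0 - np.count_nonzero(a) / a.size
        if sparsity > threshold:
            return csr_matrix(a) @ b          # sparse primitive for highly sparse operands
        return a @ b                          # dense primitive otherwise

    a = np.random.rand(512, 512).astype(np.float32)
    a[a < 0.9] = 0.0                          # ~90% sparse operand
    b = np.random.rand(512, 64).astype(np.float32)
    c = matmul_sparsity_aware(a, b)
    print(type(c), c.shape)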
5.7.4 Comparison with State-of-the-art Accelerators
We compare the performance of GCV-Turbo with CNN DSAs [33, 134] on CNN models in Section 5.7.4.1
and with GNN accelerators [137, 140] on GNN models in Section 5.7.4.2. Different accelerators are implemented on different hardware platforms and use different amounts of hardware resources. For a fair comparison, we normalize the performance (latency/throughput) by their respective peak performance (FLOPS). For example, normalized throughput is calculated as:

Normalized Throughput of [X] = Throughput of [X] / Peak performance of [X]

where [X] can be AMD DPU [33], OPU [134], BoostGCN [137], GraphAGILE [140], or GCV-Turbo.
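The normalization above can be restated in a few lines of Python. The peak-performance numbers come from Table 5.10 and, as an example, the c4 (VGG16) throughputs from Table 5.11; the peak-performance value used for GCV-Turbo below is a placeholder assumption, not a number taken from these tables.

    # Restating the normalization; peak performance from Table 5.10, c4 throughput
    # from Table 5.11. The GCV-Turbo peak value is a placeholder assumption.
    peak_tflops = {"DPU": 1.15, "OPU1024": 0.20, "GCV-Turbo": 0.60}       # GCV-Turbo: placeholder
    throughput_c4 = {"DPU": 274.0, "OPU1024": 54.4, "GCV-Turbo": 254.7}   # images/second

    def normalized_throughput(x: str) -> float:
        return throughput_c4[x] / peak_tflops[x]

    for name in peak_tflops:
        print(f"{name:10s}: {normalized_throughput(name):6.1f} images/s per peak TFLOPS")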
Table 5.10: Specifications of CNN/GNN accelerators
CNN DSAs GNN Accelerators
Accelerator AMD DPU [33] OPU1024 [134] BoostGCN [137] GraphAGILE [140]
Platform ZCU102 Xilinx XC7K325T Stratix10 GX Alveo U250
Platform Technology N/A 28 nm Intel 14 nm TSMC 16 nm
Peak Performance 1.15 TFLOPS 0.2 TFLOPS 0.64 TFLOPS 0.64 TFLOPS
On-chip Memory 32.1 MB 2 MB 45 MB 45 MB
Memory Bandwidth 19.2 GB/s 12.8 GB/s 77 GB/s 77 GB/s
5.7.4.1 Comparison with CNN Domain-specific Accelerators (DSAs)
We compare GCV-Turbo’s performance with state-of-the-art FPGA-based CNN DSAs, AMD DPU [33] and
OPU [134] (Table 5.10), on throughput (Table 5.11). GCV-Turbo's throughput is computed as 1/latency. GCV-Turbo achieves a normalized throughput of 0.88× and 0.93× compared with OPU and DPU on c1-c5. These results demonstrate GCV-Turbo's competitive throughput on various CNN models, despite slightly lower performance. The throughput differences are due to two design trade-offs: (1) GCV-Turbo's versatility in supporting both CNNs and GNNs sacrifices some CNN-specific architectural optimizations. For example, OPU's multi-level parallelism is fine-tuned for CNN convolution operations, whereas GCV-Turbo's architecture is more generalized. (2) GCV-Turbo's compilation flow optimizes CNNs and GNNs holistically but cannot support certain convolution-specific optimizations. DPU, for example, selects the dataflow for convolutional layers based on kernel size, which cannot be directly applied to GNN layers. Moreover, due to their CNN-specific optimizations, OPU and DPU use the limited DDR memory bandwidth more efficiently for CNNs. However, the OPU and DPU compilers do not support GNNs, and their architectures are inefficient for the irregular computations and memory access patterns of GNNs.
Table 5.11: Comparison of inference throughput (images/second) with CNN DSAs on various CNN models
Throughput (unnormalized) Normalized average speedup of
c1 c2 c3 c4 c5 GCV-Turbo over the DSA
DPU [33] N/A 43.4 38.8 274 N/A 0.93×
OPU1024 [134] N/A 12.2 9.7 54.4 27 0.88×
GCV-Turbo 512.9 58.8 46.5 254.7 127.3 1×
5.7.4.2 Comparison with GNN Accelerators
Table 5.12: Comparison of hardware execution latency (ms) with state-of-the-art GNN accelerators
Latency (ms) (unnormalized) Normalized average speedup of
CO CI PU FL RE YE AP GCV-Turbo over the accelerator
BoostGCN N/A N/A N/A 20.1 98.5 193.5 793.5 1.25×
GraphAGILE 0.819 2.55 2.24 11.5 97.2 104.3 315 1.03×
FlowGNN 6.9E − 3 8.3E − 3 53E − 3 N/A 136 N/A N/A 0.003× (CO/CI/PU), 0.38× (RE)
GCV-Turbo 0.48 1.47 1.25 6.09 72.7 43.5 196.9 1×
We compare GCV-Turbo with state-of-the-art GNN accelerators, BoostGCN [137], GraphAGILE [140],
and FlowGNN [102] (see Table 5.10). Latency measurements follow the methodology from [137, 140], focusing on the GCN model and various non-CV graph datasets (citation networks: CO [66], CI [66], PU [66]; recommendation systems: FL [136], RE [53], YE [136], AP [136]). Table 5.12 presents the results, where latency is normalized by the peak performance of each hardware platform to obtain the speedup. GCV-Turbo
outperforms BoostGCN and GraphAGILE with speedups of 1.25× and 1.03×, respectively. BoostGCN’s
inferior performance is attributed to separate sparse and dense computation hardware modules, leading
to underutilization. In contrast, GCV-Turbo optimizes resource usage with a unified architecture for both
sparse and dense computations in GNNs. The slight advantage over GraphAGILE is due to GCV-Turbo's sparsity-aware mapping (Section 5.5.3.5), which considers the data sparsity in the input graph's connectivity. GCV-Turbo maps computations on densely connected subgraphs to DDMM, unlike GraphAGILE, which neglects data sparsity in graph connectivity. Compared with FlowGNN, GCV-Turbo performs worse because (1) FlowGNN leverages sparsity in graph feature matrices (CO, CI, and PU have high data sparsity (> 90%) in their feature matrices), while GCV-Turbo only uses sparsity in graph adjacency and weight matrices, and (2) FlowGNN generates an optimized hardware implementation for each input model. However, these two optimizations of FlowGNN are unattractive for CV tasks because (1) the sparsity of graph feature matrices is only known during hardware execution; utilizing this sparsity requires on-the-fly sparsity profiling and data format transformation, causing extra preprocessing overhead. Moreover, the execution time varies with the sparsity of the input data, whereas autonomous driving requires deterministic latency for safety. (2) An autonomous driving system executes various models for various data modalities; generating an optimized bitstream for each model incurs a large latency for switching between bitstreams through dynamic reconfiguration.
Chapter 6
Conclusions
In this concluding chapter, we will provide remarks on the dissertation, exploring the broader impacts of
the work and the future directions it may inspire.
6.1 Broader Impacts
Graph neural networks (GNNs) have found widespread application across various real-world domains, including recommendation systems, social networks, and circuit design. It is foreseen that their significance
will only grow in the future, given that most real-world data can be effectively represented as graphs. Even
seemingly unrelated data formats, such as two-dimensional images, can be conceptualized as a form of
graph, particularly with their underlying grid structure. Furthermore, as real-world applications leveraging GNNs continue to expand, so does the scale of the graph data involved. For instance, recommendation
systems and social networks are expected to accommodate ever-growing numbers of users, while circuit designs will become increasingly intricate, incorporating larger arrays of transistors. Consequently,
there is an imminent demand for high-performance machine learning systems tailored for graph-based
applications to support these evolving needs. The GraphAGILE approach, a compiler-hardware codesign
framework developed in this study, stands out as an exemplary solution, offering a comprehensive strategy
for accelerating graph neural networks from end to end.
Furthermore, the inherent data sparsity within graph neural networks (GNNs) can be leveraged to
expedite GNN inference processes. This exploitation of sparsity not only diminishes computational complexity but also results in decreased inference latency. Effectively harnessing this sparse data necessitates
a versatile hardware design and runtime system capable of dynamically mapping a GNN computation kernel to fundamental computation primitives. The hardware-runtime-system codesign approach presented
in this study, termed Dynasparse, offers a novel methodology for maximally capitalizing on data sparsity
within GNN inference tasks.
Finally, graph neural networks (GNNs) have emerged as pivotal tools in various computer vision tasks,
owing to their ability to glean insights from unstructured data prevalent in computer vision realms, such
as point clouds and 3-D meshes. GNN-based computer vision endeavors entail a hybrid machine learning
model design, integrating both convolutional neural network (CNN) layers and GNN layers. This demands
a comprehensive optimization approach within the machine learning system to effectively manage the
execution of GNN-based tasks, considering the interleaving nature of CNN and GNN layers. The hardware-software codesign framework introduced in this study, named GCV-Turbo, marks the pioneering effort in
optimizing GNN-based computer vision tasks. It offers a suite of compiler optimizations and hardware
enhancements tailored to address the intricate interleaving dynamics between CNN and GNN layers.
6.2 Future Directions
The hardware-software codesigns presented in this study pave the way for high-performance graph neural
network inference across numerous real-world applications. In this section, we delve into several prospective avenues for future exploration.
6.2.1 Accelerating Graph Neural Network on Heterogeneous Platforms
Graph neural network (GNN) inference involves heterogeneous computation kernels that are well suited for execution on platforms with heterogeneous components. Different components are suited to different computation kernels, which creates the potential to further accelerate GNN inference; a minimal dispatch sketch is given after the platform examples below. Several candidate platforms can potentially be exploited:
Figure 6.1: Diagram of AMD-Xilinx ACAP platform
• AMD-Xilinx ACAP [38]: As depicted in Figure 6.1, the AMD-Xilinx ACAP platform represents a
system-on-chip (SOC) architecture amalgamating heterogeneous components. These include scalar
engines, adaptable engines, and intelligent engines. Scalar engines feature microprocessors adept
at executing intricate control workflows. Adaptable engines comprise programmable logic and
hardware DSPs, ideally suited for deploying customized data paths to execute sparse computation
kernels within GNNs. On the other hand, intelligent engines encompass a 2-D array of AI engines,
tailor-made for executing dense computation kernels.
• AMD Ryzen AI platform: As shown in Figure 6.2, the AMD Ryzen AI platform consists of a central processing unit (CPU), a neural processing unit (NPU), and a graphics processing unit (GPU). The CPU is suitable for executing complex control flow. The NPU is designed for the dense computations in machine learning models. The GPU has optimized libraries for sparse computation. By utilizing the combined strengths of the CPU, NPU, and GPU, the Ryzen AI platform has the potential to further speed up GNN inference.
Figure 6.2: Diagram of AMD Ryzen AI platform
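The Python sketch below illustrates, at a very high level, how GNN inference could be partitioned across such heterogeneous components: the sparse neighborhood aggregation is dispatched to one engine and the dense feature transformation to another. The device labels and the dispatcher are hypothetical; targeting real ACAP or Ryzen AI engines requires the corresponding vendor toolchains.

    # An illustrative partitioning of GNN kernels across heterogeneous components.
    # The "devices" here are only labels on a placeholder dispatcher.
    import numpy as np
    from scipy.sparse import csr_matrix

    def run_on(device: str, fn, *args):
        # Placeholder dispatcher; a real system would enqueue work on the named engine.
        print(f"[{device}] executing {fn.__name__}")
        return fn(*args)

    def aggregate(adj: csr_matrix, x: np.ndarray) -> np.ndarray:
        return adj @ x                       # sparse kernel: irregular memory access

    def transform(x: np.ndarray, w: np.ndarray) -> np.ndarray:
        return np.maximum(x @ w, 0.0)        # dense kernel: regular GEMM + ReLU

    n, f_in, f_out = 1000, 64, 32
    adj = csr_matrix(np.eye(n, dtype=np.float32))          # toy graph (self-loops only)
    x = np.random.rand(n, f_in).astype(np.float32)
    w = np.random.rand(f_in, f_out).astype(np.float32)

    h = run_on("adaptable-engine", aggregate, adj, x)      # sparse kernel -> reconfigurable logic
    y = run_on("ai-engine", transform, h, w)               # dense kernel  -> AI/NPU engine
    print(y.shape)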
6.2.2 Accelerating Hybrid Machine Learning Models in Computer Vision Tasks
The emergence of diverse machine learning (ML) models has brought about a significant revolution in
computer vision (CV), facilitating various novel CV applications. These ML models include convolutional
neural networks (CNNs) [54, 67, 98], graph neural networks (GNNs) [66, 53, 108], and vision transformers
(ViTs) [31, 82]. While various ML models are available for CV tasks, there is no one-size-fits-all solution,
as different models have different strengths. Convolutional Neural Networks (CNNs): CNNs excel in tasks
like image classification [54], object detection [98], and image segmentation [63]. They are good at capturing local patterns and hierarchical features through their convolutional layers. CNNs are well-suited for
handling large image datasets, offering computational efficiency. However, they are less effective when
dealing with graph-structured data or sequences and struggle to model long-range dependencies. Graph
Neural Networks (GNNs): GNNs are designed to capture relationships and propagate information in graph
data. They are an ideal choice for CV tasks where the data structure is defined by nodes and edges, as seen
in point clouds [93] and 3-D meshes. However, GNNs are less suitable for directly handling regular grid-like data, such as images. Vision Transformers (ViTs): ViTs are tailored for image analysis tasks similar to
CNNs. However, they adopt a different approach by leveraging self-attention mechanisms to model image
content. Unlike CNNs, ViTs are scalable and good at capturing long-range dependencies in images. Nevertheless, ViTs require a substantial amount of labeled data for training, which can be resource-intensive.
They can be computationally expensive when compared to specific CNN architectures. Additionally, ViTs
depend on positional encodings for regular data and cannot directly handle graph-structured data.
Different ML models have different strengths, necessitating a choice among CNNs, GNNs, and ViTs for real-world applications based on specific problem requirements and data structures. Furthermore, many CV tasks leverage the combined strengths of different models. For example, GNN-based CV
tasks utilize the capabilities of both CNNs and GNNs to enable a wide array of innovative CV applications, including few-shot learning [43], multi-label image classification [16], image segmentation [142],
and more. Several studies [123, 72] integrate CNNs and ViTs to enhance model robustness in CV tasks.
Extrapolating the current research trend, we anticipate that computer vision systems, such as autonomous
driving cars, will leverage a diverse range of ML models and various combinations of models. Moreover,
in cloud computing platforms provided by service providers like Google Cloud [48], AWS Inferentia [44],
and Intel Habana Lab [52], the capability to support a diverse range of machine learning models is crucial
to meet the demands of their customers. Consequently, developing a versatile domain-specific accelerator
(DSA) that supports diverse ML models, including CNNs, GNNs, and ViTs, becomes essential.
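The sketch below illustrates the hybrid CNN + GNN pattern discussed above: a small CNN backbone produces per-region features, and a GNN layer propagates information over a graph built on those regions. The architecture, the fully connected region graph, and all dimensions are illustrative assumptions rather than any specific model from the literature.

    # A minimal hybrid CNN + GNN sketch in PyTorch / PyTorch Geometric.
    import torch
    import torch.nn as nn
    from torch_geometric.nn import GCNConv

    class HybridCNNGNN(nn.Module):
        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.backbone = nn.Sequential(                 # tiny CNN feature extractor
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),              # 16 spatial "regions"
            )
            self.gnn = GCNConv(64, num_classes)

        def forward(self, image: torch.Tensor) -> torch.Tensor:
            fmap = self.backbone(image)                    # (1, 64, 4, 4)
            nodes = fmap.flatten(2).squeeze(0).t()         # (16, 64): one node per region
            n = nodes.size(0)
            src, dst = torch.meshgrid(torch.arange(n), torch.arange(n), indexing="ij")
            edge_index = torch.stack([src.reshape(-1), dst.reshape(-1)])  # fully connected graph
            return self.gnn(nodes, edge_index).mean(dim=0) # graph-level prediction

    model = HybridCNNGNN().eval()
    with torch.no_grad():
        logits = model(torch.rand(1, 3, 64, 64))
    print(logits.shape)                                    # torch.Size([10])

Note how the data layout must change between the CNN portion (an NCHW feature map) and the GNN portion (a node-feature matrix plus an edge list); managing this interleaving efficiently is exactly the challenge a versatile DSA must address.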
6.3 Concluding Remarks
In this dissertation, we presented a range of hardware-software codesign approaches aimed at overcoming
four primary challenges associated with accelerating graph neural network (GNN) inference: (1) handling irregular data structures, (2) supporting heterogeneous computation kernels, (3) exploiting dynamic
data sparsity, and (4) accommodating a mixture of models. Our first contribution, GraphAGILE, embodies
an accelerator-compiler codesign strategy tailored for end-to-end acceleration across diverse GNN models. Focused primarily on irregular data structures and heterogeneous computation kernels, GraphAGILE establishes a foundation for efficient GNN inference. Our second contribution, Dynasparse, adopts a
hardware-runtime-system codesign philosophy to harness the potential of dynamic data sparsity in GNN
inference. By dynamically adapting to sparse data conditions, Dynasparse enhances inference efficiency.
Lastly, our third contribution, GCV-Turbo, addresses the challenges posed by a mixture of models, leading
to notable acceleration in GNN-based computer vision tasks. We hope our work will serve as a catalyst for
the development of comprehensive systems for graph-based machine learning, encompassing compiler,
runtime system, and hardware accelerator designs.
Bibliography
[1] Mohamed S Abdelfattah, David Han, Andrew Bitar, Roberto DiCecco, Shane O’Connell,
Nitika Shanker, Joseph Chu, Ian Prins, Joshua Fender, Andrew C Ling, et al. “DLA: Compiler and
FPGA overlay for neural network inference acceleration”. In: 2018 28th international conference on
field programmable logic and applications (FPL). IEEE. 2018, pp. 411–4117.
[2] Adam Auten, Matthew Tomei, and Rakesh Kumar. “Hardware acceleration of graph neural
networks”. In: 2020 57th ACM/IEEE Design Automation Conference (DAC). IEEE. 2020, pp. 1–6.
[3] Sergio Barrachina, Maribel Castillo, Francisco D Igual, Rafael Mayo, and Enrique S Quintana-Orti.
“Evaluation and tuning of the level 3 CUBLAS for graphics processors”. In: 2008 IEEE
International Symposium on Parallel and Distributed Processing. IEEE. 2008, pp. 1–8.
[4] Trinayan Baruah, Kaustubh Shivdikar, Shi Dong, Yifan Sun, Saiful A Mojumder, Kihoon Jung,
José L Abellán, Yash Ukidave, Ajay Joshi, John Kim, et al. “Gnnmark: A benchmark suite to
characterize graph neural network training on gpus”. In: 2021 IEEE International Symposium on
Performance Analysis of Systems and Software (ISPASS). IEEE. 2021, pp. 13–23.
[5] Rianne van den Berg, Thomas N Kipf, and Max Welling. “Graph convolutional matrix
completion”. In: arXiv preprint arXiv:1706.02263 (2017).
[6] Michaela Blott, Thomas B Preußer, Nicholas J Fraser, Giulio Gambardella, Kenneth O’brien,
Yaman Umuroglu, Miriam Leeser, and Kees Vissers. “FINN-R: An end-to-end deep-learning
framework for fast exploration of quantized neural networks”. In: ACM Transactions on
Reconfigurable Technology and Systems (TRETS) 11.3 (2018), pp. 1–23.
[7] Khac-Hoai Nam Bui, Jiho Cho, and Hongsuk Yi. “Spatial-temporal graph neural network for
traffic forecasting: An overview and open research issues”. In: Applied Intelligence 52.3 (2022),
pp. 2763–2774.
[8] Chaoqi Chen, Yushuang Wu, Qiyuan Dai, Hong-Yu Zhou, Mutian Xu, Sibei Yang, Xiaoguang Han,
and Yizhou Yu. “A survey on graph neural networks and graph transformers in computer vision:
a task-oriented perspective”. In: arXiv preprint arXiv:2209.13232 (2022).
[9] Jie Chen, Yousef Saad, and Zechen Zhang. “Graph coarsening: from scientific computing to
machine learning”. In: SeMA Journal 79.1 (2022), pp. 187–223.
[10] Ren Chen, Sruja Siriyal, and Viktor Prasanna. “Energy and memory efficient mapping of bitonic
sorting on FPGA”. In: 2015 ACM/SIGDA FPGA.
[11] Tianlong Chen, Yongduo Sui, Xuxi Chen, Aston Zhang, and Zhangyang Wang. “A unified lottery
ticket hypothesis for graph neural networks”. In: International conference on machine learning.
PMLR. 2021, pp. 1695–1706.
[12] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen,
Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. “TVM: An automated end-to-end
optimizing compiler for deep learning”. In: 13th USENIX Symposium on Operating Systems Design
and Implementation (OSDI 18). 2018, pp. 578–594.
[13] Xiaobing Chen, Yuke Wang, Xinfeng Xie, Xing Hu, Abanti Basak, Ling Liang, Mingyu Yan,
Lei Deng, Yufei Ding, Zidong Du, et al. “Rubik: A Hierarchical Architecture for Efficient Graph
Neural Network Training”. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems (2021).
[14] Xinyu Chen, Hongshi Tan, Yao Chen, Bingsheng He, Weng-Fai Wong, and Deming Chen.
“ThunderGP: HLS-based graph processing framework on fpgas”. In: The 2021 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. 2021, pp. 69–80.
[15] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen,
Zhiwei Xu, Ninghui Sun, et al. “Dadiannao: A machine-learning supercomputer”. In: 2014 47th
Annual IEEE/ACM International Symposium on Microarchitecture. IEEE. 2014, pp. 609–622.
[16] Zhao-Min Chen, Xiu-Shen Wei, Peng Wang, and Yanwen Guo. “Multi-label image recognition
with graph convolutional networks”. In: Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition. 2019, pp. 5177–5186.
[17] Zhaodong Chen, Mingyu Yan, Maohua Zhu, Lei Deng, Guoqi Li, Shuangchen Li, and Yuan Xie.
“fuseGNN: Accelerating graph convolutional neural network training on GPGPU”. In: Proceedings
of the 39th International Conference on Computer-Aided Design. 2020, pp. 1–9.
[18] Young Kyu Choi, Jason Cong, and Di Wu. “FPGA implementation of EM algorithm for 3D CT
reconstruction”. In: 2014 IEEE 22nd Annual International Symposium on Field-Programmable
Custom Computing Machines. IEEE. 2014, pp. 157–160.
[19] Young-kyu Choi, Yuze Chi, Weikang Qiao, Nikola Samardzic, and Jason Cong. “Hbm connect:
High-performance hls interconnect for fpga hbm”. In: The 2021 ACM/SIGDA International
Symposium on FPGA. 2021.
[20] Davide Conficconi, Eleonora D’Arnese, Emanuele Del Sozzo, Donatella Sciuto, and
Marco D Santambrogio. “A framework for customizable fpga-based image registration
accelerators”. In: The 2021 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays. 2021, pp. 251–261.
[21] Jason Cong, Jason Lau, Gai Liu, Stephen Neuendorffer, Peichen Pan, Kees Vissers, and
Zhiru Zhang. “FPGA HLS today: successes, challenges, and opportunities”. In: ACM Transactions
on Reconfigurable Technology and Systems (TRETS) 15.4 (2022), pp. 1–42.
[22] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler,
Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. “The cityscapes dataset for
semantic urban scene understanding”. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2016, pp. 3213–3223.
[23] CPU and GPU implementation of few-shot image classification. url:
https://github.com/vgsatorras/few-shot-gnn.
[24] CPU and GPU implementation of image segmentation. url: https://github.com/lxtGH/GALD-DGCNet.
[25] CPU and GPU implementation of multi-label image classification. url:
https://github.com/megvii-research/ML-GCN.
[26] CPU and GPU implementation of Point cloud classification. url:
https://github.com/WeijingShi/Point-GNN.
[27] CPU and GPU implementation of skeleton-based action recognition. url:
https://github.com/yysijie/st-gcn.
[28] Yoginder S Dandass, Shane C Burgess, Mark Lawrence, and Susan M Bridges. “Accelerating string
set matching in FPGA hardware for bioinformatics research”. In: BMC bioinformatics 9 (2008),
pp. 1–11.
[29] Austin Derrow-Pinion, Jennifer She, David Wong, Oliver Lange, Todd Hester, Luis Perez,
Marc Nunkesser, Seongjae Lee, Xueying Guo, Brett Wiltshire, et al. “Eta prediction with graph
neural networks in google maps”. In: Proceedings of the 30th ACM International Conference on
Information & Knowledge Management. 2021, pp. 3767–3776.
[30] Kien Do, Truyen Tran, and Svetha Venkatesh. “Graph transformation policy network for
chemical reaction prediction”. In: Proceedings of the 25th ACM SIGKDD international conference on
knowledge discovery & data mining. 2019, pp. 750–760.
[31] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. In: International
Conference on Learning Representations. 2020.
[32] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al.
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. In: ICLR.
[33] DPU. url: https://www.xilinx.com/products/%20intellectual-property/dpu.html.
[34] Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. “Graph neural
networks for social recommendation”. In: The world wide web conference. 2019, pp. 417–426.
[35] Jacob Fein-Ashley, Tian Ye, Sachini Wickramasinghe, Bingyi Zhang, Rajgopal Kannan, and
Viktor Prasanna. “A Single Graph Convolution Is All You Need: Efficient Grayscale Image
Classification”. In: arXiv preprint arXiv:2402.00564 (2024).
[36] Matthias Fey and Jan Eric Lenssen. “Fast graph representation learning with PyTorch Geometric”.
In: arXiv preprint arXiv:1903.02428 (2019).
[37] Frequency optimization. url:
https://docs.xilinx.com/r/en-US/ds962-u200-u250/FPGA-Resource-Information.
[38] Brian Gaide, Dinesh Gaitonde, Chirag Ravishankar, and Trevor Bauer. “Xilinx adaptive compute
acceleration platform: Versaltm architecture”. In: Proceedings of the 2019 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. 2019, pp. 84–93.
[39] Claudio Gallicchio and Alessio Micheli. “Fast and deep graph neural networks”. In: Proceedings of
the AAAI Conference on Artificial Intelligence. Vol. 34. 04. 2020, pp. 3898–3905.
[40] Swapnil Gandhi and Anand Padmanabha Iyer. “P3: Distributed deep graph learning at scale”. In:
15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). 2021,
pp. 551–568.
[41] Chen Gao, Yu Zheng, Nian Li, Yinfeng Li, Yingrong Qin, Jinghua Piao, Yuhan Quan,
Jianxin Chang, Depeng Jin, Xiangnan He, et al. “A survey of graph neural networks for
recommender systems: Challenges, methods, and directions”. In: ACM Transactions on
Recommender Systems 1.1 (2023), pp. 1–51.
[42] Victor Garcia and Joan Bruna. “Few-shot learning with graph neural networks”. In: arXiv preprint
arXiv:1711.04043 (2017).
[43] Victor Garcia and Joan Bruna. “Few-shot learning with graph neural networks”. In: 6th
International Conference on Learning Representations, ICLR 2018. 2018.
[44] AWS Inferentia: High performance at the lowest cost in Amazon EC2 for deep learning inference.
url: https://aws.amazon.com/machine-learning/inferentia/.
[45] Tong Geng, Ang Li, Runbin Shi, Chunshu Wu, Tianqi Wang, Yanfei Li, Pouya Haghi,
Antonino Tumeo, Shuai Che, Steve Reinhardt, et al. “AWB-GCN: A graph convolutional network
accelerator with runtime workload rebalancing”. In: 2020 53rd Annual IEEE/ACM MICRO. IEEE.
2020, pp. 922–936.
[46] Tong Geng, Chunshu Wu, Yongan Zhang, Cheng Tan, Chenhao Xie, Haoran You,
Martin Herbordt, Yingyan Lin, and Ang Li. “I-GCN: A graph convolutional network accelerator
with runtime locality enhancement through islandization”. In: MICRO-54. 2021, pp. 1051–1063.
[47] Zhangxiaowen Gong, Houxiang Ji, Yao Yao, Christopher W Fletcher, Christopher J Hughes, and
Josep Torrellas. “Graphite: optimizing graph neural networks on CPUs through cooperative
software-hardware techniques”. In: Proceedings of the 49th Annual International Symposium on
Computer Architecture. 2022, pp. 916–931.
[48] Google Cloud. url: https://cloud.google.com/bigquery/docs/inference-overview.
[49] graph datasets. url: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html.
[50] Yijin Guan, Hao Liang, Ningyi Xu, Wenqiang Wang, Shaoshuai Shi, Xi Chen, Guangyu Sun,
Wei Zhang, and Jason Cong. “FP-DNN: An automated framework for mapping deep neural
networks onto FPGAs with RTL-HLS hybrid templates”. In: 2017 IEEE 25th Annual International
Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE. 2017, pp. 152–159.
[51] Atika Gupta, Priya Matta, and Bhasker Pant. “Graph neural network: Current state of Art,
challenges and applications”. In: Materials Today: Proceedings 46 (2021), pp. 10927–10932.
[52] Habana. url: https://habana.ai/.
[53] William L Hamilton, Rex Ying, and Jure Leskovec. “Inductive representation learning on large
graphs”. In: Proceedings of the 31st International Conference on Neural Information Processing
Systems. 2017.
[54] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition.
2016, pp. 770–778.
[55] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu,
Michele Catasta, and Jure Leskovec. “Open graph benchmark: Datasets for machine learning on
graphs”. In: arXiv preprint arXiv:2005.00687 (2020).
[56] Yuwei Hu, Yixiao Du, Ecenur Ustun, and Zhiru Zhang. “GraphLily: Accelerating graph linear
algebra on HBM-equipped FPGAs”. In: 2021 IEEE/ACM International Conference On Computer
Aided Design (ICCAD). IEEE. 2021, pp. 1–9.
[57] Guyue Huang, Guohao Dai, Yu Wang, and Huazhong Yang. “Ge-spmm: General-purpose sparse
matrix-matrix multiplication on gpus for graph neural networks”. In: SC20: International
Conference for High Performance Computing, Networking, Storage and Analysis. IEEE. 2020,
pp. 1–12.
[58] Hanaa M Hussain, Khaled Benkrid, Huseyin Seker, and Ahmet T Erdogan. “Fpga implementation
of k-means algorithm for bioinformatics application: An accelerated approach to clustering
microarray data”. In: 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS). IEEE.
2011, pp. 248–255.
[59] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by
reducing internal covariate shift”. In: International conference on machine learning. PMLR. 2015,
pp. 448–456.
[60] Weirong Jiang and Viktor K Prasanna. “Scalable packet classification on FPGA”. In: IEEE
Transactions on Very Large Scale Integration (VLSI) Systems 20.9 (2011), pp. 1668–1680.
[61] Licheng Jiao, Jie Chen, Fang Liu, Shuyuan Yang, Chao You, Xu Liu, Lingling Li, and Biao Hou.
“Graph representation learning meets computer vision: A survey”. In: IEEE Transactions on
Artificial Intelligence 4.1 (2022), pp. 2–22.
[62] Norman Jouppi, Cliff Young, Nishant Patil, and David Patterson. “Motivation for and evaluation
of the first tensor processing unit”. In: ieee Micro 38.3 (2018), pp. 10–19.
[63] Baris Kayalibay, Grady Jensen, and Patrick van der Smagt. “CNN-based segmentation of medical
imaging data”. In: arXiv preprint arXiv:1701.03056 (2017).
[64] Yoongu Kim, Weikun Yang, and Onur Mutlu. “Ramulator: A fast and extensible DRAM
simulator”. In: IEEE Computer architecture letters (2015).
[65] Kevin Kiningham, Philip Levis, and Christopher Ré. “GRIP: A graph neural network accelerator
architecture”. In: IEEE Transactions on Computers 72.4 (2022), pp. 914–925.
[66] Thomas N Kipf and Max Welling. “Semi-supervised classification with graph convolutional
networks”. In: arXiv:1609.02907 (2016).
[67] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with deep
convolutional neural networks”. In: Advances in neural information processing systems 25 (2012).
[68] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. “The Omniglot challenge: a
3-year progress report”. In: Current Opinion in Behavioral Sciences 29 (2019), pp. 97–104.
[69] Kartik Lakhotia, Rajgopal Kannan, Sourav Pati, and Viktor Prasanna. “GPOP: A scalable
cache-and memory-efficient framework for graph processing over parts”. In: ACM Transactions on
Parallel Computing (TOPC) (2020).
[70] Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and
Alex Peysakhovich. “Pytorch-biggraph: A large-scale graph embedding system”. In: arXiv preprint
arXiv:1903.12287 (2019).
[71] Jiajun Li, Ahmed Louri, Avinash Karanth, and Razvan Bunescu. “GCNAX: A flexible and
energy-efficient accelerator for graph convolutional neural networks”. In: 2021 IEEE International
Symposium on High-Performance Computer Architecture (HPCA). IEEE. 2021, pp. 775–788.
[72] Shunfeng Li, Chunxue Wu, and Naixue Xiong. “Hybrid architecture based on CNN and
transformer for strip steel surface defect classification”. In: Electronics 11.8 (2022), p. 1200.
[73] Shengwen Liang, Cheng Liu, Ying Wang, Huawei Li, and Xiaowei Li. “Deepburning-gl: an
automated framework for generating graph neural network accelerators”. In: Proceedings of the
39th International Conference on Computer-Aided Design. 2020, pp. 1–9.
[74] Shengwen Liang, Ying Wang, Cheng Liu, Lei He, LI Huawei, Dawen Xu, and Xiaowei Li. “Engn: A
high-throughput and energy-efficient accelerator for large graph neural networks”. In: IEEE
Transactions on Computers 70.9 (2020), pp. 1511–1525.
[75] Yi-Chien Lin, Yuyang Chen, Sameh Gobriel, Nilesh Jain, Gopi Krishna Jha, and Viktor Prasanna.
“Argo: An auto-tuning runtime system for scalable gnn training on multi-core processor”. In:
arXiv preprint arXiv:2402.03671 (2024).
[76] Yi-Chien Lin and Viktor Prasanna. “Hyscale-gnn: A scalable hybrid gnn training system on
single-node heterogeneous architecture”. In: 2023 IEEE International Parallel and Distributed
Processing Symposium (IPDPS). IEEE. 2023, pp. 557–567.
[77] Yi-Chien Lin, Bingyi Zhang, and Viktor Prasanna. “Accelerating GNN Training on CPU
Multi-FPGA Heterogeneous Platform”. In: Latin American High Performance Computing
Conference. Springer. 2022, pp. 16–30.
[78] Yi-Chien Lin, Bingyi Zhang, and Viktor Prasanna. “HP-GNN: Generating High Throughput GNN
Training Implementation on CPU-FPGA Heterogeneous Platform”. In: arXiv preprint
arXiv:2112.11684 (2021).
[79] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. “Microsoft coco: Common objects in context”. In: Computer
Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014,
Proceedings, Part V 13. Springer. 2014, pp. 740–755.
[80] Zhiqi Lin, Cheng Li, Youshan Miao, Yunxin Liu, and Yinlong Xu. “Pagraph: Scaling gnn training
on large graphs via computation-aware caching”. In: Proceedings of the 11th ACM Symposium on
Cloud Computing. 2020, pp. 401–415.
[81] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. “NTU
RGB+D 120: A large-scale benchmark for 3D human activity understanding”. In: IEEE
Transactions on Pattern Analysis and Machine Intelligence 42.10 (2020), pp. 2684–2701.
[82] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.
“Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”. In: 2021 IEEE/CVF
International Conference on Computer Vision (ICCV). IEEE Computer Society. 2021,
pp. 9992–10002.
[83] Tao Luo, Shaoli Liu, Ling Li, Yuqing Wang, Shijin Zhang, Tianshi Chen, Zhiwei Xu,
Olivier Temam, and Yunji Chen. “DaDianNao: A Neural Network Supercomputer”. In: IEEE
Transactions on Computers 66.1 (2017), pp. 73–88. doi: 10.1109/TC.2016.2574353.
[84] J Arjona Martínez, Olmo Cerri, Maria Spiropulu, JR Vlimant, and M Pierini. “Pileup mitigation at
the Large Hadron Collider with graph neural networks”. In: The European Physical Journal Plus
134.7 (2019), p. 333.
[85] Vasimuddin Md, Sanchit Misra, Guixiang Ma, Ramanarayan Mohanty, Evangelos Georganas,
Alexander Heinecke, Dhiraj Kalamkar, Nesreen K Ahmed, and Sasikanth Avancha. “Distgnn:
Scalable distributed training for large-scale graph neural networks”. In: Proceedings of the
International Conference for High Performance Computing, Networking, Storage and Analysis. 2021,
pp. 1–14.
[86] Microblaze. url: https://docs.xilinx.com/v/u/2021.1-English/ug984-vivado-microblaze-ref.
[87] MSTAR. url: https://www.sdms.afrl.af.mil/index.php?collection=mstar.
[88] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu,
and Shantanu Jaiswal. “graph2vec: Learning distributed representations of graphs”. In: arXiv
preprint arXiv:1707.05005 (2017).
[89] Tan Nguyen, Samuel Williams, Marco Siracusa, Colin MacLean, Douglas Doerfler, and
Nicholas J Wright. “The performance and energy efficiency potential of FPGAs in scientific
computing”. In: 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High
Performance Computer Systems (PMBS). IEEE. 2020, pp. 8–19.
[90] Fuping Niu, Jianhui Yue, Jiangqiu Shen, Xiaofei Liao, and Hai Jin. “FlashGNN: An In-SSD
Accelerator for GNN Training”. In: 2024 IEEE International Symposium on High-Performance
Computer Architecture (HPCA). IEEE. 2024, pp. 361–378.
[91] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. “Pytorch: An imperative style,
high-performance deep learning library”. In: Advances in neural information processing systems 32
(2019), pp. 8026–8037.
[92] P Pradhyumna, GP Shreya, et al. “Graph neural network (GNN) in image and video understanding
using deep learning for computer vision applications”. In: 2021 Second International Conference on
Electronics and Sustainable Communication Systems (ICESC). IEEE. 2021, pp. 1183–1189.
[93] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. “Pointnet: Deep learning on point sets
for 3d classification and segmentation”. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. 2017, pp. 652–660.
[94] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. “Pointnet++: Deep hierarchical
feature learning on point sets in a metric space”. In: Advances in neural information processing
systems 30 (2017).
[95] Eric Qin, Ananda Samajdar, Hyoukjun Kwon, Vineet Nadella, Sudarshan Srinivasan,
Dipankar Das, Bharat Kaul, and Tushar Krishna. “SIGMA: A Sparse and Irregular GEMM
Accelerator with Flexible Interconnects for DNN Training”. In: 2020 IEEE International
Symposium on High Performance Computer Architecture (HPCA). 2020, pp. 58–70. doi:
10.1109/HPCA47549.2020.00015.
[96] Md Khaledur Rahman and Ariful Azad. “Triple sparsification of graph convolutional networks
without sacrificing the accuracy”. In: arXiv preprint arXiv:2208.03559 (2022).
[97] Md Khaledur Rahman, Majedul Haque Sujon, and Ariful Azad. “Fusedmm: A unified
sddmm-spmm kernel for graph embedding and graph neural networks”. In: 2021 IEEE
International Parallel and Distributed Processing Symposium (IPDPS). IEEE. 2021, pp. 256–266.
[98] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. “You only look once: Unified,
real-time object detection”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2016, pp. 779–788.
[99] Sungmin Rhee, Seokjun Seo, and Sun Kim. “Hybrid approach of relation network and localized
graph convolutional filtering for breast cancer subtype classification”. In: arXiv preprint
arXiv:1711.05859 (2017).
[100] Kamil Roszyk, Michał R Nowicki, and Piotr Skrzypczyński. “Adopting the YOLOv4 architecture
for low-latency multispectral pedestrian detection in autonomous driving”. In: Sensors 22.3 (2022),
p. 1082.
[101] Alvaro Sanchez-Gonzalez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller,
Raia Hadsell, and Peter Battaglia. “Graph networks as learnable physics engines for inference and
control”. In: International conference on machine learning. PMLR. 2018, pp. 4470–4479.
[102] Rishov Sarkar, Stefan Abi-Karam, Yuqi He, Lakshmi Sathidevi, and Cong Hao. “FlowGNN: A
dataflow architecture for real-time workload-agnostic graph neural network inference”. In: 2023
IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE. 2023,
pp. 1099–1112.
[103] Weijing Shi and Raj Rajkumar. “Point-gnn: Graph neural network for 3d object detection in a
point cloud”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
2020, pp. 1711–1719.
[104] K Simonyan and A Zisserman. “Very deep convolutional networks for large-scale image
recognition”. In: 3rd International Conference on Learning Representations (ICLR 2015).
Computational and Biological Learning Society. 2015.
[105] Indro Spinelli, Simone Scardapane, and Aurelio Uncini. “Adaptive propagation graph
convolutional network”. In: IEEE Transactions on Neural Networks and Learning Systems 32.10
(2020), pp. 4755–4760.
[106] Wenming Tang, Yuanhao Gong, and Guoping Qiu. “Feature preserving 3D mesh denoising with a
Dense Local Graph Neural Network”. In: Computer Vision and Image Understanding 233 (2023),
p. 103710.
[107] Petar Veličković, Guillem Cucurull, Arantxa Casanova, and Yoshua Bengio. “Graph attention
networks”. In: arXiv preprint arXiv:1710.10903 (2017).
[108] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and
Yoshua Bengio. “Graph Attention Networks”. In: International Conference on Learning
Representations. 2018.
[109] Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, Yajuan Wang,
Endong Wang, Qing Zhang, Bo Shen, et al. “Intel math kernel library”. In: High-Performance
Computing on the Intel® Xeon Phi™: How to Fully Exploit MIC Architectures (2014), pp. 167–188.
[110] Jianian Wang, Sheng Zhang, Yanghua Xiao, and Rui Song. “A review on graph neural network
methods in financial applications”. In: arXiv preprint arXiv:2111.15367 (2021).
[111] Minjie Wang, Da Zheng, Zihao Ye, et al. “Deep graph library: A graph-centric, highly-performant
package for graph neural networks”. In: arXiv preprint arXiv:1909.01315 (2019).
[112] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon.
“Dynamic graph cnn for learning on point clouds”. In: ACM Transactions on Graphics (tog) 38.5
(2019), pp. 1–12.
[113] Yuke Wang, Boyuan Feng, Gushu Li, Shuangchen Li, Lei Deng, Yuan Xie, and Yufei Ding.
“GNNAdvisor: An adaptive and efficient runtime system for GNN acceleration on GPUs”.
In: 15th USENIX symposium on operating systems design and implementation (OSDI 21). 2021,
pp. 515–531.
[114] Boris Weisfeiler and Andrei Leman. “The reduction of a graph to canonical form and the algebra
which appears therein”. In: NTI, Series 2.9 (1968), pp. 12–16.
[115] Tingxi Wen, Jiafu Zhuang, Yu Du, Linjie Yang, and Jianfei Xu. “Dual-Sampling Attention Pooling
for Graph Neural Networks on 3D Mesh”. In: Computer Methods and Programs in Biomedicine 208
(2021), p. 106250.
[116] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi,
Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. “Huggingface’s transformers:
State-of-the-art natural language processing”. In: arXiv preprint arXiv:1910.03771 (2019).
[117] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger.
“Simplifying graph convolutional networks”. In: International conference on machine learning.
PMLR. 2019, pp. 6861–6871.
[118] Jianchao Wu, Limin Wang, Li Wang, Jie Guo, and Gangshan Wu. “Learning actor relation graphs
for group activity recognition”. In: Proceedings of the IEEE/CVF Conference on computer vision and
pattern recognition. 2019, pp. 9964–9974.
[119] Lingfei Wu, Peng Cui, Jian Pei, Liang Zhao, and Xiaojie Guo. “Graph neural networks:
foundation, frontiers and applications”. In: Proceedings of the 28th ACM SIGKDD Conference on
Knowledge Discovery and Data Mining. 2022, pp. 4840–4841.
[120] Peng Wu. “Pytorch 2.0: The journey to bringing compiler technologies to the core of pytorch
(keynote)”. In: Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and
Optimization. 2023, pp. 1–1.
[121] Yidi Wu, Kaihao Ma, Zhenkun Cai, Tatiana Jin, Boyang Li, Chenguang Zheng, James Cheng, and
Fan Yu. “Seastar: vertex-centric programming for graph neural networks”. In: Proceedings of the
Sixteenth European Conference on Computer Systems. 2021, pp. 359–375.
[122] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and
Jianxiong Xiao. “3d shapenets: A deep representation for volumetric shapes”. In: Proceedings of
the IEEE conference on computer vision and pattern recognition. 2015, pp. 1912–1920.
[123] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. “Early
convolutions help transformers see better”. In: Advances in neural information processing systems
34 (2021), pp. 30392–30400.
[124] Zhiqiang Xie, Minjie Wang, Zihao Ye, Zheng Zhang, and Rui Fan. “Graphiler: Optimizing graph
neural networks with message passing data flow graph”. In: Proceedings of Machine Learning and
Systems 4 (2022), pp. 515–528.
[125] Xilinx Alveo U250. url:
https://docs.xilinx.com/r/en-US/ds962-u200-u250/FPGA-Resource-Information.
[126] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. “How powerful are graph neural
networks?” In: arXiv preprint arXiv:1810.00826 (2018).
[127] Mingyu Yan, Lei Deng, Xing Hu, Ling Liang, Yujing Feng, Xiaochun Ye, Zhimin Zhang,
Dongrui Fan, and Yuan Xie. “Hygcn: A gcn accelerator with hybrid architecture”. In: 2020 IEEE
International Symposium on High Performance Computer Architecture (HPCA). IEEE. 2020,
pp. 15–29.
[128] Sijie Yan, Yuanjun Xiong, and Dahua Lin. “Spatial temporal graph convolutional networks for
skeleton-based action recognition”. In: Proceedings of the AAAI conference on artificial intelligence.
Vol. 32. 1. 2018.
[129] Zhilin Yang, William Cohen, and Ruslan Salakhudinov. “Revisiting semi-supervised learning with
graph embeddings”. In: International conference on machine learning. PMLR. 2016, pp. 40–48.
[130] Hanchen Ye, Xiaofan Zhang, Zhize Huang, Gengsheng Chen, and Deming Chen. “HybridDNN: A
framework for high-performance hybrid DNN accelerator design and implementation”. In: 2020
57th ACM/IEEE Design Automation Conference (DAC). IEEE. 2020, pp. 1–6.
[131] Zihao Ye, Ruihang Lai, Junru Shao, Tianqi Chen, and Luis Ceze. “Sparsetir: Composable
abstractions for sparse compilation in deep learning”. In: Proceedings of the 28th ACM
International Conference on Architectural Support for Programming Languages and Operating
Systems, Volume 3. 2023, pp. 660–678.
[132] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec.
“Graph convolutional neural networks for web-scale recommender systems”. In: Proceedings of
the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[133] Jiaxuan You, Zhitao Ying, and Jure Leskovec. “Design space for graph neural networks”. In:
Advances in Neural Information Processing Systems 33 (2020), pp. 17009–17021.
[134] Yunxuan Yu, Chen Wu, Tiandong Zhao, Kun Wang, and Lei He. “OPU: An FPGA-based overlay
processor for convolutional neural networks”. In: IEEE Transactions on Very Large Scale
Integration (VLSI) Systems (2019).
[135] Hanqing Zeng and Viktor Prasanna. “GraphACT: Accelerating GCN training on CPU-FPGA
heterogeneous platforms”. In: Proceedings of the 2020 ACM/SIGDA FPGA. 2020, pp. 255–265.
[136] Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna.
“GraphSAINT: Graph Sampling Based Inductive Learning Method”. In: International Conference
on Learning Representations. 2020. url: https://openreview.net/forum?id=BJe8pkHFwS.
[137] Bingyi Zhang, Rajgopal Kannan, and Viktor Prasanna. “BoostGCN: A Framework for Optimizing
GCN Inference on FPGA”. In: 2021 IEEE 29th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM). IEEE. 2021, pp. 29–39.
[138] Bingyi Zhang, Hanqing Zeng, and Viktor Prasanna. “GraphAGILE: An FPGA-based Overlay
Accelerator for Low-latency GNN Inference”. In: IEEE Transactions on Parallel and Distributed
Systems (2023).
[139] Bingyi Zhang, Hanqing Zeng, and Viktor Prasanna. “Hardware acceleration of large scale GCN
inference”. In: 2020 IEEE ASAP, pp. 61–68.
[140] Bingyi Zhang, Hanqing Zeng, and Viktor K. Prasanna. “GraphAGILE: An FPGA-Based Overlay
Accelerator for Low-Latency GNN Inference”. In: IEEE Transactions on Parallel and Distributed
Systems 34.9 (2023), pp. 2580–2597. doi: 10.1109/TPDS.2023.3287883.
[141] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. “Optimizing
FPGA-based accelerator design for deep convolutional neural networks”. In: Proceedings of the
2015 ACM/SIGDA international symposium on field-programmable gate arrays. 2015, pp. 161–170.
[142] Li Zhang, Xiangtai Li, Anurag Arnab, Kuiyuan Yang, Yunhai Tong, and Philip HS Torr. “Dual
graph convolutional network for semantic segmentation”. In: arXiv preprint arXiv:1909.06121
(2019).
[143] Xiao-Meng Zhang, Li Liang, Lin Liu, and Ming-Jing Tang. “Graph neural networks and their
current applications in bioinformatics”. In: Frontiers in genetics 12 (2021), p. 690049.
[144] Yongan Zhang, Haoran You, Yonggan Fu, Tong Geng, Ang Li, and Yingyan Lin. “G-CoS:
GNN-accelerator co-search towards both better accuracy and efficiency”. In: 2021 IEEE/ACM
International Conference On Computer Aided Design (ICCAD). IEEE. 2021, pp. 1–9.
[145] Chuanpan Zheng, Xiaoliang Fan, Cheng Wang, and Jianzhong Qi. “Gman: A graph
multi-attention network for traffic prediction”. In: Proceedings of the AAAI conference on artificial
intelligence. Vol. 34. 01. 2020, pp. 1234–1241.
[146] Da Zheng, Chao Ma, Minjie Wang, Jinjing Zhou, Qidong Su, Xiang Song, Quan Gan,
Zheng Zhang, and George Karypis. “Distdgl: distributed graph neural network training for
billion-scale graphs”. In: 2020 IEEE/ACM 10th Workshop on Irregular Applications: Architectures
and Algorithms (IA3). IEEE. 2020, pp. 36–44.
[147] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang,
Changcheng Li, and Maosong Sun. “Graph neural networks: A review of methods and
applications”. In: AI open 1 (2020), pp. 57–81.
[148] Shijie Zhou, Rajgopal Kannan, Viktor K Prasanna, Guna Seetharaman, and Qing Wu. “Hitgraph:
High-throughput graph processing framework on fpga”. In: IEEE Transactions on Parallel and
Distributed Systems 30.10 (2019), pp. 2249–2264.
[149] Zhe Zhou, Bizhao Shi, Zhe Zhang, Yijin Guan, Guangyu Sun, and Guojie Luo. “BlockGNN:
Towards efficient GNN acceleration using block-circulant weight matrices”. In: 2021 58th
ACM/IEEE Design Automation Conference (DAC). IEEE. 2021, pp. 1009–1014.
[150] Rong Zhu, Kun Zhao, Hongxia Yang, Wei Lin, Chang Zhou, Baole Ai, Yong Li, and Jingren Zhou.
“Aligraph: a comprehensive graph neural network platform”. In: arXiv preprint arXiv:1902.08730
(2019).
Abstract
Graph Neural Networks (GNNs) have revolutionized many real-world applications where data can be represented as graphs. These applications include recommendation systems, social networks, traffic prediction, computer vision tasks (e.g., the tasks in autonomous driving), etc. Many applications require high-performance execution (e.g., low latency or high throughput) of GNNs. To this end, several state-of-the-art libraries (e.g., PyTorch Geometric and Deep Graph Library) have been developed for executing GNNs on general-purpose processors, including CPUs and GPGPUs. However, current GNN libraries on general-purpose processors achieve sub-optimal performance due to several challenges: 1. Irregular data structures: graphs in real-world applications are highly unstructured, with uneven degree distribution. Such irregularity leads to complex data access patterns. 2. Heterogeneous computation kernels: GNNs involve both sparse computation kernels (e.g., sparse-dense matrix multiplication) and dense computation kernels (dense-dense matrix multiplication). While general-purpose processors are efficient for dense computations, their data path and memory hierarchy are inefficient for sparse computations. 3. Dynamic data sparsity: In many applications, the graph connectivity and the data sparsity of vertex features are unknown before executing the GNN model. Therefore, a GNN system needs to deal with the data sparsity of the graphs dynamically. Such dynamic data sparsity makes it difficult for the compiler and runtime system to generate an optimal execution scheme for GNNs. 4. Mixture of models: Some GNN-based applications use the combined strength of different machine learning models. For example, GNN-based computer vision tasks utilize a mixture of convolutional neural networks (CNNs) and GNN models. Such a combination leads to complex data flow.
In this dissertation, we address the above challenges through novel hardware-software codesigns on Field Programmable Gate Array (FPGA). First, to address the challenges of irregular data structures and heterogeneous computation kernels, we develop a hardware-software codesign on FPGA for GNN inference, named GraphAGILE. GraphAGILE incorporates a compiler and an accelerator design on FPGA. For a given input GNN model and an input graph, the compiler translates them into the intermediate representation (IR). Then, the compiler performs several compiler optimizations and generates a sequence of instructions for hardware execution. The hardware accelerator on FPGA executes various computation kernels through flexible customized data paths and memory organization. Second, we propose Dynasparse, an efficient codesign of runtime system and hardware to exploit the dynamic sparsity in GNN inference. The hardware design of Dynasparse has a flexible data path to execute the computation kernels of various data sparsity. The runtime system utilizes a theoretical performance model to dynamically map a GNN computation kernel to the computation primitive based on data sparsity. Third, we propose GCV-Turbo, a hardware-software codesign accelerating GNN-based computer vision (CV) tasks. GNN-based CV tasks involve a mixture of GNN layers and CNN layers. To this end, GCV-Turbo incorporates novel compiler optimizations to orchestrate the dataflow of two types of models. GCV-Turbo further incorporates novel hardware mechanisms to facilitate efficient data layout transformation between different layers. Our implementations based on our codesign methodology achieve superior performance on various GNN-based applications.