Fast and Label-Efficient Graph Representation Learning

by

Sami Abu-El-Haija

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

May 2023

Copyright 2023 Sami Abu-El-Haija

Acknowledgements

This thesis would not have been possible without the continuous guidance and support of my advisors, Aram Galstyan and Greg Ver Steeg, who also serve on my committee. In addition, I would like to thank the remainder of my PhD committee: Irving Biederman, Laurent Itti, and Jonathan May.

I would like to thank my co-authors: at the USC Information Sciences Institute, Kristina Lerman, Hrayr Harutyunyan, Nazanin Alipourfard, Fred Morstatter, Mehrnoosh Mirtaheri, Wes Hardaker, Valentino Crespi, and Elan Markowitz; at the CS Department, Yunhao Ge, Gan Xin, and Keshav Balasubramanian; at Google, Bryan Perozzi, Amol Kapoor, Joonseok Lee, and Rami Al-Rfou; at Intel, Hesham Mostafa and Marcel Nassar.

I will never forget the close interactions I have had with various students and faculty at the University of Southern California and the Information Sciences Institute, which I felt were a valuable form of mentorship. I appreciate the existence of: Spencer Stingley, Rob Brekelmans, Neal Lawton, Daniel Moyer, Kyle Reing, Shushan Arakelyan, Sarik Ghazarian, Ninareh Mehrabi, Myrl Marmarelis, Serban Stan, Yuzhong Huang, Yilei Zhang, Palash Goyal, Nazgol Tavabi, Gleb Satyukov, Aaron Ferber, Tyler LaBonte, Mahdi Soltankotabi, Shaddin Dughmi, Jon May, Ron Artstein, Kiran Lekkala, and Fei Sha.

I would like to thank the administrative staff at USC and ISI, who work hard so that students get to enjoy a smoother PhD experience, including: Lizsl De Leon, Peter Zamar, Melissa Snearl-Smith, and Merideth Reitan.

Last, but not least, I would like to acknowledge the funding provided by the Defense Advanced Research Projects Agency (DARPA), the Intelligence Advanced Research Projects Activity (IARPA), and the Army Contracting Command-Aberdeen Proving Grounds (ACC-APG) for funding my research. I hope I have fulfilled my duty according to your expectations. I have been funded by agreement numbers FA8750-16-2-0204, FA8750-17-C-0106, and W911NF-18-C-0020.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Organization of the thesis
  1.2 Notation Table
  1.3 List of Thesis Publications
Part I - Preliminaries
Chapter 2: Primers
  2.1 Primer on Machine Learning
    2.1.1 Data
    2.1.2 Model
    2.1.3 Objective Function
    2.1.4 Learning Algorithm
  2.2 Primer on Graphs
    2.2.1 Directed VS Undirected Graphs
    2.2.2 Sparse Graph Matrices
    2.2.3 Node & Edge Types
    2.2.4 Node & Edge Features
    2.2.5 Classical Graph Algorithms
Chapter 3: Machine Learning on Graphs
  3.1 Tasks solvable by ML on Graphs
    3.1.1 General Tasks
    3.1.2 Specific Tasks of our Interests
      3.1.2.1 Link Prediction
      3.1.2.2 Node Classification
  3.2 Application Domains using Graphs
    3.2.1 Social Networks
    3.2.2 Citation Networks
    3.2.3 World Wide Web
    3.2.4 Protein-Protein Interactions
  3.3 Node Embedding Models
    3.3.1 Node Embedding Models using Direct Connections
    3.3.2 Node Embedding Models Simulating Random Walks
    3.3.3 Node Embedding Framework
      3.3.3.1 Direct Connections via Node Embedding Framework
      3.3.3.2 Random Walks via Node Embedding Framework
      3.3.3.3 Graph Likelihood via Node Embedding Framework
    3.3.4 How to Utilize Node Embeddings for Inference
  3.4 Models that use Graphs as Regularizers
  3.5 Models that use Message Passing
    3.5.1 Message Passing
    3.5.2 Graph Convolutional Networks
  3.6 Conclusion of Part I & Going forward
Part II - Data-efficient Models for node classification and link prediction
Chapter 4: N-GCN Model: Multi-scale Graph Convolution Networks
  4.1 N-GCN Model Intuition
  4.2 N-GCN Model Construction
    4.2.1 Explicit Random Walks
    4.2.2 Network of GCNs
      4.2.2.1 Fully-Connected Classification Network
      4.2.2.2 Attention Classification Network
    4.2.3 Training
    4.2.4 GCN Replication Factor r
    4.2.5 Relation to other GCN variants
  4.3 N-GCN Experiments
    4.3.1 Datasets
    4.3.2 Baseline Methods
    4.3.3 Implementation
    4.3.4 Node Classification Accuracy
    4.3.5 Sensitivity Analysis
    4.3.6 Tolerance to Feature Noise
    4.3.7 Random Walk Steps Versus GCN Depth
Chapter 5: MixHop Model: Layer-wise mixing Graph Convolution Network
  5.1 MixHop Model Intuition
  5.2 MixHop Model Construction
    5.2.1 Computational Complexity
    5.2.2 Representational Capability
    5.2.3 General Neighborhood Mixing
    5.2.4 Learning Graph Convolution Architectures
      5.2.4.1 Output Layer
      5.2.4.2 Learning Adjacency Power Architectures
  5.3 MixHop Experiments
    5.3.1 Datasets
    5.3.2 Training
    5.3.3 Experimental Results
      5.3.3.1 Results on Synthetic Graphs
      5.3.3.2 Node Classification Results
      5.3.3.3 Visualizing Learned Architectures
Part III - Human & Machine efficiency: Fewer & Faster Experiments
Chapter 6: Watch-Your-Step Model for Learning the Context Distribution
  6.1 Watch-Your-Step Model Intuition
    6.1.1 Side Note: Attention Models
  6.2 Watch-Your-Step Model Construction
    6.2.1 Expectation on the co-occurrence matrix, E[.]
    6.2.2 Learning the Context Distribution
    6.2.3 Graph Attention Models
    6.2.4 Training Objective
    6.2.5 Computational Complexity
    6.2.6 Extensions
  6.3 Watch-Your-Step Experiments
    6.3.1 Link Prediction Experiments
    6.3.2 Sensitivity Analysis
    6.3.3 Node Classification Experiments
Chapter 7: "Convexified" Graph Neural Network
  7.1 Convexified GNNs: Intuition
  7.2 Refresher: SVD & Model Classes to be convexified
    7.2.1 Network embedding models based on DeepWalk & node2vec
    7.2.2 Message passing graph networks for (semi-)supervised node classification
    7.2.3 Truncated Singular Value Decomposition (SVD)
  7.3 Convex first-order approximations of GRL models
    7.3.1 Convexification of Network Embedding Models
    7.3.2 Convexification of message passing graph networks
  7.4 Symbolic matrix representation
  7.5 Implementation of Functional Singular Value Decomposition
    7.5.1 Calculating SVD
    7.5.2 TensorFlow Implementation
    7.5.3 Time improvements
    7.5.4 Analysis
      7.5.4.1 Norm Regularization of Wide Models
      7.5.4.2 Computational Complexity and Approximation Error
  7.6 Experiments
    7.6.1 Datasets
    7.6.2 Semi-supervised Node Classification
    7.6.3 ROC-AUC Link Prediction
    7.6.4 Sensitivity Analysis
Chapter 8: Convex GNNs initialize Deeper Models
  8.1 Intuition
  8.2 SVD initialization for deeper models fine-tuned via cross-entropy
    8.2.1 Edge function for network embedding as a (1-dimensional) Gaussian kernel
      8.2.1.1 Approximating the integral of the Gaussian 1d kernel
    8.2.2 Split-ReLu (deep) graph network for node classification (NC)
    8.2.3 Creative Add-ons for node classification (NC) models
  8.3 Analysis & Discussion
  8.4 Applications & Experiments
    8.4.1 Experiments on Stanford's OGB datasets
  8.5 Related work
Part IV - Scaling to large graphs
Chapter 9: Graph Traversal with Tensor Functionals: Meta-Algorithm for Scalable Learning
  9.1 GTTF: Intuition
  9.2 Graph Traversal via Tensor Functionals (GTTF): Derivation
    9.2.1 Data Structure
    9.2.2 Learning by Providing Functions to a Stochastic Traversal Algorithm
      9.2.2.1 AccumulateFn and BiasFn
    9.2.3 Specializing AccumulateFn & BiasFn to recover Algorithms for GRL
      9.2.3.1 Message Passing: Graph Convolutional variants
      9.2.3.2 Node Embeddings
  9.3 Theoretical Analysis
    9.3.1 Estimating the k-th power of the transition matrix
    9.3.2 Unbiased Learning
    9.3.3 Complexity Analysis
  9.4 Experiments
    9.4.1 Node Embeddings for Link Prediction
    9.4.2 Message Passing for Node Classification
    9.4.3 Experiments comparing against Sampling methods for Message Passing
    9.4.4 Runtime and Memory comparison against optimized Software Frameworks
    9.4.5 Hyperparameters
  9.5 Proofs
  9.6 Additional GTTF Implementations
    9.6.1 Message Passing Implementations
      9.6.1.1 Graph Attention Networks [GAT, Veličković et al., 2018]
      9.6.1.2 Deep Graph Infomax [DGI, Veličković et al., 2019]
    9.6.2 Node Embedding Implementations
      9.6.2.1 Node2Vec [Grover and Leskovec, 2016]
      9.6.2.2 Watch Your Step [WYS, Abu-El-Haija et al., 2018]
Chapter 10: di: Py Framework for Distributed Computing of Matrices and their Implicit Decompositions
  10.1 Introduction
    10.1.1 Utility of SVD
    10.1.2 Chapter's Contributions
  10.2 Related Work
  10.3 Distributed Implicit (di)
    10.3.1 High-level Requirements
    10.3.2 Interface for defining and computing matrices
      10.3.2.1 Input matrices
      10.3.2.2 Transforming matrices
      10.3.2.3 Materializing: writing submatrices on disk
      10.3.2.4 Compute engine
    10.3.3 Distributing the work graph among workers
    10.3.4 Application: SOTA Link Prediction GNNs
    10.3.5 Implicit Multiplication
    10.3.6 Approximations to speed-up SVD
    10.3.7 Engineering Highlights
  10.4 Experiment & Results
    10.4.1 Runtime of implicit SVD of SciPy VS di
    10.4.2 Link Prediction: ogbl-citation2
  10.5 Analysis
    10.5.1 Proof of Theorem 9
    10.5.2 Deviation from linearity hyperplane upon fine-tuning
    10.5.3 Trading off exact equivalence for model capacity
Chapter 11: Conclusions
  11.1 High-level Summary
  11.2 Towards Data-efficient Graph Representation Learning
  11.3 Summary for Human Efficiency
  11.4 Machine Efficiency
  11.5 Scaling to Larger Graphs
Bibliography

List of Tables

  3.1 Social Networks provided by [Leskovec and Krevl, 2014a]
  4.1 N-GCN Experiments: Datasets
  4.2 N-GCN Experiments: Results
  4.3 N-GCN Experiments: Sensitivity Analysis
  4.4 N-GCN Experiments: Fewer Labels
  4.5 Deeper GCN and SAGE experiments
  5.1 MixHop Experiments: Results
  5.2 MixHop Experiments: Datasets
  5.3 MixHop Experiments: Node Classification Results
  6.1 Watch Your Step Experiments: Results
  7.1 Convex GNNs Experiments: Results
  7.2 Convex GNNs Experiments: Datasets
  8.1 Deepening Experiments: Datasets
  8.2 Deepening Experiments: Results for Drug-Drug Interaction Network
  8.3 Deepening Experiments: Results for ArXiv
  9.1 GTTF VS Related Work
  9.2 GTTF Experiments: Datasets
  9.3 GTTF Experiments: Results on Link Prediction tasks
  9.4 GTTF Experiments: Results on Node Classification tasks
  9.5 GTTF Experiments: Hardware Requirement comparison against DGL and PyG
  10.1 Memory requirement for using existing SVD routines VS di's
  10.2 Memory requirement for using existing distributed SVD routines VS di's
  10.3 Implicit Operations
  10.4 di Experiments: Training time and test performance versus state-of-the-art

List of Figures

  2.1 Depiction of Supervised Machine Learning
  4.1 N-GCN model architecture
  4.2 N-GCN Experiments: Noisy Features
  4.3 N-GCN Experiments: Noisy Edges
  5.1 MixHop Layer
  5.2 MixHop Experiments: Results on synthetic homophily datasets
  5.3 MixHop: Learned Architectures
  6.1 Watch Your Step Experiments: Datasets & high sensitivity of hyperparameters of baseline methods
  6.2 Watch Your Step Experiments: Learned Attention Coefficients
  6.3 Watch Your Step: t-SNE visualization and results for node classification datasets
  6.4 Sensitivity Analysis of the softmax attention model, which is robust to the choices of both of its hyperparameters and consistently outperforms even an optimally set node2vec
  7.1 Symbolic Matrix Representation
  7.2 SVD Runtime with caching & Cholesky Decomposition
  7.3 Convex GNNs: train time VS test accuracy
  7.4 Convex GNNs: Sensitivity Analysis
  9.1 GTTF Data Structure
  10.1 DAG of M^(NE) highlighting input, intermediate, and output computation nodes
  10.2 Example work graphs, explicitly marking propagation of sub-matrices
  10.3 di code snippet and corresponding work graph
  10.4 SVD run-time of implicit SVD implementations of SciPy versus di (using 4 worker machines)

Abstract

The goal of my thesis is to improve the efficiency of machine learning (ML) algorithms on graph-structured data. Graphs are flexible data structures that can represent information from various domains. To name a few: social networks, knowledge graphs, visual reasoning, biochemical networks (e.g., proteins or drugs), and dependency graphs (e.g., computation) are all examples that can be stored as instantiations of the graph data structure. Experts of the aforementioned domains can utilize ML on graphs, a.k.a. graph representation learning (GRL), to infer information on their graphs.
Such inference provides benefit to stakeholders when predicting, e.g., user interests, or when performing binary classification of whether two drugs chemically interact. To accommodate the growth of data size and complexity, GRL models have flourished during the past decade.

In this thesis, by improving the efficiency, I refer to different aspects of efficiency. First, label efficiency: enable empirically-performant models, even in the presence of just a few labeled training samples. Second, human efficiency: require less time designing and running experiments, e.g., by converting hyperparameters to trainable parameters. Third, machine efficiency: require less computational resources. Fourth, efficient scaling: for learning and inferring information on graphs that can be larger than one machine's memory. Further, as the generality of the graph data structure should imply the portability of learning algorithms, the contributions of this thesis are experimentally validated on data spanning a wide range of domains, including social, biochemical, and citation networks.

The thesis could benefit different individuals: readers that are curious about ML, graphs, and/or their intersection; researchers of GRL, to evaluate my directions and proposed solutions; or practitioners of application domains, who could apply, on their data, the code linked within the contribution chapters.

Chapter 1: Introduction

Graphs are general data structures that can represent diverse information from many domains. Their generality attracts a variety of practitioners spanning many fields, e.g., medicine, e-commerce, common-sense reasoning, arithmetic compute engines, and online social networks, just to name a few.

When practitioners (e.g., a biochemist) shape their data (e.g., drugs) onto a graph data structure (e.g., by storing each drug as an entity "node" and each interaction as a relationship "edge"), they immediately inherit access to a range of software tools. These tools work out-of-the-box for any graph, including database systems for storing and querying graph data, tools for depicting and visualizing graph data (which could, e.g., visually reveal clusters of interacting drugs), and public implementations of versatile algorithms that operate on graphs, also known as graph algorithms.

This thesis sits within the large class of graph algorithms. Specifically, it sits within Machine Learning (ML) algorithms that operate on graph data. These algorithms help practitioners infer information missing from graphs. For instance, relating to the running example: predicting whether two drugs interact with high accuracy, so as to guide biochemical experiments; in social networks, a practitioner might be motivated to increase the relevance of advertisements shown to users; in Knowledge Graphs, a system would like to respond to reasoning queries such as "Which country contains the world's most-diverse city?". These three tasks are practical instances where ML could help infer information on graphs.

Improving the quality of inference for the aforementioned applications motivates the rapid evolution of the subject area of machine learning on graphs, also known as graph representation learning (GRL). Broadly, GRL can be seen as a specialization of machine learning (ML) techniques to handle data that takes the form of a graph. Fortunately, since information from various application domains are instantiations of the graph data structure, algorithms can be readily shared across domains.
For instance, an algorithm developed for predicting whether two drugs interact can be immediately re-used for recommending friends in an online social network, as graph algorithms are generally agnostic to the data domain. From a Software Engineering standpoint, this "tool-sharing" is a principle known as don't repeat yourself.

One could debate the reasons for the unprecedented advancement of GRL within the last decade. While the last paragraph provides one justification pertaining to the portability of graph algorithms across application domains, one may still wonder: "Why now?". The last decade has witnessed: an abundance of graph data due to improved data gathering and sharing; the availability of faster computing hardware alongside its programming interfaces; and the adoption within GRL of recent ML advances, including recent neural architectures (e.g., residual connections), model families (e.g., invertible networks), and representation techniques (e.g., tangent kernels).

Regardless of why GRL has significantly advanced within the past decade, the purpose of this thesis is to provide guidance for improving the efficiency of GRL, where efficiency is viewed from four angles:

1. Label efficiency: it is beneficial to make accurate predictions even in cases where very little labeled data is present. Consider the scenario where the interests of only a tiny fraction of users are explicitly provided: is it possible to infer the interests of all remaining users?

2. Machine efficiency: speeding up algorithms for learning and inference can save hardware power consumption and reduce training time.

3. Human efficiency: your efficiency is the most precious! We are motivated to design methods where you have to try out fewer experiments. This includes reducing the search space by replacing hyper-parameters with differentiable parameters, so that their values can be directly optimized on the learning signal rather than by manual search.

4. Efficient scaling: some applications store gigantic graphs with billions of objects and their relationships.

1.1 Organization of the thesis

The crux of this thesis is divided into three parts focused on machine learning models and algorithms for graph structures, preceded by a preliminary part. Specifically, the parts are:

Part I: Provides the reader with the background material needed to understand the remainder of the thesis. If you have expertise in the fields of Machine Learning, Graph Theory, and their intersection, then you may skim through or completely skip this part. However, we recommend that all readers take a look at Chapter 1.2 to familiarize themselves with the notation.

Part II: Describes my contributions for data efficiency. It shows how vanilla models for GRL can be modified to enhance their representation capacity, allowing models to learn additional patterns and be robust in cases where there is little and/or noisy labeled data.

Part III: Addresses machine and human efficiency: how can we obtain a good ML model using many fewer experiments? And how can we make each experiment more computationally efficient, i.e., much faster while using less memory?

Part IV: Gives guidance on how to scale learning to larger graphs. What if a practitioner has a very large graph that does not fit in one machine's memory?
1.2 Notation Table

Symbol: Description
$n \in \mathbb{Z}_+$: Number of nodes in a graph
$m \in \mathbb{Z}_+$: Number of edges in a graph
$i \in [n]$: Node ID, with $[n]$ being the set of integers $[1, 2, \dots, n]$
$j \in [n]$: Same as above
$A \in \mathbb{R}^{n \times n}$: Sparse adjacency matrix with $O(m)$ entries; $A_{i,j}$ is set to 1 if an edge connects nodes $i$ and $j$
$\mathbf{1}_k \in \mathbb{R}^k$: Ones column vector with length $k$
$I_k \in \mathbb{R}^{k \times k}$: $k \times k$ identity matrix
$d \in \mathbb{Z}^n$: Degree vector, where $d_i$ equals the number of neighbors of node $i$; calculated as $d = A \mathbf{1}$
$D \in \mathbb{Z}^{n \times n}$: Diagonal degree matrix, defined as $D = \mathrm{diag}(d)$
$T \in \mathbb{R}^{n \times n}$: Transition matrix (row-wise normalized), calculated as $T = D^{-1} A$
$\hat{A} \in \mathbb{R}^{n \times n}$: Symmetrically normalized adjacency with self-connections: $\hat{A} = (D + I)^{-\frac{1}{2}} (A + I) (D + I)^{-\frac{1}{2}}$
$X \in \mathbb{R}^{n \times d}$: Feature matrix (optional); defined for graphs where nodes have $d$-dimensional features
$Z \in \mathbb{R}^{n \times z}$: Node embedding matrix, the output of a model
$H_\theta$: Model function $H_\theta : \mathbb{R}^{n \times n} \times \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times z}$, where $\theta$ denotes model parameters

1.3 List of Thesis Publications

The majority of the text within this thesis is adopted from peer-reviewed research papers published by Sami Abu-El-Haija in top-tier venues for advancing Machine Learning research. The papers are:

• [Abu-El-Haija et al., 2018]: Abu-El-Haija, Perozzi, Al-Rfou, Alemi (2018). "Watch Your Step: Learning Node Embeddings via Graph Attention". Advances in Neural Information Processing Systems, 2018.
• [Abu-El-Haija et al., 2019]: Abu-El-Haija, Kapoor, Perozzi, Lee (2019). "N-GCN: Multi-scale Graph Convolution for Semi-supervised Node Classification". Uncertainty in Artificial Intelligence, 2019.
• [Abu-El-Haija et al., 2019]: Abu-El-Haija, Perozzi, Kapoor, Harutyunyan, Alipourfard, Lerman, Ver Steeg, Galstyan (2019). "MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing". International Conference on Machine Learning, 2019.
• [Abu-El-Haija et al., 2021a]: Abu-El-Haija, Crespi, Ver Steeg, Galstyan (2021). "Fast Graph Learning with Unique Optimal Solutions". ICLR Workshop on Geometrical and Topological Representation Learning, 2021.
• [Abu-El-Haija et al., 2021b]: Abu-El-Haija, Mostafa, Nassar, Crespi, Ver Steeg, Galstyan (2021). "Implicit SVD for Graph Representation Learning". Advances in Neural Information Processing Systems, 2021.
• [Markowitz et al., 2021]: Markowitz*, Balasubramanian*, Mirtaheri*, Abu-El-Haija*, Perozzi, Ver Steeg, Galstyan (2021). "Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning". International Conference on Learning Representations, 2021.
• [Abu-El-Haija et al., 2022]: Abu-El-Haija, Mostafa, Nassar, Majumdar, Ver Steeg, Crespi, Hardaker, Galstyan (2022). "di: Python Framework for Distributed Computing of Matrices and their Implicit Decompositions". Under preparation.

Research papers that relate less to the thesis topic, or papers on which Sami was not the first author, were intentionally left out of this writing for fair evaluation of the Ph.D. dissertation. A complete list of my publications is available on Google Scholar and on my personal website.

Part I - Preliminaries

This part covers the necessary background material required to understand the remainder of the thesis. The goal of this part is to make the thesis stand on its own. You may skip or skim through the chapters in this part, especially if you are familiar with technical concepts such as Machine Learning (data, model, objective function, training), Graphs (adjacency matrix, Laplacian), and their intersection (learning embeddings, graph convolution). However, if you are not familiar with Machine Learning (ML), then I recommend you read Chapter 2.1.
If you are not familiar with Graphs as a data structure, then I recommend you read Chapter 2.2. Lastly, if you are not familiar with the application of ML on graphs, then I recommend you read Chapter 3.

Chapter 2: Primers

2.1 Primer on Machine Learning

This section briefs the reader on Machine Learning using four intertwined pillars: Data, Model, Objective Function, and Learning Algorithm, all of which are pictorially presented in Figure 2.1.

2.1.1 Data

Formally, a (training) dataset consists of n examples which, depending on the application, can be a list of users, a list of chemical compounds, a list of images, a list of speech waveforms, etc. For the purpose of computers, each example must be converted to numeric form. For instance, a user can consist of numeric attributes: age (scalar), speaks Spanish (binary), speaks English (binary), number of followers (scalar), etc. Let d be the total number of such scalar and binary attributes. Following standard notation, a dataset containing the features of n examples is denoted $\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(n)}\}$. Each of those examples can usually be stored as a d-dimensional vector, $x^{(i)} \in \mathbb{R}^d$, for $i = 1, 2, \dots, n$ (for now, let us ignore regular structures such as images, sound, or language). Finally, it is common to combine all of the n examples into one rectangular matrix with n rows and d columns: $X \in \mathbb{R}^{n \times d}$ is also known as the features matrix, with the i-th row denoted as vector $x^{(i)}$.

Figure 2.1: Depiction of Supervised Machine Learning. Data example features x and its label y are drawn (retrieved) from dataset D. The features x are passed to the (red) input of model h, visualized as a 3-layer neural network parametrized with (green) weight matrices $W_1, W_2, W_3$. The model's (blue) output is then fed into an (orange box) objective function $\mathcal{L}$, which also accepts the target label y. The objective is optimized by the learning algorithm (arrow leaving the orange box), which first computes the objective gradient w.r.t. the model parameters ($\nabla_{W_j} \mathcal{L} = \partial \mathcal{L} / \partial W_j$) and then accordingly updates the model parameters, e.g., via Stochastic Gradient Descent (SGD). The red circles are also known as input neurons or visible variables, the green circles are known as hidden or latent variables, and the blue circles are known as output variables.

Supervised Machine Learning: It is common for examples to be paired with ground-truth target labels, which are also application-specific. For instance, we might know product interests for only some of the users, but we wish to recover them for the remainder of the users. These labels are denoted $y^{(1)}, \dots, y^{(n)}$. For instance, $y^{(i)}$ can be the binary vector [enjoys product A, enjoys product B, ...]. These labels y are also known as the supervision signal. When present, it is common to express the dataset as a set of pairs: $\mathcal{D} = \{(x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})\}$.
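To make this notation concrete, the following minimal NumPy sketch builds a tiny dataset in exactly this form; the attribute values and product labels are invented solely for illustration and are not taken from the thesis.

    import numpy as np

    # Feature matrix X: n = 4 examples (users), d = 4 attributes per example:
    # [age, speaks_spanish, speaks_english, num_followers].
    X = np.array([
        [19., 1., 1., 250.],
        [58., 0., 1.,  40.],
        [32., 1., 0., 830.],
        [20., 0., 1.,  12.],
    ])

    # Supervised labels y: one binary vector per example, e.g.
    # [enjoys product A, enjoys product B].
    y = np.array([
        [1., 0.],
        [0., 1.],
        [1., 1.],
        [0., 0.],
    ])

    n, d = X.shape            # n examples, each a d-dimensional row vector x_i
    print(n, d, y.shape)      # -> 4 4 (4, 2)

Each row of X is one example's vector x^(i), and pairing the i-th row of X with the i-th row of y gives the set-of-pairs view of D described above.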
2.1.2 Model

Practitioners gather data and convert it to an appropriate numerical form for a certain purpose: they want to infer something about the data! This something is achieved by the model. Specifically, the model is a transformation function, denoted h: it accepts the features $x^{(i)}$ of example i as input, and its output for that example, denoted $h(x^{(i)})$, is application-specific. (Later chapters adopt common notation: the application of h on all n examples can be assembled into a matrix $H \in \mathbb{R}^{n \times z}$; further, h is not in boldface as it is a function that maps a vector to a vector.)

1. In supervised learning, $h(x^{(i)})$ is the model's guess for its ground-truth $y^{(i)}$. During training, the difference between $h(x^{(i)})$ and $y^{(i)}$ is minimized, averaged over all examples $i = 1, \dots, n$.

2. In density estimation of unsupervised learning, the function h is usually defined to be a probability density function. Specifically, given an example x, it outputs a positive scalar $h(x) \in \mathbb{R}_+$. The total probability, over the entire domain of the input space $x \in \mathbb{R}^d$, should sum up to one: $\int_{x \in \mathbb{R}^d} h(x) = 1$. For instance, if x are examples of natural images, the entire space $\mathbb{R}^d$ covers both natural images (e.g., people, nature, etc.) and non-natural images (random noise). It is desired that h should assign more mass to real examples, with $h(a) \geq h(b)$ if a versus b, respectively, are uniformly drawn from $\mathcal{D}$ versus $\mathbb{R}^d$.

This thesis will focus on (1). However, the input x represents graph objects, as will be described in the following subsections.

We will now describe how h can calculate a transformation from the features of an input example (e.g., an image) to the output space (e.g., object labels). In today's era, h will likely be a "Deep Neural Network", which is a crude simplification of the primate cortex. Without going into how the simplification stems, h can be defined as:

$$h(x) = \sigma(W_3\, \sigma(W_2\, \sigma(W_1 x))),$$

where the matrices $W_1$, $W_2$, $W_3$ are called the parameters of the model; all of the above are matrix-vector multiplications; and $\sigma$ is known as an activation function and is applied element-wise (i.e., independently on each element of the vector). The presented model h is composed of 3 layers (it can be traced from right to left, with layer j parametrized by $W_j$). In reality, the number of layers can differ, and there are real-world models with thousands of such layers [He et al., 2016a]. Further, the three $\sigma$'s may all be the same or may differ. In addition, there are many other ways one can transform a numeric vector (x) into an output, other than a series of matrix multiplications with parameters interleaving activation functions. Crafting neural networks is an art, and online tutorials are offered [Hinton, NN-tutorials]. Nonetheless, we will stick to models h with the formulation displayed above, with $\sigma$ being the standard logistic function. This formulation is known as a Multi-layer Perceptron (MLP).

Embedding: in addition to the presented transformation functions, which must access the features x, this thesis also focuses on a large class of models known as embedding models. These models do not access any values within x, e.g., in cases where no such values are present. Rather than transforming the input example, embedding methods invent a new z-dimensional vector-space (coordinate) system onto which each example is projected (a.k.a. embedded). Specifically, given n training examples, embedding methods compute a matrix $Z \in \mathbb{R}^{n \times z}$, called the embedding matrix, with the i-th row corresponding to the z-dimensional embedding of the i-th example.
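The 3-layer formulation above translates directly into code. The following is a minimal NumPy sketch of such an MLP forward pass (not the thesis's implementation); the layer widths and the random weights are arbitrary placeholders.

    import numpy as np

    def sigmoid(v):
        # Standard logistic function, applied element-wise.
        return 1.0 / (1.0 + np.exp(-v))

    d, width1, width2, out_dim = 4, 16, 16, 2        # arbitrary example sizes
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(width1, d))                # model parameters
    W2 = rng.normal(size=(width2, width1))
    W3 = rng.normal(size=(out_dim, width2))

    def h(x):
        # h(x) = sigma(W3 sigma(W2 sigma(W1 x))): three matrix-vector products,
        # each followed by the element-wise activation.
        return sigmoid(W3 @ sigmoid(W2 @ sigmoid(W1 @ x)))

    x = rng.normal(size=d)                           # one example's feature vector
    print(h(x))                                      # 2-dimensional output in (0, 1)

Because the last activation is the logistic function, each output entry lies in (0, 1), which is convenient for the cross-entropy objective introduced next.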
2.1.3 Objective Function

Good models should achieve high accuracy and, therefore, low error. But to minimize the error, one must be able to measure it. Precisely, an objective function is an exact specification for quantifying error. We will stick to the objective being what should be minimized. It is denoted $\mathcal{L}$ and is also known as the loss (it is common to say: minimize the loss). Fortunately, there are only a handful of useful objective functions that cover most use-cases, obviating the need for most practitioners to design their own objective functions. The two most popular objective functions for supervised machine learning are:

1. Cross entropy, used for classification, i.e., when the target labels y contain binary information, for example predicting whether or not a user would enjoy certain products, based on the user's features. It is defined as:

$$\text{cross entropy:}\quad \mathcal{L}(h; \mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} -y \odot \log(h(x)) - (1 - y) \odot \log(1 - h(x)),$$

which works well if the entries of y are all binary, and the output of h has the same dimensionality as y with values ranging between 0 and 1. The logarithm $\log(\cdot)$ and the Hadamard multiplication $\odot$ are applied element-wise.

2. Euclidean squared distance, used for regression, i.e., when the target labels y contain continuous information, for example predicting chemical properties of a compound such as conductance or free energy:

$$\text{Euclidean squared distance:}\quad \mathcal{L}(h; \mathcal{D}) = \sum_{(x, y) \in \mathcal{D}} \| y - h(x) \|^2,$$

where $\|a - b\|$ is the Euclidean distance between same-dimensionality vectors a and b, with $\|a - b\| = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots}$, where $a_j$ is the j-th entry of vector a, etc.

Minimizing the above objective functions, over the parameters of the model h, encourages the model output h(x) to be similar to the label y.
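Both objectives translate directly into code. Below is a minimal NumPy sketch of the two losses exactly as written above, summed over a dataset of (x, y) pairs; the dummy constant model at the end is only a placeholder to make the snippet runnable.

    import numpy as np

    def cross_entropy(h, data):
        # data is a list of (x, y) pairs; y is a binary vector and h(x) is a
        # same-length vector with entries in (0, 1).
        total = 0.0
        for x, y in data:
            p = h(x)
            total += np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))
        return total

    def squared_distance(h, data):
        # Sum of ||y - h(x)||^2 over the dataset.
        total = 0.0
        for x, y in data:
            total += np.sum((y - h(x)) ** 2)
        return total

    # Placeholder usage with a dummy model that ignores its input.
    data = [(np.zeros(3), np.array([1.0, 0.0]))]
    h = lambda x: np.array([0.9, 0.2])
    print(cross_entropy(h, data), squared_distance(h, data))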
2.1.4 Learning Algorithm

The learning algorithm tweaks the parameters of the model h so that the objective value (loss, $\mathcal{L}$) is minimized with respect to the training data $\mathcal{D}$. Specifically, the model parameters (e.g., the weight matrices $W_1, W_2, W_3$ for a 3-layer neural network) are adjusted so that the model does as well as possible on the training data. Formally, the learning algorithm executes the minimization:

$$\min_h \mathcal{L}(h; \mathcal{D}).$$

The minimization subscript h is equivalent to writing on the subscript the parameters of h (e.g., $W_1$, $W_2$, $W_3$). The intricacy of this minimization is outside the scope of this thesis. In fact, many machine learning practitioners need not worry about this detail, as modern mathematical software, such as TensorFlow [Abadi et al., 2016] and PyTorch [Paszke et al., 2019], does this minimization "behind the scenes" without needing ML practitioners to implement minimization routines. There are plenty of well-studied minimization algorithms, and practitioners can switch between these algorithms by changing one line of Python code.

2.2 Primer on Graphs

Each graph is composed of nodes and edges. Consider a graph composed of the nodes a, b, c, d, e; it will be used as a running example for this entire section.

Formally, a graph G is defined as a pair G = (V, E), where V is known as the set of nodes and E as the set of edges. The running example contains five nodes: V = {a, b, c, d, e}. Some node pairs are connected through edges, and each edge connects two nodes. Specifically, the edges are a set of node pairs: E = {(a, b), (a, d), (d, b), (b, e), (e, c)}. Following the set notation, one can indicate the existence of an edge, e.g., $(a, b) \in E$, and an absence, e.g., $(b, c) \notin E$.

For the sake of discussion, let us assume that the nodes V = {a, b, c, d, e} represent users in a social network. These 5 users are connected, and in this case let us assume that an edge represents friendship: since $(a, b) \in E$, users a and b are friends on the social network. Conversely, since $(b, c) \notin E$, users b and c are not friends on the social network. In Section 3.2, we present other kinds of information and explain how they can be encoded as graphs.

2.2.1 Directed VS Undirected Graphs

If a graph is undirected, then $(u, v) \in E$ also implies $(v, u) \in E$. Although it is possible to explicitly state this as E = {(u, v), (v, u), ...}, for the sake of space it is sufficient to state one direction and simply indicate that "the graph is undirected". On the other hand, there are asymmetric relationships (e.g., "Follow" on Instagram: John following Matt does not guarantee the reverse). Such asymmetric information can be represented as a directed graph; formally, $(u, v) \in E$ does not necessarily imply $(v, u) \in E$.

2.2.2 Sparse Graph Matrices

Information about a graph G can be concisely described via the sets of nodes V and edges E, as mentioned previously. Inside a computer, this information can be directly stored as (i) node and edge lists, or equivalently, as (ii) an adjacency list, or (iii) an adjacency matrix. For the first two, see Chapter 5.2 of [Skiena, 2008]. Nonetheless, we explain (iii) adjacency matrices, as their mathematical form makes them a good fit for the Machine Learning applications that we review in this document.

Given a graph G = (V, E), it is straightforward to construct the adjacency matrix. Let n = |V| be the number of nodes of the graph. First, the nodes have to be ordered, and often the ordering is arbitrary. Sticking to the running example, let the order of nodes (a, b, c, d, e) be (1, 2, 3, 4, 5). The adjacency matrix, denoted A, is a binary square matrix with n rows and n columns, with each entry being 0 or 1. Each row of A (1 through 5) corresponds to the nodes (a through e), and similarly for the columns. The adjacency matrix of the running-example graph can be written as:

$$\text{adjacency matrix:}\quad A = \begin{array}{c|ccccc} & a & b & c & d & e \\ \hline a & & 1 & & 1 & \\ b & 1 & & & 1 & 1 \\ c & & & & & 1 \\ d & 1 & 1 & & & \\ e & & 1 & 1 & & \end{array}$$

The header row and the left column are for visual aid and are not part of the matrix. Indeed, the 5x5 binary matrix (notation: $A \in \{0, 1\}^{5 \times 5}$) purposefully has its zero entries left out: this is crucial, as the matrix A is usually (and should be) stored in memory as a sparse matrix. When stored, sparse matrices occupy memory only for storing non-zero entries, i.e., proportional to the number of edges, occupying O(|E|) memory in big-O notation. Its counterpart, the dense matrix, contains exactly the same data but occupies O(|V| · |V|) memory when stored. While dense matrices are the ideal storage format for most kinds of data (e.g., images, videos), most real-world graphs, on the other hand, contain far fewer than |V| · |V| edges, as discussed in Section 3.2. Unless otherwise stated, all matrices described in this document are stored as sparse matrices. Luckily, computational software libraries such as NumPy [van der Walt et al., 2011], scikit-learn [Pedregosa et al., 2011], TensorFlow [Abadi et al., 2016], PyTorch [Paszke et al., 2019], and others recognize sparse matrices as native citizens and offer optimized code, e.g., for sparse matrix multiplication.

We follow standard subscript notation: $A_{i,j}$ is the (binary) value of A at row i and column j. E.g., $A_{1,2} = 1$ but $A_{1,3} = 0$, as $(a, b) \in E$ but $(a, c) \notin E$, given the corresponding numberings of (1, 2, 3) for (a, b, c). Since the graph is undirected, the adjacency matrix equals its transpose: $A^\top = A$. Therefore, for adjacency matrices, $A_{i,j} = A_{j,i}$ for all i, j. It is also frequent to drop the comma (writing $A_{ij}$ rather than $A_{i,j}$). It is also common to overload notation and use the actual nodes in the subscript, e.g., $A_{ab}$ (rather than their corresponding numbering). We will adopt these conventions throughout this thesis.
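As an illustration of the sparse storage discussed above, the following SciPy sketch builds the running example's adjacency matrix A; node names are mapped to the ordering (a, b, c, d, e) = (0, 1, 2, 3, 4), i.e., zero-based rather than the one-based numbering used in the text.

    import numpy as np
    import scipy.sparse as sp

    idx = {'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4}
    edges = [('a', 'b'), ('a', 'd'), ('d', 'b'), ('b', 'e'), ('e', 'c')]

    # The graph is undirected, so store each edge in both directions.
    rows = [idx[u] for u, v in edges] + [idx[v] for u, v in edges]
    cols = [idx[v] for u, v in edges] + [idx[u] for u, v in edges]
    vals = np.ones(len(rows))

    # CSR format keeps only the non-zero entries in memory.
    A = sp.csr_matrix((vals, (rows, cols)), shape=(5, 5))
    print(A.nnz)             # -> 10, i.e. 2*|E| stored entries
    print((A - A.T).nnz)     # -> 0, confirming A equals its transpose

Storing A in a sparse format keeps only the 2|E| = 10 nonzero entries, matching the O(|E|) memory argument above, whereas a dense 5x5 array would store all 25 entries.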
Now we discuss other matrices that will be useful in our discussion. First, the degree of node u, denoted deg(u), is the number of edges involving u. If the graph is undirected, the notion of a degree is unambiguous. On the other hand, in directed graphs there are two such quantities, known as the in-degree and the out-degree; in an Instagram graph, these correspond to the number of followers and the number of users followed, respectively.

The node degree matrix D is a diagonal matrix, i.e., it only contains entries along the diagonal. Specifically, $D_{ii}$ equals the degree of node i, while $D_{ij} = 0$ for all $i \neq j$. The running-example graph above has the following degree matrix:

$$\text{degree matrix:}\quad D = \begin{array}{c|ccccc} & a & b & c & d & e \\ \hline a & 2 & & & & \\ b & & 3 & & & \\ c & & & 1 & & \\ d & & & & 2 & \\ e & & & & & 2 \end{array}$$

D can be analytically calculated from A. Specifically, the diagonal entry $D_{ii}$ is equal to the sum of entries $A_{i1} + A_{i2} + \dots$. Using sum notation, this can be precisely written as:

$$D_{ii} = \sum_{j=1}^{|V|} A_{ij}.$$

Another useful matrix is the inverse degree matrix, $D^{-1}$, which in the case of the running example is:

$$\text{inverse degree matrix:}\quad D^{-1} = \begin{array}{c|ccccc} & a & b & c & d & e \\ \hline a & \frac{1}{2} & & & & \\ b & & \frac{1}{3} & & & \\ c & & & 1 & & \\ d & & & & \frac{1}{2} & \\ e & & & & & \frac{1}{2} \end{array}$$

One can verify that this is the true inverse, as it holds that $D^{-1} D = D D^{-1} = I_{|V|}$, where $I_{|V|}$ is a $|V| \times |V|$ identity matrix. This is true in general: to find the inverse of a diagonal matrix, one can reciprocate the diagonal values. (However, some diagonal entries can be zero; therefore, for nodes i that have no connections, it is common to add a self-connection, i.e., set $A_{ii} \leftarrow 1$, before calculating D.)

Finally, the graph transition matrix T is a stochastic matrix (rows sum to 1) and can be calculated as $T = D^{-1} A$. (This definition is known as a right stochastic matrix, with each row summing to 1. It is equally common to define $T = A D^{-1}$, known as a left stochastic matrix, with each column summing to 1. Nonetheless, the derivations in this document adhere to the former.) For the running example, this equals:

$$\text{transition matrix:}\quad T = \begin{array}{c|ccccc} & a & b & c & d & e \\ \hline a & & \frac{1}{2} & & \frac{1}{2} & \\ b & \frac{1}{3} & & & \frac{1}{3} & \frac{1}{3} \\ c & & & & & 1 \\ d & \frac{1}{2} & \frac{1}{2} & & & \\ e & & \frac{1}{2} & \frac{1}{2} & & \end{array}$$

This matrix is useful for many applications. It signifies the following: suppose a random walker is traversing the graph, repeatedly transitioning from one node to the next, and is currently positioned at node u. The random walker takes an edge from u, uniformly at random, to transition to the next node. The row of T corresponding to u forms a probability distribution over the nodes that can be the next step after u.

The transition matrix yields powerful implications. It is a key ingredient for calculating PageRank [Page et al., 1999], which estimates the importance of every webpage on the internet. PageRank is a key ingredient for the success of Google Web Search. It has also been shown that a large class of Machine Learning algorithms can be formulated in terms of the adjacency matrix [Abu-El-Haija et al., 2018, Markowitz et al., 2021].

The last matrix we need is known as the graph Laplacian matrix, L, defined as $L = D - A$. For the running example:

$$\text{Laplacian matrix:}\quad L = \begin{array}{c|ccccc} & a & b & c & d & e \\ \hline a & 2 & -1 & & -1 & \\ b & -1 & 3 & & -1 & -1 \\ c & & & 1 & & -1 \\ d & -1 & -1 & & 2 & \\ e & & -1 & -1 & & 2 \end{array}$$
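Continuing the illustration, the sketch below derives D, its inverse, T, and L from the same sparse adjacency matrix; it rebuilds A so the snippet is self-contained, and it is meant only as an illustrative sketch rather than the thesis's implementation.

    import numpy as np
    import scipy.sparse as sp

    # Rebuild the running example's adjacency (node order a, b, c, d, e).
    rows = [0, 1, 0, 3, 3, 1, 1, 4, 4, 2]
    cols = [1, 0, 3, 0, 1, 3, 4, 1, 2, 4]
    A = sp.csr_matrix((np.ones(10), (rows, cols)), shape=(5, 5))

    deg = np.asarray(A.sum(axis=1)).ravel()   # D_ii = sum_j A_ij -> [2, 3, 1, 2, 2]
    D = sp.diags(deg)                         # degree matrix
    D_inv = sp.diags(1.0 / deg)               # inverse degree matrix
    T = D_inv @ A                             # transition matrix: rows sum to 1
    L = D - A                                 # graph Laplacian: L = D - A

    print(deg)
    print(T.toarray().sum(axis=1))            # -> [1. 1. 1. 1. 1.]
    print(L.toarray())                        # diagonal degrees, -1 off-diagonal

Note that the snippet assumes every node has at least one edge; for isolated nodes, one would add a self-connection before taking the reciprocal of the degrees, as mentioned above.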
2.2.3 Node & Edge Types

The majority of applications have one node type (e.g., user) and one edge type (e.g., friendship). However, in some complex systems, there can be many node and edge types. Consider an online purchasing system, like Amazon, where there can be many node types, corresponding to users, products, and the marketplace. There are various edge types between these nodes. For instance, user-product edges can be purchase, favorite, comment, and/or any relation desired to be captured by an algorithm. With multiple edge types, the notion of a node degree is fluid and is usually explicitly defined per application. Most graphs we review have a single node and edge type. Nonetheless, most of the algorithms we review can generalize from one edge type to many.

2.2.4 Node & Edge Features

So far, we described a graph (G) as a set of nodes (V) and their relationships (edges, E). It is also common in many applications, including ones discussed in Section 3.2, that nodes and/or edges have attributes. To stick to the Machine Learning nomenclature, these attributes are known as features, and they are application-dependent. For instance, following the social networks example, the features for a node (corresponding to a user) could include age, level of education, location, spoken languages, etc.

To stick to matrix notation, let the node feature matrix X be a matrix with n = |V| rows and d columns (i.e., $X \in \mathbb{R}^{n \times d}$). Each row in X contains the d feature dimensions for one node in V. For the sake of discussion of the running example, let us assume:

$$\text{node feature matrix:}\quad X = \begin{array}{c|ccccc} & \text{age} & \text{speaks English} & \text{speaks Spanish} & \text{in USA} & \text{in Canada} \\ \hline a & 19 & 1 & 1 & 1 & \\ b & 58 & 1 & & 1 & \\ c & 32 & 1 & & & 1 \\ d & 20 & & 1 & 1 & \\ e & 43 & 1 & 1 & & 1 \end{array}$$

The actual values in X are application-dependent. In the above, the age has been presented as a scalar, the languages (English, Spanish) as multinomial, and the location (USA, Canada) as categorical. The choice of {scalar, categorical, multinomial} is also model-dependent. For example, the age can also be represented as categorical, e.g., with mutually-exclusive choices ((0 to 25), (26 to 40), ...).

It is also possible to have edge features. For instance, a "friendship" edge in a social network can have attributes including the date of formation. In many cases, there is a single scalar feature on the edge (sometimes called the edge weight or edge cost). Nonetheless, there is no standard notation for indicating edge features.

2.2.5 Classical Graph Algorithms

There are many classical graph algorithms that are outside the scope of our work. Perhaps the two most popular algorithms are breadth-first search and depth-first search, which are two techniques for searching for one or more nodes in a graph. For example, they are used by LinkedIn to show you, when browsing a profile, the trail of users that can introduce you to the profile you are browsing. These algorithms can be found in a standard algorithms textbook (e.g., [Skiena, 2008]).
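As a small, self-contained illustration of one such classical algorithm, the sketch below runs breadth-first search on the running example's adjacency list to recover a shortest trail of introductions between two users; it is a generic textbook-style implementation, not code from the thesis.

    from collections import deque

    # Adjacency list of the running example graph.
    adj = {'a': ['b', 'd'], 'b': ['a', 'd', 'e'], 'c': ['e'],
           'd': ['a', 'b'], 'e': ['b', 'c']}

    def bfs_shortest_path(adj, source, target):
        # Breadth-first search explores nodes in order of hop distance from source,
        # so the first time target is reached, the recorded path is shortest.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if u == target:
                path = []
                while u is not None:
                    path.append(u)
                    u = parent[u]
                return path[::-1]
            for v in adj[u]:
                if v not in parent:
                    parent[v] = u
                    queue.append(v)
        return None   # target is unreachable from source

    print(bfs_shortest_path(adj, 'a', 'c'))   # -> ['a', 'b', 'e', 'c']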
Nonetheless, we are interested in Machine Learning algorithms over graphs: specifically, algorithms that can learn patterns from training data and infer unobserved outcomes on test (unlabeled) data, as will be described in Chapter 3.

Chapter 3: Machine Learning on Graphs

Applying machine learning (ML) techniques on graph data impacts a variety of domains. This chapter first explains why ML on graphs is practically useful, and then discusses how one could apply ML on graphs. Specifically, Chapter 3.1 first motivates the application of ML on graphs by highlighting the kinds of tasks that can be solved, and Chapter 3.2 gives concrete examples of how data from various domains can be stored as graph data structures. Then, Chapters 3.3, 3.4 and 3.5 introduce three general classes of ML models that have been extensively utilized by practitioners.

3.1 Tasks solvable by ML on Graphs

Section 3.1.1 describes what ML can generally do over graph structures. Then, Section 3.1.2 presents the two tasks, node classification and link prediction, that most of this thesis is shaped around.

3.1.1 General Tasks

• Node-level prediction. Given a graph G = (V, E), the goal is to produce predictions for one or more nodes in the graph. This includes node classification, e.g., predicting the political party that a user on a social network would vote for. It also includes regression to predict continuous values, e.g., a user's age based on the node's features (e.g., photo) and its edges (friendships).

• Edge-level prediction. This broadly falls into two categories: (i) inferring unknown edges or (ii) classifying existing ones. The first is more common: if a graph is given and the edges are partially observed, it is common to predict the unobserved edges, e.g., which edges might form next. This is known as link prediction, and it has major applications in social networks (e.g., friend recommendation) and e-shopping (e.g., predicting user-product interest). Nonetheless, most algorithms that solve one of these two tasks can be readily applied to solve the other.

• Graph-level prediction. The goal is to produce labels for an entire graph or for sub-graphs. For instance, given a graph describing a chemical molecule, material scientists are interested in predicting quantities about the molecule, such as its conductance or free energy.

• Node Clustering. The goal is to automatically group together nodes that are well-connected. Nodes (a) and (b) could be considered well-connected even if they are not directly connected: for instance, if they have many mutual connections.

The above tasks assume the graph is (partially) provided or deterministically constructed. The literature on generating graphs (both nodes and edges) is well-studied but falls outside the scope of this thesis; the curious reader can consult Chapter 8 of the textbook by Hamilton.

3.1.2 Specific Tasks of our Interests

From the general tasks described in §3.1.1, this section summarizes two tasks that are broadly applicable: link prediction (§3.1.2.1) and node classification (§3.1.2.2).

3.1.2.1 Link Prediction

Link prediction is the task of inferring missing edges in a graph. To evaluate the strength of an ML model for link-prediction tasks, it is common to partition the graph edges E into training edges $E_{\text{train}}$ and test edges $E_{\text{test}}$ such that $E_{\text{train}} \cup E_{\text{test}} = E$ and $E_{\text{train}} \cap E_{\text{test}} = \emptyset$. The test edges can also be referred to as "validation edges" or "held-out edges". Then, the ML model is trained using only $E_{\text{train}}$. After training, it is evaluated on how well it can predict the held-out edges $E_{\text{test}}$. There are a variety of evaluation metrics to quantify how well the model can predict $E_{\text{test}}$. The most common quantitative metrics are ranking-based: after training on $E_{\text{train}}$, does the model produce higher scores for edges in $E_{\text{test}}$ than it produces for negative edges? One of these metrics, which we utilize in this thesis, is known as the ROC-AUC: the area under the curve of the receiver operating characteristic.
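A minimal sketch of this ranking-based evaluation is shown below using scikit-learn's roc_auc_score; the edge scores are random placeholders standing in for the scores a trained model would assign to held-out edges in E_test and to sampled negative (absent) edges.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)

    # Placeholder scores: a trained model would produce one score per edge.
    test_edge_scores = rng.uniform(0.5, 1.0, size=100)   # held-out (positive) edges
    negative_scores = rng.uniform(0.0, 0.6, size=100)    # sampled non-edges

    scores = np.concatenate([test_edge_scores, negative_scores])
    labels = np.concatenate([np.ones(100), np.zeros(100)])

    # ROC-AUC: the probability that a random held-out edge is ranked above a
    # random negative edge. 1.0 is perfect ranking; 0.5 is chance level.
    print(roc_auc_score(labels, scores))

Because ROC-AUC only depends on the ranking of scores, it does not require the model's scores to be calibrated probabilities, which is why it is a convenient metric for comparing link-prediction models.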
social network) or implicitly by calculating pairwise similarities [e.g. using a patient-patient similarity kernel, Merdan et al., 2017]. To evaluate the strength of an ML model for node classication tasks, it is common to partition the nodesV into labeled training nodesV L and unlabeled test nodesV U , usually withjV L jjV U j i.e. implying that only a small fraction of nodes are labeled. Then, an ML model is trained given only labels inV L then tested on how-well it can predict labels for nodes inV U . Node classication tasks are also commonly referred to assemi-supervisednodeclassication, as one treats the graph (e.g. edgesE or featuresX) as the “unsupervised” and labels ofV L as the “supervised” portions of the data. There are several evaluation metrics to quantify how-well the model labels the nodes inV U . The most- popular choice is theaccuracy (a.k.a, top-1 label accuracy). Given a nodev2V U , the model produces a score forv for every label in the vocabulary. The label with the highest score will be checked against the 22 ground-truth label (if it is known). The accuracy is the fraction of model predicted labels that match the ground-truth. 3.2 ApplicationDomainsusingGraphs There many kinds of information that can be encoded as graphs, spanning various domains. Many more than we can aord to enumerate in this document. However, we will give just a few examples to give the reader some avors. Further, this is no strict formula on how to create graphs. It is an somewhat of an art to choose what are the nodes and what are the edges when encoding information as graphs. 3.2.1 SocialNetworks Many social networks like Facebook, Twitter, LinkedIn, Instagram, and others, can be represented as graphs. Specically, each node is a user (possibly, with their features such as age, location, ...) and an edge between two users could incidate that they are friends on the social network platform. Luckily, for academic purposes of testing graph algorithms, many graph datasets, including real social networks, have been anonmized by companies and released for academics as downloadable. Many of these academic datasets are bundled by theStanfordLargeNetworkDatasetCollection (SNAP) [Leskovec and Krevl, 2014b], which are summarized in Table 3.1. Some applications on social graphs include • Spam detection. Given a large social network, it is in the operator’s interest to predict which nodes (users) and/or edges (friendships) are likely formed by a spammer. These tasks fall under node- prediction and/or edge-prediction. 23 Table 3.1: Social Networks provided by [Leskovec and Krevl, 2014a]. 
Name Nodes Edges Description ego-Facebook 4,039 88,234 Social circles from Facebook (anonymized) ego-Gplus 107,614 13,673,453 Social circles from Google+ ego-Twitter 81,306 1,768,149 Social circles from Twitter soc-Epinions1 75,879 508,837 Who-trusts-whom network of Epinions.com soc-LiveJournal1 4,847,571 68,993,773 LiveJournal online social network soc-Pokec 1,632,803 30,622,564 Pokec online social network soc-Slashdot0811 77,360 905,468 Slashdot social network from November 2008 soc-Slashdot0922 82,168 948,464 Slashdot social network from February 2009 wiki-Vote 7,115 103,689 Wikipedia who-votes-on-whom network wiki-RfA 10,835 159,388 Wikipedia Requests for Adminship (with text) gemsec-Deezer 143,884 846,915 Gemsec Deezer dataset gemsec-Facebook 134,833 1,380,293 Gemsec Facebook dataset soc-RedditHyperlinks 55,863 858,490 Hyperlinks between subreddits on Reddit soc-sign-bitcoin-otc 5,881 35,592 Bitcoin OTC web of trust network soc-sign-bitcoin-alpha 3,783 24,186 Bitcoin Alpha web of trust network comm-f2f-Resistance 451 3,126,993 Face-to-face interaction network between group of people musae-twitch 34,118 429,113 Social networks of Twitch users. musae-facebook 22,470 171,002 Facebook page-page network with page names. musae-github 37,700 289,003 Social network of Github developers. • User interest classication. Given a large social network, is it possible to predict who is a football fan and who is not? This is a node-level prediction task. These usually feed into monetary goals, such as showing advertisements to users. • Friend recommendation. Given a graph, the goal is to predict which users will form friendship on the platform. This is an edge-level prediction task, specically, a link prediction one. 3.2.2 CitationNetworks Scientic articles can be represented as a graph, referred to as a citation network. Each articile can be represented as a node, containing features like: title, abstract, main text, and potentially one or more cat- egories (e.g., computer science, nance, etc). In addition, node (a) can have an edge pointing to node (b) if article (a) cites article (b). Representing scientic articles as a graph allows us to run Machine Learning 24 algorithms designed for graph data. For instance, one might be interested to automatically infer all cate- gories of all articles (node classication task). It is important to notice here that both the node features (e.g., text) as well as the edges (incoming and outgoing citations) can be useful in automatically categorizing articles. 3.2.3 WorldWideWeb It is possible to represent various data from the World Wide Web (a.k.a., internet) as graphs. We will just name two examples: • Web pages are nodes. An edge from node (a) to node (b) can signify that there is a link from page (a) to page (b). • Emails. Each user is a node. An edge from user (a) to user (b) can signify, for instance, that user (a) sent an email to user (b) within the past year. Representing this information as graphs, allows practitioners apply standard algorithms, including, spam detection of users or web pages (node classication), or automatically detecting communities of users or of web-pages. 3.2.4 Protein-ProteinInteractions Biologists and chemists are interested to see which proteins interact with which proteins. Such discoveries can be conrmed in a laboratory. However, it is useful to automatically predict which pairs of proteins interact. Lab experts can use these predictions to set priorities, as to which pairs of proteins to try experi- ments on. 
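Tasks like this are evaluated with the ranking protocol described in Section 3.1.2.1. Below is a minimal sketch of the ROC-AUC computation over held-out edges and uniformly sampled negative pairs, assuming scikit-learn is available; score_edge is a hypothetical stand-in for any trained edge-scoring model, and all names are illustrative rather than taken from any released code.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def evaluate_link_prediction(score_edge, test_edges, num_nodes, seed=0):
        # Positive examples: held-out edges E_test. Negatives: uniformly sampled
        # node pairs, assumed (with high probability on sparse graphs) to be non-edges.
        rng = np.random.default_rng(seed)
        pos_scores = [score_edge(u, v) for u, v in test_edges]
        neg_scores = [score_edge(rng.integers(num_nodes), rng.integers(num_nodes))
                      for _ in test_edges]
        labels = [1] * len(pos_scores) + [0] * len(neg_scores)
        return roc_auc_score(labels, pos_scores + neg_scores)

A model that ranks held-out edges above random pairs scores close to 1.0; random scoring yields about 0.5.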
Fortunately, scientists have created protein-protein graphs, where each node is a protein and an edge between two proteins indicates that they interact. On such a graph, it is possible to attempt link prediction, i.e., predicting which pairs of proteins interact [Perozzi et al., 2014, Grover and Leskovec, 2016, Abu-El-Haija et al., 2018].

3.3 Node Embedding Models

Node embedding models utilize the graph G = (V, E), or equivalently the adjacency matrix A, but do not use the node features X. These methods construct a continuous vector space R^z, the so-called embedding space, and project each graph node v ∈ V to a vector Z_v ∈ R^z. This projection is useful because graphs, in their native form (nodes and edges), contain discrete information, whereas Machine Learning models (e.g., neural network models) are designed to operate on continuous numeric inputs. Converting from discrete to continuous information, via embedding, is one feasible way to apply ML to information encoded as graphs. For instance, after learning an embedding for every node, one can train a model that maps a node embedding to a node label [Perozzi et al., 2014, Abu-El-Haija et al., 2018], or maps two node embeddings to an edge label [Perozzi et al., 2014, Grover and Leskovec, 2016, Abu-El-Haija et al., 2017].

Traditional methods utilize direct connections: the position of every node in the embedding space is affected by the positions of its direct neighbors, as summarized in §3.3.1. More recent methods consider a broader context: the position of a node in the embedding space is affected by a k-step neighborhood around it. This neighborhood is also known as a subgraph patch centered at the node. These methods are explained in §3.3.2.

3.3.1 Node Embedding Models using Direct Connections

An early embedding method, known as Eigenmaps [Belkin and Niyogi, 2003], learns a node embedding matrix Z ∈ R^{n×z}, with row Z_u ∈ R^z being the z-dimensional embedding of node u ∈ V, by minimizing the objective

    \min_Z \sum_{(u,v) \in E} \| Z_u - Z_v \|^2  \quad \text{subject to} \quad Z^\top D Z = I.        (3.1)

The minimization positions the two vectors Z_u and Z_v close to each other if u and v are neighbors. To avoid a trivial solution (e.g., setting all Z's to zero), the constraint Z^⊤ D Z = I ensures that the columns of Z are orthonormal.* The minimization above is exactly equivalent to a generalized eigenvalue decomposition of the graph Laplacian matrix L = D - A. This equivalence allows practitioners to obtain node embeddings using one line of Python code: specifically, the columns of Z are the eigenvectors of L corresponding to the least z eigenvalues. The equivalence is shown in Laplacian Eigenmaps [Belkin and Niyogi, 2003], justifying the name of the method.

* after scaling Z_u by d_u.

Other embedding methods in the literature target user-product recommendation [Koren et al., 2009, He et al., 2016b]. These methods construct a user-product graph: a bipartite graph, where each edge must be between a user and a product. They minimize an objective of the form

    \min_{Z^{(u)}, Z^{(p)}} \left\| Z^{(u)} (Z^{(p)})^\top - A \right\|_F^2,

where Z^{(u)} ∈ R^{#users×z} and Z^{(p)} ∈ R^{#products×z} learn a z-dimensional vector for every user and every product, respectively, and ||.||_F denotes the Frobenius norm of its matrix argument. Unlike the preceding Eigenmaps, these methods require the adjacency matrix to be neither symmetric nor square: here the adjacency A has (#rows, #columns) equal to (#users, #products), and entry A_{ij} = 1 if user i enjoys product j. Nonetheless, this construction also works for symmetric adjacency matrices.
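As noted above, the Eigenmaps solution reduces to an eigendecomposition of the graph Laplacian. A minimal sketch, assuming scipy, a symmetric sparse adjacency matrix, and no isolated nodes; the function and variable names are illustrative and not taken from any released code.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import eigsh

    def eigenmaps(A, z=16):
        # A: symmetric scipy sparse adjacency matrix (no isolated nodes, so D is invertible).
        deg = np.asarray(A.sum(axis=1)).ravel()
        D = sp.diags(deg)
        L = (D - A).tocsc()                       # unnormalized graph Laplacian
        # Generalized eigenproblem L v = lambda D v: take the smallest eigenvalues and
        # drop the trivial constant eigenvector associated with eigenvalue 0.
        vals, vecs = eigsh(L, k=z + 1, M=D, which='SM')
        return vecs[:, np.argsort(vals)[1:]]      # n x z embedding matrix Z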
The bipartite minimization above can be easily achieved by factorizing A, e.g., via UV-decomposition [Chapter 9 of Leskovec et al., 2014] or via Singular Value Decomposition (SVD) [Golub and Kahan, 1965], taking the top-z left- and right-singular vectors, respectively, as Z^{(u)} and Z^{(p)}; this again requires only a few lines of Python code.

3.3.2 Node Embedding Models Simulating Random Walks

Rather than using only the direct connections stored in the adjacency matrix A or the Laplacian L, these methods extrapolate beyond direct connections. The extrapolation is achieved through random walks: if two nodes u and v are not directly connected (i.e., A_{uv} = 0) but share many mutual connections, then many short random walks starting from u should also visit v.

These models are inspired by skipgram models on natural language [Mikolov et al., 2013]. Perozzi et al. [2014] observed analogies between real-world graphs and natural language: nodes have the same positional distribution in walk sequences as words have in sentences. For instance, n-gram statistics of nodes in walk sequences follow the natural Zipf law. The recipe introduced by Perozzi et al. [2014] simulates many random walks on an input graph. Each random walk starts from a randomly-sampled start node, e.g., v_0 ~ Uniform(V). The walk then repeatedly samples an edge to transition to the next node, v_{i+1} ~ T_{v_i}, yielding a node sequence v_0 -> v_1 -> v_2 -> ... Each random walk generates a sequence of nodes. The sequences are converted to textual paragraphs and passed to a word2vec-style embedding learning algorithm [Mikolov et al., 2013]. These embedding algorithms can be summarized in three steps:

1. Given a graph, simulate many random walks on the graph, generating many node sequences.

2. Pass these sequences, as if they were sentences, to a natural-language embedding algorithm such as word2vec [Mikolov et al., 2013] or GloVe [Pennington et al., 2014], which have open-source implementations.

3. Take the node embeddings to be the output of the language embedding algorithm.

These methods have achieved remarkable state-of-the-art results on many tasks, such as link prediction and node classification, on biological, social, citation, and many other real-world graphs.

Abu-El-Haija et al. [2018] studied this general class of graph embedding algorithms, showing that the three steps above are equivalent, in expectation, to decomposing a matrix \mathcal{C} of random-walk co-occurrence statistics. This square matrix is a weighted sum of the power series T, T^2, T^3, ..., and can be written as an expectation:

    \mathbb{E}[\mathcal{C}] \propto \mathbb{E}_{q \sim Q}\left[ T^q \right] = \mathbb{E}_{q \sim Q}\left[ (D^{-1} A)^q \right],        (3.2)

where \mathcal{C} ∈ R^{n×n} is the co-occurrence matrix from random walks, with entry \mathcal{C}_{vu} counting the number of times nodes v and u are co-visited within context distance q ~ Q across all simulated random walks; T = D^{-1} A is the row-normalized transition matrix (a.k.a. the right-stochastic adjacency matrix); and Q is a "context distribution" determined by random-walk hyperparameters, such as the length of the walk. The expectation therefore weights the importance of one node on another as a function of how well-connected they are and of the distance between them.

3.3.3 Node Embedding Framework

This section formalizes a node embedding framework that makes the text in subsequent chapters easier to present. For a more general formalization, we refer the reader to [Chami et al., 2022].
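Before instantiating this framework, the three-step random-walk recipe of §3.3.2 can be sketched in a few lines of Python. This is a minimal sketch, assuming gensim (version 4 or later) and uniform next-node sampling as in DeepWalk; all names are illustrative.

    import random
    from gensim.models import Word2Vec   # assumes gensim >= 4.0

    def simulate_walks(neighbors, num_walks=10, walk_length=40, seed=0):
        # neighbors: dict mapping each node to a list of its neighbors.
        random.seed(seed)
        walks = []
        for _ in range(num_walks):
            for start in neighbors:                      # step 1: simulate random walks
                walk = [start]
                while len(walk) < walk_length and neighbors[walk[-1]]:
                    walk.append(random.choice(neighbors[walk[-1]]))
                walks.append([str(v) for v in walk])     # nodes become "words"
        return walks

    def deepwalk_embeddings(neighbors, z=128):
        walks = simulate_walks(neighbors)
        # step 2: treat walks as sentences and run a word2vec-style skip-gram model
        model = Word2Vec(walks, vector_size=z, window=5, min_count=0, sg=1)
        # step 3: read node embeddings off the trained language model
        return {v: model.wv[str(v)] for v in neighbors}

Node2vec differs from this sketch only in how the next node is sampled (biased second-order walks instead of uniform sampling).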
Given an unweighted graph G = (V, E), its (sparse) adjacency matrix A ∈ {0, 1}^{n×n} can be constructed as A_{vu} = 1[(v, u) ∈ E], where the indicator function 1[.] evaluates to 1 iff its boolean argument is true. In general, graph embedding methods minimize an objective

    \min_Z \; \mathcal{L}\big( f(A), \, g(Z) \big),        (3.3)

where Z ∈ R^{n×z} is a z-dimensional node embedding dictionary; f : R^{n×n} -> R^{n×n} is a transformation of the adjacency matrix; g : R^{n×z} -> R^{n×n} is a pairwise edge function; and \mathcal{L} : R^{n×n} × R^{n×n} -> R is a loss function. Many popular embedding methods can be viewed through this framework. We show a couple of specializations in §3.3.3.1 and §3.3.3.2.

3.3.3.1 Direct Connections via the Node Embedding Framework

For instance, a stochastic† version of Singular Value Decomposition (SVD) is an embedding method and can be cast into the framework by setting f(A) = A; decomposing Z into two halves, the left and right representations‡, as Z = [L | R] with L, R ∈ R^{|V|×z/2}; setting g to their outer product, g(Z) = g([L | R]) = L R^⊤; and finally setting \mathcal{L} to the Frobenius norm of the error, yielding:

    \min_{L, R} \; \| A - L R^\top \|_F

† Orthonormality constraints are not shown.
‡ Also coined in NLP [Mikolov et al., 2013] as the "input" and "output" embedding representations.

3.3.3.2 Random Walks via the Node Embedding Framework

Embedding methods utilizing random walks can also be viewed through the framework of Equation 3.3. For example, to get Node2vec [Grover and Leskovec, 2016], we can set f(A) = \mathcal{C}, set the edge function to the embeddings' outer product g(Z) = Z Z^⊤, and set the loss function to the negative log-likelihood of a softmax, yielding:

    \min_Z \left[ \log \mathcal{Z} - \sum_{v,u \in V} (Z_v^\top Z_u) \, \mathcal{C}_{vu} \right],        (3.4)

where the partition function \mathcal{Z} = \sum_{v,u} \exp(Z_v^\top Z_u) can be estimated with negative sampling [Mikolov et al., 2013, Grover and Leskovec, 2016].

3.3.3.3 Graph Likelihood via the Node Embedding Framework

A recently-proposed objective for learning embeddings is the graph likelihood [Abu-El-Haija et al., 2017]:

    \prod_{v,u \in V} \sigma\big(g(Z)_{v,u}\big)^{\mathcal{C}_{vu}} \, \Big(1 - \sigma\big(g(Z)_{v,u}\big)\Big)^{1[(v,u) \notin E]},        (3.5)

where g(Z)_{v,u} is the output of the model evaluated at edge (v, u), given node embeddings Z, and the activation function σ(.) is the logistic. Maximizing the graph likelihood pushes the model score g(Z)_{v,u} towards 1 if the value \mathcal{C}_{vu} is large, and pushes it towards 0 if (v, u) ∉ E. The framework of Eq. 3.3 can express the minimization of the negative log of Eq. 3.5 as:

    \min_Z \; \Big\| -\mathcal{C} \circ \log \sigma(g(Z)) \; - \; 1[A = 0] \circ \log\big(1 - \sigma(g(Z))\big) \Big\|_1,        (3.6)

which we minimize w.r.t. node embeddings Z ∈ R^{n×z}, where ∘ is the Hadamard product and the L1-norm ||.||_1 of a matrix is the sum of its entries; the entries of this matrix are positive because 0 < σ(.) < 1. The matrix \mathcal{C} ∈ R^{n×n} can be created "by simulation" (the first step of the random-walk methods, §3.3.2). Alternatively, it can be expressed analytically, which allows the objective function to update the value of the context distribution Q, as explained in Chapter 6.

3.3.4 How to Utilize Node Embeddings for Inference

Learning embeddings using the graph structure G = (V, E), without task-specific labels, can be considered unsupervised learning. After running a node embedding algorithm (from the above), we obtain Z ∈ R^{n×z}, with one embedding per node. Afterwards, it is possible to train further models that map the node embeddings onto some task. Specifically, consider:

• Node Classification. Given a graph G = (V, E) with only a portion of the nodes labeled, the goal is to recover the labels of the unlabeled nodes. One can first train embeddings Z ∈ R^{n×z} and then train a neural network h : R^z →
[0; 1] y , wherey is the number of classes, for example, using cross entropy as an objective and SGD as a learning algorithm. This has been done in [Perozzi et al., 2014, Grover and Leskovec, 2016]. There is also an alternative to avoid training the neural networkh and instead use avoting mechanism: the label of an unlabeled node, could be the average vote of nearest neighbors (in the learned embedding space), as presented in [Abu-El-Haija et al., 2018]. • Link Prediction. Similar to the above, after training the embeddings, one can train a neural net- work. The neural network can be symmetric i.e. h(Z u ;Z v ) = h(Z v ;Z u ) like in [Grover and Leskovec, 2016] (the rst layer computes a hadamard product) which could work better for undi- rected graphs, or be asymmetric i.e.h(Z u ;Z v )6=h(Z v ;Z u ) which has a large advantage in model- ing information in directed graphs like in [Abu-El-Haija et al., 2017] (h(Z u ;Z v ) =Z > u MZ v , where M is a (low-rank) square matrix and forms the parameters ofh). • Visualization. Euclidean spaces can be visualized on a piece-of-paper (as a 2D or even 3D projec- tion). This is possible ifz was small (i.e. 1, 2, or 3, as shown in [Abu-El-Haija et al., 2017]), or one could use a largez (e.g. 200), then further minimize the dimensions (onto 2) using t-SNE, which itself invents a new embedding coordinates with an objective that approximately preserves node distance orderings [van der Maaten and Hinton, 2008]: in particular, if in the learned (200-dimensional) the 32 nodes have coordinates such that the closest neighbors to nodeu are nodesv 1 ;v 2 ;v 3 ;::: , in that or- der, then t-SNE constructs a (2-dimensional) space where the closest neighbors tou are maintained: v 1 ;v 2 ;v 3 ;::: . The above three are extremely useful for many applications. In general, it is “easier” to work with nodes when they are in a vector space, as there are many methods with handling continuous vectors. Moreover, even though we described that these tasks can be solved in two steps: unsupervised embedding learning, followed by neural network modelh, it is possible to train the embeddings jointly with the model, as shown by Abu-El-Haija et al. [2017]. If so, one must be aware that in order to use the standard learning algorithm (stochastic gradient descent), one must either careful initialize the node embeddingsZ, or alternatively, use an adaptable learning-rate algorithm such as ADAM [Ba and Kingma, 2015] or PercentDelta [Abu- El-Haija, 2017], as gradients of the objective w.r.t. Z are always proportional to average norm of rows of Z. 3.4 ModelsthatuseGraphsasRegularizers Unlike the section above, many methods utilize the node feature matrixX2R nd , and train a model e.g. neural networkh :R d !R y , mapping from node features to node labels, using cross-entropy objective if the labels are binary (or, using Euclidean distance objective, if labels are continuous). If no graph was given, the objective would be: min h L(h;D) where the datasetD =f(x (1) ;y (1) ); (x (2) ;y (2) );:::g is such that the feature vectorsx (1) ;x (2) ;::: are rows inX for which labels are available. It need not be that all rows inX have associated labels, in which case, the problem is known assemi-supervisednodeclassication. 33 So far, the minimization above does not utilize the graph informationG = (V;E). Belkin et al. 
[2006] propose to use the graph to regularize the objective. Amending the above, they propose the minimization:

    \min_h \left( \mathcal{L}(h; \mathcal{D}) + \lambda \sum_{(u,v) \in E} \| h(X_u) - h(X_v) \|^2 \right),

where the first term under the minimization is the same as in the previous equation (e.g., cross entropy or squared Euclidean distance), and the second term is known as a graph regularization term, with a positive scalar λ ≥ 0 as a hyper-parameter that is manually tuned to weigh the contribution of the graph regularization. The above expression can also be written in terms of the graph Laplacian matrix L, as:

    \min_h \; \mathcal{L}(h; \mathcal{D}) + \lambda \, h(X)^\top L \, h(X),

where applying h to the matrix X overloads the notation (and frequently appears): it is equivalent to applying h independently to each row of X and concatenating the output vectors, row by row, into a matrix, implying in this situation that h : R^{n×d} -> R^{n×y}. Recall that h can be a neural network mapping each example from its d-dimensional input features to a y-dimensional output.

The term regularizer is often used in Machine Learning, and is almost always applied to the model parameters (e.g., an L2 penalty, which is not covered in this document). Formally, a regularizer is a term in the objective function that lives strictly outside the model h: it only affects the learning of the model, and is completely discarded after training finishes.

Even though the classical work of Belkin et al. [2006] applies the graph regularization at the output of the model, with the deep learning era other researchers, including Bui et al. [2018], have proposed applying the graph regularization in the middle of the model (e.g., the neural network). For example, assume that the model h can be written as a composition h(x) = g(f(x)), where g and f are each neural networks. Following this, the minimization can be written as:

    \min_h \; \mathcal{L}(h; \mathcal{D}) + \lambda \, f(X)^\top L \, f(X),

where the minimization subscript implies the parameters of h, which also comprise the parameters of its constituents f and g.

3.5 Models that use Message Passing

In the previous two sections, 3.3 and 3.4, we summarized two broad use-cases where ML practitioners can utilize graphs: respectively, using the graph to project nodes onto a continuous vector space (a.k.a. an embedding space), and using the graph to regularize a model, where the model itself does not otherwise see the graph and only the objective function processes it. In this section, we focus on a third class of ML algorithms utilizing graphs: specifically, algorithms that model nodes in a graph as objects that communicate via edges.

3.5.1 Message Passing

Message passing algorithms can be used to learn models over graphs [Gilmer et al., 2017]. In such models, each graph node (and optionally each edge) holds a latent feature vector, initialized to the node's input features. Each node repeatedly passes its current latent vector to, and aggregates incoming messages from, its immediate neighbors. After l steps of message passing and feature aggregation, every node outputs a representation which can be used for an upstream task, e.g., node classification or entire-graph classification. The l steps (message passing and aggregation) can be parametrized and trained via backprop-through-structure algorithms [Goller and Kuchler, 1996], to minimize an objective measured on the node representations output by the l-th step.

3.5.2 Graph Convolutional Networks

These models are among the most popular message passing models.
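Before specializing to graph convolutions, a single step of the generic scheme in §3.5.1 can be sketched as follows. This is a minimal sketch with mean aggregation, assuming scipy/numpy; the function name and the choice of aggregator are illustrative.

    import numpy as np
    import scipy.sparse as sp

    def message_passing_step(A, H, W, act=lambda z: np.maximum(z, 0)):
        # One round of message passing: every node averages the latent vectors of its
        # immediate neighbors (and itself), then applies a learnable transform and a
        # non-linearity. A: sparse adjacency, H: n x d latent matrix, W: d x d' weights.
        A_hat = A + sp.eye(A.shape[0])                 # add self-connections
        deg = np.asarray(A_hat.sum(axis=1)).ravel()
        A_mean = sp.diags(1.0 / deg) @ A_hat           # row-normalize: mean aggregation
        return act(A_mean @ H @ W)

Stacking l such steps lets information travel up to l hops before the final node representations are read out.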
Graph Convolution [Bruna et al., 2014, Ortega et al., 2018] generalizes convolution from Euclidean domains to graph-structured data. Convolving a “lter” over a signal on graph nodes can be calculated by transforming both the lter and the signal to the Fourier domain, multiplying them, and then transforming the result back into the discrete domain. The signal transform is achieved by multiplying with the eigenvectors of the graph Laplacian. The transformation requires a quadratic eigendecomposition of the symmetric Laplacian; however, the low- rank approximation of the eigendecomposition can be calculated using truncated Chebyshev polynomials [Hammond et al., 2011]. For instance, Kipf and Welling [2017] calculates a rank-1 approximation of the decomposition. They propose a multi-layer Graph Convolutional Networks (GCNs) for semi-supervised graph learning. Every layer computes the transformation: H (l+1) = b AH (l) W (l) ; (3.7) whereH (l) 2R nd l is the input activation matrix to thel-th hidden layer with rowH (l) i containing a d l -dimensional feature vector for vertexi2V, andW (l) 2R d l d l+1 is the layer’s trainable weights. The rst hidden layerH (0) is set to the input featuresX. A softmax on the last layer is used to classify labels. All layers use the same “normalized adjacency” b A, obtained by the “renormalization trick” utilized by Kipf and Welling [2017], as b A =D 1 2 AD 1 2 . § Eq. 3.7 is a rst order approximation of convolving lterW (l) over signalH (l) [Hammond et al., 2011, Kipf and Welling, 2017]. The left-multiplication with b A averages node features with their direct neighbors; § Self-connections added asAii = 1, similarly to Kipf and Welling [2017]. 36 this signal is then passed through a non-linearity function() (e.g, ReLU(z) = max(0;z)). Successive layers eectively diuse signals from nodes to neighbors. 3.6 ConclusionofPartI&Goingforward Part I prepared the reader with the technical material required for understanding the thesis. Specically, Chapter 3 introduces a variety of methods for Graph Representation Learning (GRL). Part I lists two broad tasks in GRL,linkprediction andnodeclassication, and explains how several applications from many domains can be cast into those two tasks. Going forward, the remaining parts improve state-of-the-art GRL methods in many ways. Part II presents label-ecient models fornodeclassication, that improve upon baselines presented in Chapter 3. Moreover, Part III shows how one can train faster the methods presented in 3 and using less experi- ments, i.e., improving eciency of the machine and the human practitioner. Finally, Part III scales these GRL methods to handle large graphs. 37 PartII-Data-ecientModelsfornodeclassicationandlinkprediction Part II of the thesis presents two models that show state-of-the-art performance for node classication tasks. Specically, Chapters 4 & 5 both extend Graph Convolution models (§7.2) to allow every node obtain its representation from its immediate neighbors as well as further neighbors. These models were evaluated on a number of datasets, including citation networks and synthetic homophily graphs, show- casing strong model performance when only a small fraction of nodes are labeled, and also in the presence of data noise. 38 Chapter4 N-GCNModel: Multi-scaleGraphConvolutionNetworks Graph Convolutional Networks (GCNs; §3.5.2) have shown signicant improvements in semi-supervised node classication for graph-structured data. 
Concurrently, node embeddings learned by simulating random walks (§3.3.2) show stronger empirical performance than direct embedding methods (§3.3.1). This chapter describes a model, Network of GCNs (N-GCN), which marries these two lines of work and targets semi-supervised node classification tasks. The contents of this chapter can be found standalone in [Abu-El-Haija et al., 2019] and the source code for reproducing the results is available at https://github.com/samihaija/mixhop

At its core, N-GCN trains multiple instances of GCNs over node pairs discovered at different distances in random walks, and learns a combination of the instance outputs which optimizes the classification objective. Experiments show that the N-GCN model improves state-of-the-art baselines on all considered node classification datasets: Cora, Citeseer, Pubmed, and PPI. Moreover, the method has other desirable properties, including generalization to other GCN models (such as GraphSAGE). In addition to strong empirical performance on uncorrupted data, the N-GCN models are robust in the presence of noise, in realistic scenarios where feature or edge information is noisy.

4.1 N-GCN Model Intuition

Fig. 4.1 depicts the N-GCN model for semi-supervised node classification. It builds on the GCN module proposed by Kipf and Welling [2017], which operates on the normalized adjacency matrix Â, as in GCN(Â), where Â = D^{-1/2} A D^{-1/2} and D is the diagonal matrix of node degrees. Our proposed extension of GCNs is inspired by recent advancements in random-walk based graph embeddings [e.g. Perozzi et al., 2014, Grover and Leskovec, 2016, Abu-El-Haija et al., 2018]. We make a network of GCN modules (N-GCN), feeding each module a different power of Â, as in {GCN(Â^0), GCN(Â^1), GCN(Â^2), ...}. The k-th power contains statistics from the k-th step of a random walk on the graph. Therefore, our N-GCN model is able to combine information from various step sizes (i.e., graph scales). We then combine the output of all GCN modules into a classification sub-network, and we jointly train all GCN modules and the classification sub-network on the upstream objective for semi-supervised node classification. The weights of the classification sub-network give us insight into how the N-GCN model works. For instance, in the presence of input perturbations, we observe that the classification sub-network weights shift towards GCN modules utilizing higher powers of the adjacency matrix, effectively widening the "receptive field" of the (spectral) convolutional filters. We achieve state-of-the-art on several semi-supervised graph learning tasks, showing that explicit random walks enhance the representational power of vanilla GCNs.

The rest of this chapter is organized as follows. §4.2 describes our N-GCN model and §4.3 presents an experimental evaluation of the N-GCN model. Before we dive into N-GCN, we remind the reader of the vanilla GCN model. A two-layer GCN model (as proposed by [Kipf and Welling, 2017]) can take the form:

    \mathrm{GCN}_{\text{2-layer}}(\hat{A}, X; \theta) = \mathrm{softmax}\Big( \hat{A} \, \sigma\big( \hat{A} X W^{(0)} \big) W^{(1)} \Big),        (4.1)

Figure 4.1: (a) The N-GCN architecture, where Â is the normalized adjacency matrix, I is the identity matrix, X is the node features matrix, and boxes are composed by matrix-matrix multiplication; the classification networks for N-GCN_fc and N-GCN_a are described in Sections 4.2.2.1 and 4.2.2.2, respectively. (b) t-SNE visualization of the fully-connected (fc) hidden layer of N-GCN when trained on the Cora graph.
We calculate K = 3 powers of Â, feeding each power into r = 1 GCNs, along with X. The output of all K × r GCNs can be concatenated along the column dimension and then fed into fully-connected layers, outputting C channels per node, where C is the size of the label space. We calculate the cross-entropy error between the rows of the n × C prediction matrix that have known labels, and use it to update the parameters of the classification sub-network and all GCNs. Panel (b) shows pre-ReLU activations after the first fully-connected layer of a 2-layer classification sub-network trained on Cora; activations are PCA-ed to 50 dimensions and then visualized using t-SNE.

In Eq. 4.1, the GCN takes vertex features X and the normalized adjacency Â, and the GCN parameters θ = (W^{(0)}, W^{(1)}) are trained to minimize the cross-entropy error over labeled examples. The output of the GCN model is a matrix in R^{n×C}, where C is the number of labels. Each row contains the label scores for one node, assuming there are C classes.

4.2 N-GCN Model Construction

Recent work has shown that random walk statistics can be very powerful for learning an unsupervised representation of nodes that preserves the structure of the graph [Perozzi et al., 2014, Grover and Leskovec, 2016, Abu-El-Haija et al., 2018]. This chapter offers a way to enhance the Graph Convolutional Networks (GCNs) of Kipf and Welling [2017] with random walks. This enables the GCN model to capture information from broader neighborhoods when making node-level predictions.

Under special conditions, it is possible for the GCN model to learn random walks. In particular, consider the two-layer GCN defined in Eq. 4.1 with the assumptions that the first-layer activation is the identity, σ(z) = z, and that the weight W^{(0)} is an identity matrix (either explicitly set, or learned to satisfy the upstream objective). Under these two identity conditions, the model reduces to:

    \mathrm{GCN}_{\text{2-layer-special}}(\hat{A}, X) = \mathrm{softmax}\big( \hat{A}^2 X W^{(1)} \big),        (4.2)

where Â^2 can be expanded as

    \hat{A}^2 = D^{-1/2} A D^{-1/2} D^{-1/2} A D^{-1/2} = D^{-1/2} A \big( D^{-1} A \big) D^{-1/2} = D^{-1/2} A \, T \, D^{-1/2}.        (4.3)

By multiplying the adjacency A with the transition matrix T, GCN_{2-layer-special} effectively performs a one-step random walk, diffusing signals from nodes to neighbors without non-linearities, and then applies a non-linear graph convolution layer.

4.2.1 Explicit Random Walks

The special conditions described above are not true in practice. Although stacking hidden GCN layers allows information to flow through graph edges, this flow is indirect, as the information goes through feature reduction (matrix multiplication) and a non-linearity (activation function σ(.)). Therefore, the vanilla GCN cannot directly learn high powers of Â, and could struggle to model information across distant nodes. We hypothesize that making the GCN directly operate on random walk statistics will allow the network to better utilize information across distant nodes, in the same way that node embedding methods that simulate random walks (e.g. DeepWalk, Perozzi et al. [2014], §3.3.2) are superior to traditional embedding methods that form the objective directly on the adjacency matrix (e.g. Eigenmaps, Belkin and Niyogi [2003], §3.3.1). Therefore, in addition to feeding only Â to the GCN model as proposed by Kipf and Welling [2017] (see Eq. 4.1), we propose to feed a K-degree polynomial of Â to K instantiations of GCN. Generalizing Eq. 4.3 to an arbitrary power k gives:

    \hat{A}^k = D^{-1/2} A \, T^{k-1} \, D^{-1/2}.        (4.4)

We also define Â^0 to be the identity matrix.
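A minimal sketch of constructing the inputs {Â^0, Â^1, ..., Â^{K-1}} with scipy follows; names are illustrative, self-connections are assumed to have been added to A beforehand (as discussed next), and no node has degree zero. In practice one can avoid materializing the powers altogether by multiplying Â into the feature matrix right to left.

    import numpy as np
    import scipy.sparse as sp

    def adjacency_powers(A, K):
        # A: scipy sparse adjacency matrix. Returns [A_hat^0, ..., A_hat^{K-1}],
        # where A_hat = D^{-1/2} A D^{-1/2}, i.e. Eq. 4.4 with k = 1.
        deg = np.asarray(A.sum(axis=1)).ravel()
        D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
        A_hat = (D_inv_sqrt @ A @ D_inv_sqrt).tocsr()
        powers = [sp.eye(A.shape[0], format='csr')]    # A_hat^0 = I
        for _ in range(K - 1):
            powers.append((A_hat @ powers[-1]).tocsr())
        return powers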
Similar to Kipf and Welling [2017], we add self-connections and convert directed graphs to undirected ones, making Â, and hence Â^k, symmetric matrices. The eigendecomposition of symmetric matrices is real. Therefore, the low-rank approximation of the eigendecomposition [Hammond et al., 2011] is still valid, and one layer of Kipf and Welling [2017] utilizing Â^k should still approximate multiplication in the Fourier domain.

4.2.2 Network of GCNs

Consider K instantiations {GCN(Â^0, X), GCN(Â^1, X), ..., GCN(Â^{K-1}, X)}. Each GCN outputs a matrix in R^{N×C_k}, where the v-th row describes a latent representation of that particular GCN for node v ∈ V, and C_k is the latent dimensionality. Though C_k can be different for each GCN, we set all C_k to be the same for simplicity. We then combine the outputs of all K GCNs and feed them into a classification sub-network, allowing us to jointly train all GCNs and the classification sub-network via backpropagation. This should allow the classification sub-network to choose features from the various GCNs, effectively allowing the overall model to learn a combination of features using the raw (normalized) adjacency, different steps of random walks (i.e., graph scales), and the input features X (as they are multiplied by the identity Â^0).

4.2.2.1 Fully-Connected Classification Network

From a deep learning perspective, it is intuitive to represent the classification network as a fully-connected layer. We concatenate the outputs of the K GCNs along the column dimension, i.e., concatenating all GCN(X, Â^k), each in R^{n×C_k}, into a matrix in R^{n×C_K}, where C_K = Σ_k C_k. We add a fully-connected layer f_fc : R^{n×C_K} -> R^{n×C}, with trainable parameter matrix W_fc ∈ R^{C_K×C}, written as:

    \text{N-GCN}_{fc}(\hat{A}, X; W_{fc}, \theta) = \mathrm{softmax}\Big( \big[ \mathrm{GCN}(\hat{A}^0, X; \theta^{(0)}) \;\Vert\; \mathrm{GCN}(\hat{A}^1, X; \theta^{(1)}) \;\Vert\; \dots \big] W_{fc} \Big).        (4.5)

The classifier parameters W_fc are jointly trained with the GCN parameters θ = {θ^{(0)}, θ^{(1)}, ...}. We use the subscript fc on N-GCN to indicate that the classification network is a fully-connected layer.

4.2.2.2 Attention Classification Network

We also propose a classification network based on "softmax attention", which learns a convex combination of the GCN instantiations. Our attention model (N-GCN_a) is parametrized by a vector m ∈ R^K, one scalar per GCN. It can be written as:

    \text{N-GCN}_{a}(\hat{A}, X; m, \theta) = \sum_k m_k \, \mathrm{GCN}(\hat{A}^k, X; \theta^{(k)}),

where the vector m is the output of a softmax, m = softmax(m̃), and m̃ is a vector of weights that are updated as parameters of the model. This softmax attention is similar to a "Mixture of Experts" model, especially if we set the number of output channels of all GCNs equal to the number of classes, as in C_0 = C_1 = ... = C. This allows us to add cross-entropy loss terms on all GCN outputs, in addition to the loss applied at the output of N-GCN, forcing all GCNs to be independently useful. It is possible to set the parameter vector m ∈ R^K "by hand" using the validation split, especially for reasonable K such as K ≤ 6. One possible choice is setting m_0 to some small value and the remaining m_1, ..., m_{K-1} to the harmonic series 1/k; another choice may be linear decay (K - k)/(K - 1). These are respectively similar to the context distributions of GloVe [Pennington et al., 2014] and word2vec [Mikolov et al., 2013, Levy et al., 2015]. We note that if, on average, a node's information is captured by its direct or nearby neighbors, then the outputs of GCNs consuming lower powers of Â should be weighted highly.
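To make the construction concrete, the forward pass of the attention variant can be sketched as follows. This is a minimal sketch, not the released implementation: gcn_fns stands for K already-built GCN modules (one per power of Â, each a callable taking a sparse matrix and the features), and m_tilde holds the unnormalized attention weights; all names are illustrative.

    import numpy as np
    import scipy.sparse as sp

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def ngcn_attention_forward(A_hat, X, gcn_fns, m_tilde):
        # Sketch of N-GCN_a: combine the K GCN outputs with softmax attention weights.
        m = softmax(np.asarray(m_tilde, dtype=float))
        P = sp.eye(A_hat.shape[0]).tocsr()        # A_hat^0 = I
        out = 0.0
        for k, gcn_k in enumerate(gcn_fns):
            out = out + m[k] * gcn_k(P, X)        # each GCN sees a different power of A_hat
            P = A_hat @ P                          # next power, computed sparsely
        return out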
4.2.3 Training

We minimize the cross entropy between our model output and the known training labels Y as:

    \min_{\theta} \; - \, \mathrm{diag}(V_L) \Big[ Y \circ \log \text{N-GCN}(X, \hat{A}) \Big],        (4.6)

where ∘ is the Hadamard product and diag(V_L) denotes a diagonal matrix whose entry (i, i) is 1 if i ∈ V_L and 0 otherwise. In addition, we can apply intermediate supervision for N-GCN_a, to attempt to make all GCNs independently useful, yielding the minimization objective:

    \min_{m, \theta} \; - \, \mathrm{diag}(V_L) \Big[ Y \circ \log \text{N-GCN}_{a}(\hat{A}, X; m, \theta) \; + \; \sum_k Y \circ \log \mathrm{GCN}(\hat{A}^k, X; \theta^{(k)}) \Big].

4.2.4 GCN Replication Factor r

To simplify notation, our N-GCN derivations (e.g., Eq. 4.5) assume that there is one GCN per power of Â. However, our implementation feeds every Â^k to r GCN modules, as shown in Fig. 4.1.

4.2.5 Relation to other GCN variants

In addition to vanilla GCNs [e.g. Kipf and Welling, 2017], our derivation also applies to other graph models, including GraphSAGE [Hamilton et al., 2017]. Algorithm 1 shows a generalization that allows us to make a network of arbitrary graph models (e.g., GCN, SAGE, or others). Algorithms 2 and 3, respectively, show pseudo-code for the vanilla GCN [Kipf and Welling, 2017] and GraphSAGE* [Hamilton et al., 2017]. Finally, Algorithm 4 defines our full Network of GCNs model (N-GCN) by plugging Algorithm 2 into Algorithm 1. Similarly, Algorithm 5 defines our N-SAGE model by plugging Algorithm 3 into Algorithm 1.

Algorithm 1 General Implementation: Network of Graph Models
Require: Â is a normalization of A
1: function Network(GraphModelFn, Â, X, L, r = 4, K = 6, ClassifierFn = FcLayer)
2:     P ← I
3:     GraphModels ← []
4:     for k = 1 to K do
5:         for i = 1 to r do
6:             GraphModels.append(GraphModelFn(P, X, L))
7:         P ← Â P
8:     return ClassifierFn(GraphModels)

Algorithm 2 GCN Model [Kipf and Welling, 2017]
Require: Â is a normalization of A
1: function GcnModel(Â, X, L)
2:     Z ← X
3:     for i = 1 to L do
4:         Z ← σ(Â Z W^{(i)})
5:     return Z

Algorithm 3 SAGE Model [Hamilton et al., 2017]
Require: Â is a normalization of A
1: function SageModel(Â, X, L)
2:     Z ← X
3:     for i = 1 to L do
4:         Z ← σ([Z ‖ Â Z] W^{(i)})
5:         Z ← L2NormalizeRows(Z)
6:     return Z

Algorithm 4 N-GCN
1: function Ngcn(A, X, L = 2)
2:     D ← diag(A 1)            ▷ sum rows
3:     Â ← D^{-1/2} A D^{-1/2}
4:     return Network(GcnModel, Â, X, L)

Algorithm 5 N-SAGE
1: function Nsage(A, X)
2:     D ← diag(A 1)            ▷ sum rows
3:     Â ← D^{-1} A
4:     return Network(SageModel, Â, X, 2)

We can recover the original GCN [Kipf and Welling, 2017] and SAGE [Hamilton et al., 2017] algorithms, respectively, by using Algorithms 4 (N-GCN) and 5 (N-SAGE) with r = 1, K = 1, an identity ClassifierFn, and modifying line 2 in Algorithm 1 to P ← Â. Moreover, we can recover the original DCNN [Atwood and Towsley, 2016] by calling Algorithm 4 with L = 1, r = 1, modifying line 3 to Â ← D^{-1} A, and keeping K > 1, as their proposed model operates on the power series of the transition matrix, i.e., unmodified random walks, like ours.

* Our implementation assumes the mean-pool aggregation of Hamilton et al. [2017], which performs on par with their top performer, max-pool aggregation. In addition, our Algorithm 3 lists a full-batch implementation, whereas [Hamilton et al., 2017] offer a mini-batch implementation.

4.3 N-GCN Experiments

4.3.1 Datasets

We experiment on three citation graph datasets (Pubmed, Citeseer, Cora) and one biological graph: Protein-Protein Interactions (PPI). We choose these datasets because they are available online and are used by our baselines. The citation datasets are prepared by Yang et al. [2016a], and the PPI dataset is prepared by Hamilton et al. [2017]. Table 4.1 summarizes dataset statistics.
Each node in the citation datasets represents an article published in the corresponding journal. An edge between two nodes represents a citation from one article to another, and a label represents the subject of the article. Each dataset contains a binary Bag-of-Words (BoW) feature vector for each node. The BoW are extracted from the article abstract. Therefore, the task is to predict the subject of articles, given the BoW of their abstract and the citations to other (possibly labeled) articles. Following Yang et al. [2016a] and Kipf and Welling [2017], we use 20 nodes per class for training, 500 (overall) nodes for validation, and 1000 nodes for evaluation. We note that the validation set is larger than trainingjV L j for these datasets. The PPI graph, as processed and described by Hamilton et al. [2017], consists of 24 disjoint subgraphs, each corresponding to a dierent human tissue. 20 of those subgraphs are used for training, 2 for validation, and 2 for testing, as partitioned by Hamilton et al. [2017]. 4.3.2 BaselineMethods For the citation datasets, we copy baseline numbers from Kipf and Welling [2017]. These include label propagation (LP, Zhu et al. [2003]); semi-supervised embedding (SemiEmb, Weston et al. [2012]); manifold regularization (ManiReg, Belkin et al. [2006]); skip-gram graph embeddings [DeepWalk Perozzi et al., 2014]; 47 Table 4.1: Datasets for experiments. For citation datasets, 20 training nodes per class are observed (jV L j = 20C). Dataset Type Nodes Edges Classes Features Labelednodes jVj jEj C F jV L j Citeseer citaction 3,327 4,732 6 (single class) 3,703 120 Cora citaction 2,708 5,429 7 (single class) 1,433 140 Pubmed citaction 19,717 44,338 3 (single class) 500 60 PPI biological 56,944 818,716 121 (multi-class) 50 44,906 Iterative Classication Algorithm [ICA, Lu and Getoor, 2003]; Planetoid [Yang et al., 2016a]; vanilla GCN [Kipf and Welling, 2017]. For PPI, we copy baseline numbers from Hamilton et al. [2017], which include GraphSAGE with LSTM aggregation (SAGE-LSTM) and GraphSAGE with pooling aggregation (SAGE). Further, for all datasets, we use our implementation to run baselines DCNN [Atwood and Towsley, 2016], GCN [Kipf and Welling, 2017], and SAGE [with pooling aggregation, Hamilton et al., 2017], as these baselines can be recovered as special cases of our algorithm, as explained in Section 4.2.5. 4.3.3 Implementation We use TensorFlow[Abadi et al., 2016] to implement our methods, which we use to also measure the performance of baselines GCN, SAGE, and DCNN. For our methods and baselines, all GCN and SAGE modules that we train are 2 layers † , where the rst outputs 16 dimensions per node and the second outputs the number of classes (dataset-dependent). DCNN baseline has one layer and outputs 16 dimensions per node, and its channels (one per transition matrix power) are concatenated into a fully-connected layer that outputs the number of classes. We use 50% dropout and L 2 regularization of 10 5 for all of the aforementioned models. † except as clearly indicated in Table 4.5 48 Table 4.2: Node classication performance (% accuracy for the rst three citation datasets and F1 micro- averaged for multiclass PPI), using data splits of Yang et al. [2016a], Kipf and Welling [2017] and Hamilton et al. [2017]. We report the test accuracy corresponding to the run with the highest validation accuracy. Results above the horizontal line are copied from Kipf and Welling [2017] and from Hamilton et al. [2017] for the transductive and inductive case respectively. 
Because our code can recover other algorithms (as explained in Section 4.2.5), we show our implementations of these baselines. Finally, our proposed models are at the end of each table.

    Method (Transductive)                    Citeseer    Cora    Pubmed
    ManiReg [Belkin et al., 2006]              60.1       59.5     70.7
    SemiEmb [Weston et al., 2012]              59.6       59.0     71.1
    LP [Zhu et al., 2003]                      45.3       68.0     63.0
    DeepWalk [Perozzi et al., 2014]            43.2       67.2     65.3
    ICA [Lu and Getoor, 2003]                  69.1       75.1     73.9
    Planetoid [Yang et al., 2016a]             64.7       75.7     77.2
    GCN [Kipf and Welling, 2017]               70.3       81.5     79.0
    DCNN (our implementation)                  71.1       81.3     79.3
    GCN (our implementation)                   71.2       81.0     78.8
    SAGE (our implementation)                  63.5       77.4     77.6
    N-GCN (ours)                               72.2       83.0     79.5
    N-SAGE (ours)                              71.0       81.8     79.4

    Method (Inductive)                         PPI
    SAGE-LSTM [Hamilton et al., 2017]          61.2
    SAGE [Hamilton et al., 2017]               60.0
    DCNN (our implementation)                  44.0
    GCN (our implementation)                   46.2
    SAGE (our implementation)                  59.8
    N-GCN (ours)                               46.8
    N-SAGE (ours)                              65.0

4.3.4 Node Classification Accuracy

Table 4.2 shows node classification accuracy results. We run 20 different random initializations for every model (baselines and ours), train using the Adam optimizer [Ba and Kingma, 2015] with a learning rate of 0.01 for 600 steps, and capture the model parameters at peak validation accuracy to avoid overfitting. For our models, we sweep our hyperparameters r, K, and the choice of classification sub-network ∈ {fc, a}. For baselines and our models, we choose the model with the highest accuracy on the validation set, and use it to record metrics on the test set in Table 4.2.

Table 4.3: Sensitivity analysis. Color-coded model performance with varying random walk steps K ∈ {1, 2, 3, 4, 5} and replication factor r ∈ {1, 2, 4}. Darker color means better performance, and the color scale is comparable only within the same matrix. Overall, model performance increases with larger K and r. In addition, having random walk steps (larger K) boosts performance more than increasing model capacity (larger r). [The table consists of six color-coded accuracy heatmaps, N-GCN and N-SAGE for each of Citeseer, Cora, and Pubmed, indexed by K (rows) and r (columns); the per-cell values are omitted here.]

Table 4.4: Node classification accuracy (in %) for our largest dataset (Pubmed) as we vary the size of the training data, |V_L|/C ∈ {5, 10, 20, 100}. We report means and standard deviations over 10 runs. We use a different random seed for every run (i.e., selecting different labeled nodes), but the same 10 random seeds across models. Convolution-based methods (e.g., SAGE) work well with few training examples, but unmodified random-walk methods (e.g., DCNN) work well with more training data. Our methods combine convolution and random walks, making them work well in both conditions.

    Method \ Nodes per class      5            10           20           100
    DCNN                     63.0 ± 1.0   72.3 ± 0.4   79.2 ± 0.2   82.6 ± 0.3
    GCN                      64.6 ± 0.3   70.0 ± 3.7   79.1 ± 0.3   81.8 ± 0.3
    SAGE                     69.0 ± 1.4   72.0 ± 1.3   77.2 ± 0.5   80.7 ± 0.7
    N-GCN_a                  65.1 ± 0.7   71.2 ± 1.1   79.7 ± 0.3   83.0 ± 0.4
    N-GCN_fc                 65.0 ± 2.1   71.7 ± 0.7   79.7 ± 0.4   82.9 ± 0.3
    N-SAGE_a                 66.9 ± 0.4   73.4 ± 0.7   79.0 ± 0.3   82.5 ± 0.2
    N-SAGE_fc                70.7 ± 0.4   74.1 ± 0.8   78.5 ± 1.0   81.8 ± 0.3

[Figure 4.2 plots accuracy (mean ± std) against the percentage of features removed (10% to 90%) for N-GCN_fc, N-GCN_a, N-SAGE_fc, DCNN, GCN, and SAGE.] Figure 4.2: Classification accuracy for the Cora dataset with 20 labeled nodes per class (|V_L| = 20C), but features removed at random, averaging 10 runs.
We use a different random seed for every run (i.e., removing different features per node), but the same 10 random seeds across models.

[Figure 4.3 plots the attention weights m_0, ..., m_5 of N-GCN_a against the percentage of features removed (10% to 90%).] Figure 4.3: Attention weights (m) for N-GCN_a when trained with the feature-removal perturbation on the Cora dataset. Removing features shifts the attention weights to the right, suggesting the model relies more on long-range dependencies.

Table 4.2 shows that N-GCN outperforms GCN [Kipf and Welling, 2017] and N-SAGE improves on SAGE for all datasets, showing that unmodified random walks indeed help in semi-supervised node classification. Finally, our proposed models achieve state-of-the-art on all datasets.

4.3.5 Sensitivity Analysis

We analyze the impact of random walk length K and replication factor r on classification accuracy in Table 4.3. In general, model performance improves when increasing K and r. We note that utilizing random walks by setting K > 1 improves model accuracy due to the additional information, not due to increased model capacity: contrast K = 1, r > 1 (i.e., a mixture of GCNs with no random walks) with K > 1, r = 1 (i.e., N-GCN on random walks). In both scenarios the model has more capacity, but the latter shows better performance. The same conclusion holds for SAGE.

Table 4.5: Performance of deeper GCN and SAGE models, both using our implementation. Deeper GCN (or SAGE) does not consistently improve classification accuracy, suggesting that N-GCN and N-SAGE are more performant and easier to train: they use shallower convolution models that operate on multiple scales of the graph.

                          Graph conv layer dimensions
    Dataset    Model     64-C      64-64-C    64-64-64-C
    Citeseer   GCN       0.699      0.632       0.659
    Citeseer   SAGE      0.668      0.660       0.674
    Cora       GCN       0.803      0.800       0.780
    Cora       SAGE      0.761      0.763       0.757
    Pubmed     GCN       0.762      0.771       0.781
    Pubmed     SAGE      0.770      0.776       0.775
    PPI        GCN       0.460      0.461       0.466
    PPI        SAGE      0.658      0.672       0.650

4.3.6 Tolerance to Feature Noise

We test our method under feature noise perturbations by removing node features at random. This is practical, as article authors might forget to include relevant terms in the article abstract, and, more generally, not all nodes will have the same amount of detailed information. Fig. 4.2 shows that when features are removed, methods utilizing unmodified random walks (N-GCN, N-SAGE, and DCNN) outperform convolutional methods including GCN and SAGE. Moreover, the performance gap widens as we remove more features. This suggests that our methods can somewhat recover removed features by directly pulling in features from nearby and distant neighbors.

We visualize in Fig. 4.3 the attention weights as a function of features removed. With little feature removal, there is some weight on Â^0, and the attention weights for Â^1, Â^2, ... decay. Maliciously dropping features causes our model to shift its attention weights towards higher powers of Â.

4.3.7 Random Walk Steps versus GCN Depth

A K-step random walk allows every node to accumulate information from its neighbors up to distance K. Similarly, a K-layer GCN [Kipf and Welling, 2017] will do the same. The former averages node feature vectors according to random-walk co-visit statistics, whereas the latter creates non-linearities and matrix multiplications at every step. So far, we have displayed experiments where our models (N-GCN and N-SAGE) are able to use information from distant nodes (e.g., K = 5), but all GCN and SAGE modules use 2 GCN layers, both for baselines and for our models.
Even though the authors of GCN [Kipf and Welling, 2017] and SAGE [Hamilton et al., 2017] suggest using two GCN layers, chosen by holdout validation, for a fair comparison with our models we also run experiments utilizing deeper GCN and SAGE models, so that their "receptive field" is comparable to ours. Table 4.5 shows test accuracies when training deeper GCN and SAGE models, using our implementation. We notice that, unlike our method, which benefits from a wider "receptive field", there is no direct correspondence between depth and improved performance.

Chapter 5

MixHop Model: Layer-wise mixing Graph Convolution Network

MixHop learns higher-order message passing, where nodes receive latent representations from their immediate (first-degree) neighbors and from further N-degree neighbors at every message passing step. In this chapter, we motivate and detail a model with trainable aggregation parameters that can choose how to mix latent information from neighbors at various distances. The contents of this chapter can be found standalone in [Abu-El-Haija et al., 2019] and the source code for reproducing the results is available at https://github.com/samihaija/mixhop

Mixing messages from distant and nearby neighbors gives MixHop an advantage in learning representations. Specifically, the popular vanilla GCN model of [Kipf and Welling, 2017] cannot learn feature detectors that fire on differences in node features. To mention two examples of why this lack of representation might be sub-optimal: (1) in social networks, it might be useful to represent users on social boundaries differently (e.g., users whose many immediate friends are "German speakers" while their friends-of-friends are "English speakers"); and (2) if we represent images as graphs, applying the vanilla GCN would fail to learn Gabor-filter-like patterns, which are evidently useful in the human visual cognitive system [Daugman, 1980, 1985].

5.1 MixHop Model Intuition

We build intuition for the MixHop model by pointing out its advantage over vanilla GCN models (e.g., that of [Kipf and Welling, 2017]). Specifically, the MixHop model is able to learn a class of features that the vanilla GCN cannot. This class of features encapsulates Gabor filters, which are neurologically important for the visual cortex system of humans. We mathematically formalize the class of features as follows, defining a Delta Operator: a subtraction operation between node features collected from different distances.

Definition 1. Representing Two-hop Delta Operator: A model is capable of representing a two-hop Delta Operator if there exists a setting of its parameters and an injective mapping f, such that the output of the network becomes

    f\Big( \sigma(\hat{A} X) - \sigma(\hat{A}^2 X) \Big),        (5.1)

given any adjacency matrix Â, features X, and activation function σ.

Learning such an operator should allow models to represent feature differences among neighbors, which is necessary, for example, for learning Gabor-like filters on the graph manifold. To provide a concrete example regarding graphs, consider an online social network. In this setting, Delta Operators allow a model to represent users that live around the "boundary" of social circles [Perozzi and Akoglu, 2018]. Consider learning an approximate feature for an American person with a popular German friend: most of this person's immediate friends might speak English, but many friends-of-friends speak German. This person can be represented by learning a convolutional filter contrasting the English and German languages of one-hop and two-hop neighbors.
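To make the operator concrete, here is a minimal numpy sketch of the two-hop Delta Operator of Definition 1, up to the mapping f; act is a placeholder for the activation σ, and the placement of the activation follows the reading of Eq. 5.1 above.

    import numpy as np

    def two_hop_delta(A_hat, X, act=np.tanh):
        # Contrast features aggregated from one-hop neighbors with features
        # aggregated from two-hop neighbors (Definition 1, up to the mapping f).
        AX = A_hat @ X              # one-hop aggregation
        AAX = A_hat @ AX            # two-hop aggregation, A_hat^2 X without forming A_hat^2
        return act(AX) - act(AAX)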
Note that Definition 1 does not require learning the direct form of the two-hop Delta Operator, but only a transformation of it, as long as that transformation can be inverted (i.e., f is injective).

Figure 5.1: Vanilla GC layer (a), using the adjacency Â, versus our proposed GC layer (b), using powers of Â (here Â^0, Â^1, Â^2 with weights W_0^{(i)}, W_1^{(i)}, W_2^{(i)} applied to the input H^{(i-1)}). Orange denotes an input activation matrix, with one row per node; green denotes the trainable parameters; and red denotes the layer output. Left- vs. right-multiplication is specified by the relative position of the multiplicand to the operator.

In Sections 5.2 through 5.2.2, we analyze the extent to which various GCN models can learn the Delta Operator. We generalize this definition and analysis in Section 5.2.3.

5.2 MixHop Model Construction

We propose replacing the Graph Convolution (GC) layer defined in Equation 3.7 with:

    H^{(i+1)} = \Big\Vert_{j \in P} \; \sigma\Big( \hat{A}^j H^{(i)} W_j^{(i)} \Big),        (5.2)

where the hyper-parameter P is a set of integer adjacency powers, Â^j denotes the adjacency matrix Â multiplied by itself j times, and ‖ denotes column-wise concatenation. The difference between our proposed layer and a vanilla GC layer is shown in Figure 5.1. Note that setting P = {1} exactly recovers the original GC layer. Further, note that Â^0 is the identity matrix I_n, where n is the number of nodes in the graph. We depict a model with P = {0, 1, 2} in Figure 5.1b.

In our model, each layer contains |P| distinct parameter matrices, each of which can have a different size. By default, we set all |P| matrices to have the same dimensionality; however, in Section 5.2.4.2, we explain how we utilize sparsifying regularizers on the learnable weight matrices to produce dataset-specific model architectures that slightly outperform our default settings.

Algorithm 6 MixHop Graph Convolution Layer
Inputs: H^{(i-1)}, Â
Parameters: {W_j^{(i)}}_{j ∈ P}
    j_max := max P
    B := H^{(i-1)}
    for j = 1 to j_max do
        B := Â B
        if j ∈ P then
            O_j := B W_j^{(i)}
    H^{(i)} := ‖_{j ∈ P} O_j
    Return: H^{(i)}

5.2.1 Computational Complexity

There is no need to calculate Â^j, which might be a dense matrix, i.e., with a quadratic number of non-zero entries. We calculate Â^j H^{(i)} via right-to-left multiplication. Specifically, if j = 3, we calculate Â^3 H^{(i)} as Â(Â(Â H^{(i)})). Since we store Â as a sparse matrix with m non-zero entries, an efficient implementation of our layer (Equation 5.2) takes O(j_max · m · s_i) computational time, where j_max is the largest element of P and s_i is the feature dimension of H^{(i)}. Under the realistic assumptions j_max ≪ m and s_l ≪ m, running an l-layer model takes O(l m) computational time. This matches the computational complexity of the vanilla GCN.

5.2.2 Representational Capability

Since each layer outputs the multiplication of different adjacency powers in different columns, the next layer's weights can learn arbitrary linear combinations of the columns. By assigning a positive coefficient to a column produced by some power of Â, and a negative coefficient to another, the model can learn a Delta Operator. In contrast, vanilla GCNs are not capable of representing this class of operations, even when stacked over multiple layers.

Theorem 1. The vanilla GCN defined by Equation 3.7 is not capable of representing two-hop Delta Operators.

Theorem 2. MixHop GCN (using layers defined in Equation 5.2) can represent two-hop Delta Operators.

Proof of Theorem 1.
The output of anl-layer vanilla GCN has the following form: ( b A(( b A( b AXW (0) ) )W (l2) )W (l1) ): For the simplicity of the proof, let’s assume that8i;s i = n. In a particular case, when (x) = x and X = I n , this reduces to b A l W , whereW = W (0) W (1) W (l1) . Suppose the network is capable of representing a two-hop Delta Operator. This means that there exists an injective mapf and a value for W , such that8 b A; b A l W =f( b A b A 2 ). Setting b A =I n , we get thatW =f(0). Let b C 1;2 , 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 0:5 0:5 0 0 0:5 0:5 0 0 0 0 1 0 . . . . . . . . . . . . . . . 0 0 0 1 3 7 7 7 7 7 7 7 7 7 7 7 7 7 7 5 be the symmetrically normalized adjacency matrix with self-connections corresponding to the graph hav- ing a single edge between vertices 1 and 2. Setting b A = b C 1;2 , we get b C 1;2 W = f(0). Since we already have thatf(0) =W , we get that (I n b C 1;2 )W = 0, which proves that thew 1 =w 2 , wherew i is the i-th row ofW . Since the choice of vertices 1 and 2 was arbitrary, we have that all rows ofW are equal to each other. Therefore, rank( b A l W ) 1, which implies that outputs of mappingf should be at most rank-one matrices. Thus,f cannot be injective, proving that vanilla GCN cannot represent two-hop Delta 58 Operators. ProofofTheorem2. A two-layer model, dened using Equation 5.2 withP =f0; 1; 2g recovers the two-hop delta operator dened in Equation 5.1. We start by redening the feature vectorH (1) learned by the rst layer of the model by pulling out the element-wise activation function and expanding the concatenation operator found in the layer denition: H (1) = j2f0;1;2g b A j XW (0) j = 0 B B @ j2f0;1;2g b A j XW (0) j 1 C C A = I n XW (0) 0 b AXW (0) 1 b A 2 XW (0) 2 ; We can now setW (0) 0 = 0 (zero matrix) andW (0) 1 =W (0) 2 =I s 0 . The expression above can be simplied toH (1) = 0 b AX b A 2 X . The feature vectorH (1) can be plugged into the equation for the second layer that has linear activation function: H (2) = I n H (1) W (1) 0 b AH (1) W (1) 1 b A 2 H (1) W (1) 2 : Setting the weights for the second layer asW (1) 1 =W (1) 2 =0, and W (1) 0 = 2 6 6 6 6 6 6 4 0 I s 0 I s 0 3 7 7 7 7 7 7 5 ; (5.3) 59 makes H (2) = b AX b A 2 X 0 0 : This shows that our GCN can successfully represent the two-hop Delta Operators according to the Denition 1. 5.2.3 GeneralNeighborhoodMixing We generalize Denition 1 from two-hops to multiple hops: Denition2. General layer-wise Neighborhood Mixing: A Graph Convolutional Network is capable of rep- resenting layer-wise neighborhood mixing if for any scalars 0 ; 1 ;:::; m , there exists a setting of its pa- rameters and an injective mappingf, such that the output of the network becomes equal to f 0 @ m X j=0 j b A j X 1 A (5.4) forany adjacency matrix b A, featuresX, and activation function. Theorem3. GCNs dened using Equation 3.7 are not capable of representing general layer-wise neighbor- hood mixing. Theorem 4. GCNs dened using our proposed method (Equation 5.2) are capable of representing general layer-wise neighborhood mixing. Proof of Theorem 3. This trivially follows from Theorem 1: if the vanilla GCN cannot recover a two-hop Delta Operator, dened in Equation 5.1, it cannot recover the Delta Operator generalization in Equation 5.4. 60 Proof of Theorem 4. The proof steps closely resemble the proof of Theorem 2. 
Our GCN with P =f0;:::;mg can represent the target function, by setting the rst layer weight matrices asW (0) j = I s 0 ;8j2P and setting all but the zeroth second layer weight matrices asW (1) 1 =W (1) 2 = =W (1) m = 0. In other words, we utilize only zero-hops in the second layer, setting the zeroth-power weight matrix the following way: W (1) 0 = 2 6 6 6 6 6 6 4 0 I s 0 . . . m I s 0 3 7 7 7 7 7 7 5 (5.5) This setting of parameters exactly recover the expression in Equation 5.4, for any adjacency matrix b A and featuresX. We note that the generalized Delta Operator in Denition 2 does not explicitly specify feature dif- ferences as in Denition 1; rather, the generalized form denes linear combinations of features (which includes subtraction). 5.2.4 LearningGraphConvolutionArchitectures We have discussed a single layer of our model. In practice, one would stack multiple layers and interleave them with standard neural operators such as BatchNorm Ioe and Szegedy [2015], element-wise activation, and Dropout Srivastava et al. [2014]. In this section, we discuss approaches to turning the MixHop GC layer into a MixHop GCN. 5.2.4.1 OutputLayer The nal layer of a GCN performs a key role for learning the learned latent space of the model on the dataset that is being trained on. As MixHop uniquely mixes features from dierent sets of information, we theorized that constraining the output layer may result in better outcomes for dierent tasks. In order to 61 leverage this property, we dene our output layer in the following way: We divides l columns into sets of sizec and compute e H = P s l =c k=1 q k H (l) ;(id l =c : (i+1)s l =c) , thenH = softmax( e H). Here the subscript onH (l) selectsc contiguous columns and the scalarsq k 2 [0; 1] dene a valid distribution (output of a softmax). This results in the model being forced to choose which features it wants to prioritize by putting more weight on that feature. We obtain the model parametersW (j) i for alli;j andq 1 ;:::q s l c , by minimizing cross-entropy loss, measured only on nodes with known labels i.e. similar to [Kipf and Welling, 2017]. 5.2.4.2 LearningAdjacencyPowerArchitectures As mentioned, our model learns multiple weight matrices W (i) j , one per adjacency power used in the model. By default, we set all W (i) j to be the same size, which eectively assigns the same capacity to adjacency powers b A j for allj2 P . We intuit that dierent sizes ofW (i) j may be more appropriate for dierent tasks and datasets; as such, we are interested in learning how to automatically sizeW (i) j . For vanilla GCNs, such an architecture search is relatively inexpensive - the parameters are the number of layers and their widths. In contrast, searching over the architecture space of our model is multiplica- tivelyO(ljPj) more expensive, as each architecture involves choices on how to divide each layer width s i among the adjacency powers. To address this limitation, we propose using a lasso regularization to automatically learn an architecture for our model. In particular, we train our architecture in stages: 1. Construct a wide network (e.g. 200 dimensions for each adjacency power, at each layer), only making choices on the depth. 2. Train the network on the task while applying L2 Group Lasso regularization over each column of eachW (l) j . This will drop values of entire columns (close) to zero. 3. At the peak validation accuracy, measure the L2 norm of eachW (l) j . 
Pick a threshold, and count the number of columns in eachW (l) j with norm higher than the threshold. In our experiments, we 62 pick a threshold such that the size of the shrunken model equals size of our baseline model (i.e. with P =f1g). 4. Shrink the weight matrices by removing columns with norms below thek’th percentile. 5. Substitute L2 Group Lasso with standard L2 regularization. Restart training. We discuss the learned architectures in Section 5.3.3.3. 5.3 MixhopExperiments Given description of MixHop, a number of natural questions may arise. In this section, we aim to design experiments which answer the following hypothesises: • H1: The MixHop model learns delta operators. • H2: Higher order graph convolutions using neighborhood mixing can outperform existing ap- proaches (e.g. vanilla GCNs) on real semi-supervised learning tasks. • H3: When learning a model architecture for Mixhop the best performing architectures dier for each graph. To answer these questions, we design three experiments. • SyntheticExperiments: This experiment uses a family of synthetic graphs which allow us to vary the correlation (or homophily) of the edges in a generated graph, and observe how dierent graph convolutional approaches respond. As homophily is decreased in the network, nodes are more likely to connect to those with dierent labels, and a model that better captures delta operators should have superior performance. • Real-WorldExperiments: This experiment evaluates MixHop’s performance on a variety of noisy real world datasets, comparing against challenging baselines. 63 • ModelVisualizationExperiment: This experiment shows how an appropriately regularized Mix- Hop model can learn dierent, task-dependent, architectures. 5.3.1 Datasets We conduct semi-supervised node classication experiments on synthetic and real-world datasets. SyntheticDatasets: Our synthetic datasets are generated following Karimi et al. [2017]. We generate 10 graphs, each with a dierent homophily coecient (ranging from 0.0 to 0.9 at 0.1 intervals) that indicates the likelihood of a node forming a connection to a neighbor with the same label. For example, a node in the homophily = 0:9 graph with 10 edges, will have on average 9 edges to a same-label neighbor. All graphs contain 5000 nodes. The features for all synthetic nodes were sampled from overlapping multi-Gaussian distributions. We randomly partition each graph into train, test, and validation node splits, all of equal size. See Appendix of our main paper [Abu-El-Haija et al., 2019] for more information. Real World Datasets: The experiments with real-world datasets follow the methodology proposed in Yang et al. [2016a]. In addition to using the classic dataset split, (which have 20 samples per label), we evaluate against against a set of random splits with 100 samples per label. We will release our test splits. 5.3.2 Training For all experiments, we construct a 2-layer network of our model using TensorFlow [Abadi et al., 2016]. We train our models using a Gradient Descent optimizer for a maximum of 2000 steps, with an initial learning rate of 0.05 that decays by 0.0005 every 40 steps. We terminate training if validation accuracy does not improve for 40 consecutive steps; as a result, most runs nish in less than 200 steps. We use 5 10 4 L2 regularization on the weights, and dropout input and hidden layers. 
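For concreteness, the MixHop graph-convolution layer trained above (Equation 5.2 / Algorithm 6) can be sketched as follows. This is a simplified NumPy/SciPy illustration rather than the released TensorFlow implementation; the function and variable names are illustrative, and the random sparse matrix below is only a stand-in for the normalized adjacency.

import numpy as np
import scipy.sparse as sp

def mixhop_layer(H, A_hat, weights, powers=(0, 1, 2)):
    # Computes  ||_{j in P}  A_hat^j H W_j   (Eq. 5.2), right-to-left:
    # A_hat^j H is built by repeatedly left-multiplying H by the sparse A_hat,
    # so the (possibly dense) matrix A_hat^j is never materialized.
    outputs, B = [], H
    for j in range(max(powers) + 1):
        if j > 0:
            B = A_hat @ B                      # sparse-dense product, O(m * s_i)
        if j in powers:
            outputs.append(B @ weights[j])     # per-power trainable matrix W_j
    return np.concatenate(outputs, axis=1)     # column-wise concatenation

# Hypothetical usage on a toy graph:
rng = np.random.default_rng(0)
n, s_in, s_out = 5, 8, 4
A_hat = sp.random(n, n, density=0.4, format="csr", random_state=0)
H0 = rng.standard_normal((n, s_in))
W = {j: rng.standard_normal((s_in, s_out)) for j in (0, 1, 2)}
H1 = np.maximum(mixhop_layer(H0, A_hat, W), 0.0)   # element-wise ReLU between layers
print(H1.shape)                                    # (5, 12): width s_out * |P|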
We note that the citation datasets are extremely sensitve to initializations; as such, we run all models 100 times, sort by the validation accuracy, and nally report the test accuracy for the top 50 runs. For all models we ran (our models in Tables 1 64 Table 5.1: Experiments run on Node Classication citation datasets created by Yang et al. [2016a]. y The learned architecture for Citeseer is equivalent to default architecture, so the results are the same. Model Citeseer Cora Pubmed ManiReg [Belkin et al., 2006] 60:1 59:5 70:7 SemiEmb [Weston et al., 2012] 59:6 59:0 71:1 LP [Zhu et al., 2003] 45:3 68:0 63:0 DeepWalk [Perozzi et al., 2014] 43:2 67:2 65:3 ICA [Lu and Getoor, 2003] 69:1 75:1 73:9 Planetoid [Yang et al., 2016a] 64:7 75:7 77:2 Vanilla GCN [Kipf and Welling, 2017] 70:3 81:5 79:0 MixHop withP =f1g (baseline) 70:70:73 81:10:84 79:90:78 MixHop: default architecture (ours) 71:40:81 81:80:62 80:01:1 MixHop: learned architecture (ours) 71:40:81 y 81:90:40 80:80:58 & 3, and all models in Table 3), we use a latent dimension of 60; Our default architecture evenly divided 60 dimensions are divided evenly to alljPj powers. Our learned architectures spread them unevenly, see Section 5.3.3.3. 5.3.3 ExperimentalResults 5.3.3.1 ResultsonSyntheticGraphs We present our results on the synthetic datasets in Figure 5.2b. We show average accuracy for each baseline against the homophily of the graph. We use a dense (MLP) model that does not ingest any adjacency information as a control. As expected, all models perform better as the homophily of the synthetic graph increases. At low levels of homophily, when nodes are rarely adjacent to neighbors with the same label, we observe that MixHop performs signicantly better than the most competitive baseline. Interestingly, we notice that the GAT model performs signicantly worse than the features-only control. This suggests that the added attention mechanism of the GAT model relies heavily on homophily in node neighborhoods. For each level of homophily, we measured the number of delta operators learned by our model. We present these metrics in Figure 5.2a. We observe that for low levels of homophily, our model uses 2.5X of its 65 Table 5.2: Dataset statistics. Numbers of nodes (n), edges (m), features, classes (c), and labeled nodes (jV P I j from the Planetoid splits,jV R I j from our random splits). Dataset nodes edges features c jV P I j jV R I j Citeseer 3,327 4,732 3,703 6 120 600 Cora 2,708 5,429 1,433 7 140 700 Pubmed 19,717 44,338 500 3 60 300 Table 5.3: Classication results on random partitions of [Yang et al., 2016a] datasets. Model Citeseer Cora Pubmed 2-Layer MLP 70:61 69:01:1 78:30:54 Chebyshev [Deerrard et al., 2016] 74:20:5 85:50:4 81:80:5 Vanilla GCN [Kipf and Welling, 2017] 76:70:43 86:10:34 82:20:29 GAT [Veličković et al., 2018] 74:80:42 83:01:1 81:80:18 MixHop: default architecture (ours) 76:30:41 87:00:51 83:60:68 MixHop: learned architecture (ours) 77:00:54 87:20:32 83:80:44 model capacity on learning delta operators compared with higher homophily. This follows intuition: as the nodes cluster around like-labeled neighbors, the need to identify meaningful feature dierences between neighbors at dierent distances drops signicantly. These results strongly suggest that the learned delta operators play a role in the success of MixHop in Figure 5.2b. For this experiment, we trained our model over the synthetic datasets under one constraint: input layer weightsW (j) 0 are shared across all powers j2 P . 
This allows us to examine sub-columns in the following layer's weights W_j^{(1)}. Specifically, we count the number of times a feature, coming out of the first layer, is assigned values of opposite signs in W_j^{(1)}. We restrict the analysis to only values of W_j^{(1)} with magnitude larger than the median of the corresponding column.
5.3.3.2 Node Classification Results
We show two sets of semi-supervised node classification results using different splits of our datasets. Because these datasets are taken from the real world, they are inherently noisy, and it is unlikely that achieving 100% classification accuracy is possible even when given a significant amount of labeled training data.
Figure 5.2: Experiments on synthetic homophily datasets. (a) Amount of model capacity devoted to learning delta operators at different levels of homophily (x-axis: homophily; y-axis: number of learned delta operators). (b) Synthetic dataset results (x-axis: homophily; y-axis: test accuracy); MLP does not utilize the graph's (homophilic) edges, only node features.
Instead, we are interested in the sparse classification task, namely how well our model is able to improve on previous work while remaining resilient to noise, even with limited information. In Table 5.1, we demonstrate how our model performs on the common splits taken from Yang et al. [2016a]. Accuracy numbers above the double-line are copied from Kipf and Welling [2017]. Numbers below the double-line are our methods, with P = {1} being equivalent to vanilla GCNs. The ± represents the standard deviation over 50 runs with different random initializations. All MixHop models are of the same capacity. These splits utilize only 20 labeled nodes per class during training. We achieve test accuracies of 71.4%, 81.9%, and 80.8% on Citeseer, Cora, and Pubmed, respectively. Interestingly, for Citeseer, the learned architecture was equal to the default architecture (and so the two models performed the same). In Table 5.3, we demonstrate how our model performs using random splits with more training information available. These splits utilize 100 labeled nodes per class during training. We achieve test accuracies of 77.0%, 87.2%, and 83.8% on Citeseer, Cora, and Pubmed, respectively. As MixHop is able to pull in linear combinations of features from farther distances, it can extract meaningful signals in extremely sparse settings. We believe this explains why MixHop outperforms baseline methods in both sets of dataset splits. The results of these experiments confirm our hypothesis (H2) that
[Figure panels: (a) Pubmed — layer-1 weight matrices of widths 17/23/20 for adjacency powers 0/1/2 (layer output width 60), layer-2 widths 3/6/3 (output width 12); (b) Cora — layer-1 widths 24/18/18 (output width 60), layer-2 widths 0/7/7 (output width 14).]
Figure 5.3: Learned MixHop Architectures.
Note how dierent parameter sizes (green boxes) are learned for the two datasets. For example, Group-Lasso regularization on Cora removes all capacity for the zeroth power in the second GC layer. For space, all matrices are plotted transposed and output layer (Section 5.2.4.1) has been ommitted. higher order graph convolution methods with neighborhood mixing can outperform existing methods on real datasets. 5.3.3.3 VisualizingLearnedArchitectures Figure 5.3 depicts the learned architectures for two of the citation datasets. We note that each dataset prefers its own architecture. For example, Cora prefers to have zero-capacity on the 0th power of the adjacency matrix (eectively ignoring the features of each node) in the second layer. Not shown (for space 68 reasons) is Citeseer, which prefers the default parameter settings with the same weight capacity across all powers. All three real datasets had dierent nal architectures, which conrms our hypothesis (H3) that dierent architectures are optimal for dierent graph datasets. 69 PartIII-Human&Machineeciency: Fewer&FasterExperiments This part of the thesis focuses on the eciency of the practitioner and the machine. In particular, it is com- mon for practitioners to run a variety of experiments on their graph dataset, where each experiment utilizes a dieret set of hyperparameters, model architecture, sampling parameters, random seed, etc. The goal of improving human eciency is to reduce the number of required experiments, for obtaining state-of- the-art models. Chapters 6 & 7 remove essential hyperparameters from the training process. Specically, Chapter 6 replaces sampling hyperparameters ∗ with trainable parameters. Further, Chapter 7 removes hy- perparameters of the training algorithm † as it proposes linearized Graph Neural Networks (GNNs) where optimal model parameters can be estimated in closed-form. The goal of improving machine eciency is to enable training GNNs using less memory, faster and/or using less computational resources, e.g., lower-power or cheaper hardware. Chapter 7 can train GNNs in-closed forms, by computing the decomposition of large matrices but without computing the matrices themselves. Therefore, training is achieved without calculating gradients ‡ . Then, Chapter 8 shows that parameters calculated per Chapter 7 can be used to initialize deeper GNNs, which require only a handful of ne-tuning iterations to obtain state-of-the-art empirical performance. ∗ E.g., length of random walk, the probability of sampling a node pair given their distance † Choice of optimization algorithm,e.g., Adam [Ba and Kingma, 2015] and its hyperparameters, such as the learning rate, rst and second scalar moments ‡ Backprop Algorithm for gradient calculation is expensive, as it keeps all-layer activations in memory. 70 Chapter6 Watch-Your-StepModelforLearningtheContextDistribution Node embedding methods represent nodes in a continuous vector space, preserving dierent types of relational information from the graph. Embedding methods that utilize random walks are accompanied by hyper-parameters,e.g., the length of a random walk, which have to be manually tuned for every graph. In this chapter, we replace previously xed hyper-parameters with trainable ones that we automatically learn via backpropagation. In particular, we propose a novel attention model on the power series of the graph transition matrixT . This model can be thought as guidance to the random walk, optimized on an upstream objective. 
Unlike previous approaches to attention models, the method that we propose utilizes attention parameters exclusively on the data itself (e.g. on the random walk), and are not used by the model for inference. We experiment on link prediction tasks, as we aim to produce embeddings that best- preserve the graph structure, generalizing to unseen information. We improve state-of-the-art results on a comprehensive suite of real-world graph datasets including social, collaboration, and biological networks, where we observe that our graph attention model can reduce the error by up to 20%-40%. We show that our automatically-learned attention parameters can vary signicantly per graph, and correspond to the optimal choice of hyper-parameter if we manually tune existing methods. The contents of this chapter can be found standalone in [Abu-El-Haija et al., 2018] and the source-code for reproducing the results is linked within. 71 6.1 Watch-Your-StepModelIntuition Unsupervised graph embedding methods seek to learn representations that encode the graph structure. These embeddings have demonstrated outstanding performance on a number of tasks including node clas- sication [Perozzi et al., 2014, Grover and Leskovec, 2016], knowledge-base completion [Luo et al., 2015], semi-supervised learning [Yang et al., 2016a], and link prediction [Abu-El-Haija et al., 2017]. In general, as introduced by Perozzi et al Perozzi et al. [2014], these methods operate in two discrete steps: First, they sample pair-wise relationships from the graph through random walks and counting node co-occurances. Second, they train an embedding model e.g. using Skipgram of word2vec Mikolov et al. [2013], to learn representations that encode pairwise node similarities. While such methods have demonstrated outstanding results on a number of tasks, their empirical performance can signicantly vary based on the setting of their hyper-parameters. For example, Perozzi et al. [2014] observed that the quality of learned representations is dependent on the length of the random walk (C). In practice, DeepWalk Perozzi et al. [2014] and many of its extensions [e.g. Grover and Leskovec, 2016] use word2vec implementations Mikolov et al. [2013]. Accordingly, it has been revealed by Levy et al. [2015] that the hyper-parameterC, refered to astrainingwindowlength in word2vec Mikolov et al. [2013], actually controls more than a xed length of the random walk. Instead, it parameterizes a function, we term the context distribution and denoteQ, which controls the probability of sampling a node-pair when visited within a specic distance ∗ . Implicitly, the choices of C andQ, create a weight mass on every node’s neighborhood. In general, the weight is higher on nearby nodes, but the specic form of the mass function is determined by the aforementioned hyper-parameters. In this work, we aim to replace these hyper-parameters with trainable parameters, so that they can be automatically learned for each graph. To ∗ To clarify, as noted by Levy et al Levy et al. [2015] – studying the implementation of word2vec reveals that rather than using C as constant and assuming all nodes visited within distanceC are related, a desired context distanceci is sampled from uniform (ciUf1;Cg) for each node pairi in training. If the node pairi was visited more thanci-steps apart, it is not used for training. Many DeepWalk-style methods inherited this context distribution, as they internally utilize standard word2vec implementations. 
72 do so, we pose graph embedding as end-to-end learning, where the (discrete) two steps of random walk co- occurance sampling, followed by representation learning, are joint using a closed-form expectation over the graph adjacency matrix. Our inspiration comes from the successful application of attention models in domains such as Natural Language Processing (NLP) [e.g. Bahdanau et al., 2015, Yang et al., 2016b], image recognition [Mnih et al., 2014], and detecting rare events in videos [Ramanathan et al., 2016]. To the best of our knowledge, the approach we propose is signicantly dierent from the standard application of attention models. Instead of using attention parameters to guide the model where to look when making a prediction, we use attention parameters to guide our learning algorithm to focus on parts of the data that are most helpful for optimizing an upstream objective. We show the mathematical equivalence between the context distribution and the co-ecients of power series of the transition matrix. This allows us to learn the context distribution by learning an attention model on the power series. The attention parameters “guide” the random walk, by allowing it to focus more on short- or long-term dependencies, as best suited for the graph, while optimizing an upstream objective. To the best of our knowledge, this work is the rst application of attention methods to graph embedding. Specically, our contributions are the following: 1. We propose an extendible family of graph attention models that can learn arbitrary (e.g. non- monotonic) context distributions. 2. We show that the optimal choice of context distribution hyper-parameters for competing methods, found by manual tuning, agrees with our automatically-found attention parameters. 73 3. We evaluate on a number of challenging link prediction tasks comprised of real world datasets, including social, collaboration, and biological networks. Experiments show we substantially improve on our baselines, reducing link-prediction error by 20%-40%. 6.1.1 SideNote: AttentionModels We mention attention models that have appeared in computer vision, by the time of this paper’s writing [Abu-El-Haija et al., 2018], [e.g. Mnih et al., 2014, Ramanathan et al., 2016, Veličković et al., 2018], where an attention function is employed to suggest positions within the input example that the classication function should pay attention to, when making inference. This function is used during the training phase in the forward pass and in the testing phase for prediction. The attention function and the classier are jointly trained on an upstream objective e.g. cross entropy. In our case, the attention mechanism is only guides the learning procedure, and not used by the model for inference. Our mechanism suggests parts of the data to focus on, during training, as explained next. 6.2 Watch-Your-StepModelConstruction Following the embedding framework (Eq 3.3), we setg(Z) =g ([L j R]) =LR > andf(A) =E[ ], the expectation on co-occurrence matrix produced from simulated random walk. Using this closed form, we extend the the Negative Log Graph Likelihood (NLGL) loss (Eq. 3.6) to include attention parameters on the random walk sampling. 6.2.1 Expectationontheco-occurancematrix:E[ ] Rather than obtaining by simulation of random walks and sampling co-occurances, we formulate an expectation of this sampling, asE[ ]. In general. this allows us to tune sampling parameters living inside of the random walk procedure including number of stepsC. 
74 LetT be the transition matrix for a graph, which can be calculated by normalizing the rows ofA to sum to one. This can be written as: T =D 1 A: (6.1) Given an initial probability distributionp (0) 2R jVj of a random surfer, it is possible to nd the distribu- tion of the surfer after one step conditioned onp (0) asp (1) =p (0) > T and afterk steps asp (k) =p (0) > (T ) k , where (T ) k multiplies matrixT with itselfk-times. We are interested in an analytical expression forE[ ], the expectation over co-occurrence matrix produced by simulated random walks. A closed form expression for this matrix will allow us to perform end-to-end learning. In practice, random walk methods based on DeepWalk [Perozzi et al., 2014] do not useC as a hard limit; instead, given walk sequence (v 1 ;v 2 ;::: ), they samplec i Uf1;Cg separately for each anchor nodev i and potential context nodes, and only keep context nodes that are withinc i -steps ofv i . In expectation then, nodesv i+1 ;v i+2 ;v i+3 ;::: , will appear as context for anchor nodev i , respectively with probabilities 1; 1 1 C ; 1 2 C ;::: . We can write an expectation onD2R jVjjVj : E DeepWalk ;C = C X k=1 Pr(ck) ~ P (0) (T ) k ; (6.2) which is parametrized by the (discrete) walk lengthC; where Pr(ck) indicates the probability of node with distancek from anchor to be selected; and ~ P (0) 2R jVjjVj is a diagonal matrix (the initial positions matrix), with ~ P (0) vv set to the number of walks starting at node v. Since Pr(c =k) = 1 C for all k = f1; 2;:::;Cg, we can expand Pr(ck) = P C j=k P (c =j), and re-write the expectation as: E D DeepWalk ;C = ~ P (0) C X k=1 1 k 1 C (T ) k : (6.3) 75 Eq. (6.3) is derived, step-by-step, in the Appendix. We are not concerned by the exact denition of the scalar coecient, 1 k1 C , but we note that the coecient decreases withk. Instead of keepingC a hyper-parameter, we want to analytically optimize it on an upstream objective. Further, we are interested to learn the co-ecients to (T ) k instead of hand-engineering a formula. As an aside, running the GloVe embedding algorithm [Pennington et al., 2014] over the random walk sequences, in expectation, is equivalent to factorizing the co-occurance matrix: E GloVe ;C = ~ P (0) C X k=1 1 k (T ) k : (6.4) 6.2.2 LearningtheContextDistribution We want to learn the co-ecients to (T ) k . Let the context distributionQ be supported on integers 1;:::;C, by deningQ as aC-dimensional vector asQ = (Q 1 ;Q 2 ; ;Q C ) withQ k 0 and P k Q k = 1. We assign co-ecientQ k to (T ) k . Formally, our expectation on is parameterized with, and is dierentiable w.r.t.,Q: E [ ;Q 1 ;Q 2 ;:::Q C ] = ~ P (0) C X k=1 Q k (T ) k = ~ P (0) E kQ [(T ) k ]; (6.5) Training embeddings over random walk sequences, using word2vec or GloVe, respectively, are special cases of Equation 6.5, withQ xed apriori asQ k = 1 k1 C orQ k / 1 k . 6.2.3 GraphAttentionModels To learnQ automatically, we propose an attention model which guides the random surfer on “where to attend to” as a function of distance from the source node. 
Specically, we dene a Graph Attention Model as a process which models a node’s context distributionQ as the output of softmax: (Q 1 ;Q 2 ;Q 3 ;::: ) = softmax((q 1 ;q 2 ;q 3 ;::: )); (6.6) 76 Dataset jVj jEj nodes edges wiki-vote 7; 066 103; 663 users votes ego-Facebook 4; 039 88; 234 users friendship ca-AstroPh 17; 903 197; 031 researchers co-authorship ca-HepTh 8; 638 24; 827 researchers co-authorship PPI [Stark et al., 2006] 3; 852 20; 881 proteins chemical interaction (a) Datasets used in our experiments: wiki-vote is di- rected but all others are undirected graphs. 1 2 3 4 5 6 7 8 9 10 C 0.9900 0.9905 0.9910 0.9915 0.9920 0.9925 ROC-AUC facebook 1 2 3 4 5 6 7 8 9 10 C 0.70 0.75 0.80 0.85 ppi 1 2 3 4 5 6 7 8 9 10 C 0.60 0.61 0.62 0.63 0.64 wiki-vote (b) Test ROC-AUC as a function of C using node2vec. Figure 6.1: In 6.1a we present statistics of our datasets. In 6.1b, we motivate our work by showing the necessity of setting the parameterC for node2vec (d=128, each point is the average of 7 runs). where the variables q k are trained via backpropagation, jointly while learning node embeddings. Our hypothesis is as follows. If we don’t impose a specic formula on Q = (Q 1 ;Q 2 ;:::Q C ), other than (regularized) softmax, then we can use very large values ofC and allow every graph to learn its own form ofQ with its preferred sparsity and own decay form. Should the graph structure require a smallC, then the optimization would discover a left-skewedQ with all of probability mass onfQ 1 ;Q 2 g and P k>2 Q k 0. However, if according to the objective, a graph is more accurately encoded by making longer walks, then they can learn to use a largeC (e.g. using uniform or even right-skewed Q distribution), focusing more attention on longer distance connections in the random walk. To this end, we propose to train softmax attention model on the innite power series of the transition matrix. We dene an expectation on our proposed random walk matrix softmax[1] as † : E h softmax[1] ; q 1 ;q 2 ;q 3 ;::: i = ~ P (0) lim C!1 C X k=1 softmax(q 1 ;q 2 ;q 3 ;::: ) k (T ) k ; (6.7) whereq 1 ;q 2 ;::: are jointly trained with the embeddings to minimize our objective. 77 Table 6.1: Results on Link Prediction Datasets. Shown is the ROC-AUC. Each row shows results for one dataset results on one dataset when training embedding with We bold the highest accuracy per dataset- dimension pair, including when the highest accuracy intersects with the mean standard deviation. We use the train:test splits of [Abu-El-Haija et al., 2017], hosted on http://sami.haija.org/graph/splits MethodsUse:A D E[D] Error Reduction Dataset dim Eigen Maps SVD DNGR n2v C = 2 n2v C = 5 Asym Proj Graph Attention (this chapter) 64 61:3 86:0 59:8 64:4 63:6 91:7 93:80:13 25.2% wiki-vote 128 62:2 80:8 55:4 63:7 64:6 91:7 93:80:05 25.2% ego-Facebook 64 96:4 96:7 98:1 99:1 99:0 97:4 99:40:10 33.3% 128 95:4 94:5 98:4 99:3 99:2 97:3 99:50:03 28.6% 64 82:4 91:1 93:9 97:4 96:9 95:7 97:90:21 19.2% ca-AstroPh 128 82:9 92:4 96:8 97:7 97:5 95:7 98:10:49 24.0% ca-HepTh 64 80:2 79:3 86:8 90:6 91:8 90:3 93:60:06 22.0% 128 81:2 78:0 89:7 90:1 92:0 90:3 93:90:05 23.8% 64 70:7 75:4 76:7 79:7 70:6 82:4 89:81:05 43.5% PPI 128 73:7 71:2 76:9 81:8 74:4 83:9 91:00:28 44.2% ego-Facebook ca-HepTh ca-AstroPh PPI wiki-vote 10 3 10 2 10 1 10 0 Attention Probability Mass softmax Q 1 Q 2 Q 3 Q 4 Q 5 (a) Learned Attention weightsQ (log scale). 
1 2 3 4 5 6 7 8 9 10 Q 0.0 0.5 1.0 Attention Probability Mass ego-Facebook 1 2 3 4 5 6 7 8 9 10 Q PPI 1 2 3 4 5 6 7 8 9 10 Q wiki-vote = 0.3 = 0.5 = 0.7 (b)Q with varying the regularization (linear scale). Figure 6.2: (a) shows learned attention weightsQ, which agree with grid-search of node2vec (Figure 6.1b). (b) shows how varying aects the learnedQ. Note that distributions can quickly tail o to zero (ego- Facebook and PPI), while other graphs (wiki-vote) contain information across distant nodes. 6.2.4 TrainingObjective The nal training objective for the Softmax attention mechanism, coming from the NLGL Eq. (3.6), min L;R;q jjqjj 2 2 + E[D;q] log (LR > ) 1[A = 0] log 1(LR > ) 1 (6.8) is minimized w.r.t attention parameter vector q = (q 1 ;q 2 ;::: ) and node embeddings L;R2 R jVj d 2 . Hyper-parameter2R applies L2 regularization on the attention parameters. We emphasize that our † We do not actually unroll the summation in Eq. 6.7 an innite number of times. Our experiments show that unrolling it 10 or 20 times is sucient to obtain state-of-the-art results. 78 attention parameters q live within the expectation over data D, and are not part of the model (L;R) and are therefore not required for inference. The constraint P k Q k = 1, through the softmax activation, preventsE[D softmax ] from collapsing into a trivial solution (zero matrix). 6.2.5 ComputationalComplexity The naive computation of (T ) k requiresk matrix multiplications and so isO(jVj 3 k). However, as most real-world adjacency matrices have an inherent low rank structure, a number of fast approximations to computing the random walk transition matrix raised to a powerk have been proposed [e.g. Tong et al., 2006]. Alternatively SVD can decomposeT asT =UV > and then thek th power can be calculated by raising the diagonal matrix of singular values tok as (T ) k =U() k V > sinceV > U =I. Furthermore, the SVD can be approximated in time linear to the number of non-zero entries [Halko et al., 2011]. Therefore, we can approximate (T ) k inO(jEj). In this work, we compute (T ) k without approximations. Our algo- rithm runs in seconds over the given datasets (at least 10X faster than node2vec [Grover and Leskovec, 2016], DVNE [?], DNGR [Cao et al., 2016]). We leave stochastic and approximation versions of our method as future work. 6.2.6 Extensions As presented, our proposed method can learn the weights of the context distribution Q. However, we briey note that such a model can be trivially extended to learn the weight of any other type of pair- wise node similarity (e.g. Personalized PageRank, Adamic-Adar, etc). In order to do this, we can extend the denition of the contextQ with an additional dimensionQ k+1 for the new type of similarity, and an additional element in the softmaxq k+1 to learn a joint importance function. 6.3 Watch-Your-StepExperiments 79 6.3.1 LinkPredictionExperiments We evaluate the quality of embeddings produced when random walks are augmented with attention, through experiments on link prediction [Liben-Nowell and Kleinberg, 2007]. Link prediction is a chal- lenging task, with many real world applications in information retrieval, recommendation systems and social networks. As such, it has been used to study the properties of graph embeddings [Perozzi et al., 2014, Grover and Leskovec, 2016]. Such an intrinsic evaluation emphasizes the structure-preserving prop- erties of embedding. Our experimental setup is designed to determine how well the embeddings produced by a method captures the topology of the graph. 
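For reference, the quantity being optimized — the expected co-occurrence matrix of Equations 6.5–6.7, with the context distribution given by a softmax over trainable logits — can be computed for small graphs as in the following NumPy sketch (an illustration only, with illustrative function and variable names; the released implementation is in TensorFlow):

import numpy as np

def expected_cooccurrence(A, q_logits, walks_per_node=80):
    # E[D; q] = P^(0) sum_k softmax(q)_k (T)^k, truncated at C = len(q_logits)  (Eqs. 6.5-6.7).
    n = A.shape[0]
    T = A / A.sum(axis=1, keepdims=True)       # row-normalized transition matrix (Eq. 6.1); assumes no isolated nodes
    Q = np.exp(q_logits - np.max(q_logits))
    Q = Q / Q.sum()                            # attention / context distribution (Eq. 6.6)
    E, T_power = np.zeros((n, n)), np.eye(n)
    for Q_k in Q:
        T_power = T_power @ T                  # (T)^k, computed incrementally
        E += Q_k * T_power
    return walks_per_node * E                  # P^(0) = diag(walks per node)

Gradients of the NLGL objective (Eq. 6.8) flow into both the embeddings (L, R) and the logits q; the experiments below evaluate how well the resulting embeddings recover held-out graph structure.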
We measure this in the manner of Grover and Leskovec [2016]: remove a fraction (=50%) of graph edges, learn embeddings from the remaining edges, and measure how well the embeddings can recover those edges which have been removed. More formally, we split the graph edgesE into two partitions of equal sizeE train andE test such that the training graph is connected. We also sample non existent edges ((u;v) = 2 E) to makeE train andE test . We use (E train , E train ) for training and model selection, and use (E test ,E test ) to compute evaluation metrics. Training: We train our models using TensorFlow, with PercentDelta optimizer [Abu-El-Haija, 2017]. For the results Table 6.1, we use = 0:5,C = 10, and ~ P (0) = diag(80), which corresponds to 80 walks per node. We analyze our model’s sensitivity in Section 6.3.2. To ensure repeatability of results, we have released our model and instructions ‡ . Datasets: Table 6.1a describes the datasets used in our experiments. Datasets available from SNAP https://snap.stanford.edu/data. Baselines: We evaluate against many baselines. For all methods, we calculateg(Z)2R jVjjVj , and extract entries fromg(Z) corresponding to positive and negative test edges, then use them to compute ROC AUC. We compare against following baselines. We mark symmetric models withy. Their counterparts, ‡ Available athttp://sami.haija.org/graph/context 80 asymmetric models including ours, can learng(Z) vu 6= g(Z) uv , which we expect to perform relatively better on the directed graph wiki-vote. –yEigenMaps [Belkin and Niyogi, 2003]. Minimizes Euclidean distance of adjacent nodes ofA. – SVD. Singular value decomposition ofA. Inference is through the functiong(Z) =USV, where (U;S;V) is a low-rank SVD decomposiiton corresponding to thed largest singular values. –yDNGR [Cao et al., 2016]. Non-linear (i.e. deep) embedding of nodes, using an auto-encoder onA. We use author’s code to learn the deep embeddingsZ and use for inferenceg(Z) =ZZ T . –yn2v: node2vec [Grover and Leskovec, 2016] is a popular baseline. It simulates random walks and uses word2vec to learn node embeddings. Minimizes objective in Eq. (3.4). For Table 6.1, we use author’s code to learn embeddingsY then useg(Y) = YY > . We run withC = 2 andC = 5. § – AsymProj [Abu-El-Haija et al., 2017]. Learns edges as asymmetric projections in a deep embedding space, trained by maximizing the graph likelihood (Eq. 3.5). Results: Our results, summarized in Table 6.1, show that our proposed methods substantially out- perform all baseline methods. Specically, we see that the error is reduced by up to 45% over baseline methods which have xed context denitions. This shows that by parameterizing the context distribution and allowing each graph to learn its own distribution, we can better preserve the graph structure (and thereby better predict missing edges). Discussion: Figure 6.2a shows how the learned attention weightsQ vary across datasets. Each dataset learns its own attention form, and the highest weights generally correspond to the highest weights when doing a grid search overC for node2vec (as in Figure 6.1b). The hyper-parameterC determines the highest power of the transition matrix, and hence the max- imum context size available to the attention model. We suggest using large values for C, since the at- tention weights can eectively use a subset of the transition matrix powers. For example, if a network § We sweepC in Figure 6.1b, showing that there are no good default forC that works best across datasets. 
(a) node2vec, Cora. (b) Graph Attention (ours), Cora. (c) Classification accuracy — Cora: n2v (C = 5) 63.1, Graph Attention (ours) 67.9; Citeseer: n2v (C = 5) 45.6, Graph Attention (ours) 51.5.
Figure 6.3: Node Classification. Fig. (a)/(b): t-SNE visualization of node embeddings for the Cora dataset. We note that both methods are unsupervised, and we have colored the learned representations by node labels. Fig. (c): however, quantitatively, our embeddings achieve better separation.
needs only 2 hops to be accurately represented, then it is possible for the softmax attention model to learn Q_3, Q_4, ... ≈ 0. Figure 6.2b shows how varying the regularization term allows the softmax attention model to "attend to" only what each dataset requires. We observe that for most graphs, the majority of the mass gets assigned to Q_1, Q_2. This shows that shorter walks are more beneficial for most graphs. However, on wiki-vote, better embeddings are produced by paying attention to longer walks, as its softmax Q is uniform-like, with a slight right-skew.
6.3.2 Sensitivity Analysis
So far, we have removed two hyper-parameters: the maximum window size C and the form of the context distribution. In exchange, we have introduced other hyper-parameters, specifically the walk length (also C) and a regularization term for the softmax attention model. Nonetheless, we show that our method is robust to various choices of these two. Figures 6.2a and 6.2b both show that the softmax attention weights drop to almost zero if the graph can be preserved using shorter walks, which is not possible with fixed-form distributions. Figure 6.4 examines this relationship in more detail for d = 128 dimensional embeddings, sweeping our hyper-parameters C and the regularization coefficient, and comparing results to the best and worst node2vec embeddings for C ∈ [1, 10]. (Note that the node2vec lines are horizontal, as they do not depend on the regularization coefficient.) We observe that all the accuracy metrics are within 1% to 2% when varying these hyper-parameters, and are all still well above our baselines (which sample from a fixed-form context distribution).
Figure 6.4: Sensitivity Analysis of the softmax attention model (test ROC-AUC versus the regularization coefficient on ca-AstroPh, ca-HepTh, PPI, and wiki-vote, for softmax with C ∈ {5, 10, 20, 30} and node2vec with its best and worst C). Our method is robust to choices of both the regularization coefficient and C. We note that it consistently outperforms even an optimally set node2vec.
6.3.3 Node Classification Experiments
We conduct node classification experiments on two citation datasets, Cora and Citeseer, with the following statistics: Cora contains 2,708 nodes, 5,429 edges, and K = 7 classes; Citeseer contains 3,327 nodes, 4,732 edges, and K = 6 classes. We learn embeddings from only the graph structure (nodes and edges), without observing node features or labels during training. Figure 6.3 shows a t-SNE visualization of the Cora dataset, comparing our method with node2vec [Grover and Leskovec, 2016]. For classification, we follow the data splits of Yang et al. [2016a]. We predict labels L̃ ∈ R^{|V|×K} as L̃ = exp(β g(Z)) L_train, where L_train ∈ {0, 1}^{|V|×K} contains one-hot label rows for nodes in the training set and zero rows elsewhere. The scalar β ∈ R is manually tuned on the validation set. The classification results, summarized in Table 6.3c, show that our model learns a better unsupervised representation than previous methods, one that can then be used for supervised tasks.
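A minimal sketch of this prediction rule follows; the function name is illustrative, and the tuned scalar is written beta to match the prediction rule above.

import numpy as np

def smooth_knn_predict(scores, labels_train, beta=1.0):
    # L_tilde = exp(beta * g(Z)) @ L_train : a weighted vote over known labels,
    # where scores = g(Z) in R^{|V| x |V|} (e.g. L @ R.T) and labels_train is
    # one-hot for training nodes and all-zero for the remaining nodes.
    weights = np.exp(beta * scores)
    votes = weights @ labels_train
    return votes.argmax(axis=1)      # predicted class index per node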
We do not compare against other semi-supervised methods that utilize node features during training and inference [incl. Yang et al., 2016a, Kipf and Welling, 2017], as our method is unsupervised. Our classication prediciton function contains one scalar parameter. It can be thought of a “smooth” k-nearest-neighbors, as it takes a weighted average of known labels, where the weights are exponential of the dot-product similarity. Such a simple function should introduce no model bias. 83 Chapter7 “Convexied”GraphNeuralNetwork Chapters 4, 5 and 6 have discussed strong models for solving two machine learning tasks: link predic- tion and node classication. Beside the strength of the discussed models among many others, the train- ing requires signicant computational resources, e.g., for calculating gradients via backprop over many data epochs. Meanwhile, Singular Value Decomposition (SVD) can nd closed-form solutions to convex problems, using merely a handful of epochs. This chapter outlines a method for making GRL more com- putationally tractable for those with modest hardware. We design a framework that computes SVD of implicitly dened matrices, and apply this framework to several GRL tasks. For each task, we derive linear approximation of a SOTA model, where we design (expensive-to-store) matrixM and train the model, in closed-form, via SVD of M, without calculating entries of M. By converging to a unique point in one step, and without calculating gradients, our models show competitive empirical test performance over various graphs such as article citation and biological interaction networks. The contents of this chapter can be found standalone in [Abu-El-Haija et al., 2021a] and the source-code for reproducing the results is available on http://github.com/samihaija/tf-fsvd We consider two model families: message passing for node classication and network embedding for link prediction. For each, we pick a popular model that we: (i) linearize and (ii) and switch its training 84 objective toFrobeniusnormerrorminimization. These simplications can cast the training into nding the optimal parameters in closed-form. 7.1 ConvexiedGNNs: Intuition Many recent graph representation learning (GRL) models are creative and theoretically-justied [Kipf and Welling, 2017, Hamilton et al., 2017, Veličković et al., 2018, Qiu et al., 2018, Xu et al., 2019, Abu-El-Haija et al., 2019, Chen et al., 2020]. Unfortunately, however, they contain hyperparameters that need to be tuned (such as learning rate, regularization coecient, depth and width of the network), and training takes a long time e.g. minutes even on small academic datasets. On the other hand, Truncated Singular Value Decomposition (SVD) provides solutions to a variety of mathematical problems, including computing a matrix rank, its pseudo-inverse, or mapping its rows and columns onto the orthonormal singular bases for low-rank approximations. Machine Learning (ML) soft- ware frameworks (such as TensorFlow) oer ecient SVD implementations, as SVD can estimate solutions for a variaty of tasks, e.g., in computer vision [Turk and Pentland, 1991], weather prediction [Molteni et al., 1996], recommendation [Koren et al., 2009], language [Deerwester et al., 1990, Levy and Goldberg, 2014], and more-relevantly, graph representation learning (GRL) [Qiu et al., 2018]. 
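As a point of reference, the rank-k truncation that this chapter builds on is available in off-the-shelf routines; the sketch below uses SciPy on an explicitly materialized matrix, whereas the framework developed in this chapter avoids ever forming the matrix. The matrix here is a random stand-in for a design matrix.

import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
M = rng.standard_normal((500, 300))      # stand-in for a design matrix M
k = 32
U, s, Vt = svds(M, k=k)                  # truncated SVD: U (500 x k), s (k,), Vt (k x 300)
M_k = (U * s) @ Vt                       # best rank-k approximation under the Frobenius norm
print(np.linalg.norm(M - M_k, "fro") / np.linalg.norm(M, "fro"))

By the Eckart–Young theorem (reviewed in Section 7.2.3), no rank-k matrix achieves a smaller Frobenius-norm error than this truncation.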
SVD’s benets include training models, without calculating gradients, to arrive at globally-unique so- lutions, optimizing Frobenius-norm objectives (§7.2.3), without requiring hyperparameters for the learning process, such as the choice of the learning algorithm, step-size, regularization coecient, etc. Typically, one constructs adesignmatrixM, such that, its decomposition provides a solution to a task of interest. Unfortunately, existing popular ML frameworks [Abadi et al., 2016, Paszke et al., 2019] cannot calculate the SVD of an arbitrary linear matrix given its computation graph: they compute the matrix (entry-wise) then ∗ its decomposition. This limits the scalability of these libraries in several cases of interest, such as in ∗ TensorFlow can caluclate matrix-free SVD if one implements aLinearOperator, as such, our code could be re-implemented as a routine that can convertTensorGraph toLinearOperator. 85 GRL, when explicit calculation of the matrix is prohibitive due to memory constraints. These limitations render SVD as impractical for achieving state-of-the-art (SOTA) for tasks at hand. This has been circum- vented by Qiu et al. [2018] by samplingM entry-wise, but this produces sub-optimal estimation error and experimentally degrades the empirical test performance (§7.6: Experiments). We design a software library that allows symbolic denition ofM, via composition of matrix oper- ations, and we implement an SVD algorithm that can decomposeM from said symbolic representation, without need to computeM. This is valuable for many GRL tasks, where the design matrixM is too large, e.g., quadratic in the input size. With our implementation, we show that SVD can perform learning, orders of magnitudes faster than current alternatives. SVD allows us to (i) quickly train (ii) competitive GRL models by posing convex objectives and estimating optimal solutions in closed-form, hence (iii)relieving practitioners from hyperparameter tuning or convergence checks. SVD periodically appears within powerful yet simple methods, competing on state-of-the-art. The common practice is to design a matrixM, such that its decomposition (via SVD), provides an estimate for learning a model given an objective. For instance, Levy and Goldberg [2014] show that the learning of NLP skipgram models such as word2vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014], can be approximated by the SVD of a Shifted Positive Pointwise Mutual Information matrix. In GRL, Chen et al. [2017], Qiu et al. [2018], Abu-El-Haija et al. [2018] have approximated methods of DeepWalk [Perozzi et al., 2014] and Node2Vec [Grover and Leskovec, 2016] via decomposition of some matrix M. However, their decomposition requires M to be either (a) exactly calculated or (b) sampled entry-wise, but (a) is unnecessarily expensive for real-world large networks (due to Small World Phe- nomenon, [Travers and Milgram, 1969]) and (b) incurs unnecessary estimation errors. On the other hand, randomized algorithms can decomposeany matrixMwithoutexplicitlyknowingM. Specically, it is suf- cient to provide a functionf M (:) =hM;:i that can multiplyM with arbitrary vectors (§7.5). We, argue 86 that if the popular frameworks (e.g., TensorFlow) implement a functional SVD, that acceptf M (:) rather thanM, then modern practitioners may nd it useful. §7.2 reviews powerful GRL methods that areconvexied in §7.3, allowing the use of (randomized) SVD for obtaining (approximate) optimum solutions. The contributions of this chapter are: 1. 
A functor implementation (§7.5) of the randomized SVD algorithm of Halko et al. [2009]. 2. Node embedding and message passing models are approximated via SVD (§7.3), showing competitive performance with state-of-the-art, yet much faster to train (§7.6). 3. An analysis that learning is fast and approximation error can be made arbitrarily small (§7.5.4.2). 7.2 Refresher: SVD&ModelClassestobeconvexied We review model classes: (1) network embedding and (2) message passing that we dene as follows. The rst inputs a graph (A,X) and outputs node embedding matrixZ2R nz withz-dimensions per node.Z is then used for an upstream task, e.g., link prediction. The second class utilizes a functionH :R nn R nd !R nz where the functionH(A;X) is usually directly trained on the upstream task, e.g., node classication. In general, the rst class is transductive while the second is inductive. 7.2.1 NetworkembeddingmodelsbasedonDeepWalk&node2vec The seminal work of DeepWalk [Perozzi et al., 2014] embeds nodes of a network using a two-step process: (i) simulate random walks on the graph – each walk generating a sequence of node IDs then (ii) pass the walks (node IDs) to a language word embedding algorithm, e.g. word2vec [Mikolov et al., 2013], as-if each walk is a sentence. This work was extended by node2vec [Grover and Leskovec, 2016] among others. It has 87 been shown by Abu-El-Haija et al. [2018] that the learning outcome of the two-step process of DeepWalk is equivalent, in expectation, to optimizing a single objective † : min Z=fL;Rg X (i;j)2[n][n] E qQ [T q ] log(LR > )(1A) log 1(LR > ) ij ; (7.1) whereL;R2R n z 2 are named by word2vec as the input and output embedding matrices, is Hadamard product, and the log(:) and the standard logistic(:) = (1 + exp(:)) 1 are applied element-wise. The objective above is weighted cross-entropy where the (left) positive term weighs the dot-productL > i R j by the (expected) number of random walks simulated fromi and passing throughj, and the (right) negative term weighs non-edges (1A) by scalar2R + . The context distributionQ stems from step (ii) of the process. In particular, word2vec accepts hyperparametercontextwindowsizeC for its stochastic sampling: when it samples a center token (node ID), it then samples its context tokens that are up-to distancec from the center. The integerc is sampled from a coin ip uniform on the integers [1; 2;:::;C] – as detailed by Sec.3.1 of [Levy et al., 2015]. Therefore, P Q (q j C) / Cq+1 C . Since q has support on [C], then P Q (qjC) = 2 (C+1)C Cq+1 C . 7.2.2 Messagepassinggraphnetworksfor(semi-)supervisednodeclassication We are also interested in a class of (message passing) graph network models taking the general form: forl = 0; 1;:::L: H (l+1) = l g(A)H (l) W (l) ; H (0) =X; H =H (L) ; (7.2) whereL is the number of layers, W (l) ’s are trainable parameters, l ’s denote element-wise activations (e.g. logistic or ReLu), andg is some (possibly trainable) transformation of adjacency matrix. GCN [Kipf and Welling, 2017] setg(A) = b A, GAT [Veličković et al., 2018] setg(A) = A MultiHeadedAttention † Derivation is in [Abu-El-Haija et al., 2018]. Unfortunately, matrix in Eq. 7.1 is dense withO(n 2 ) nonzeros. 88 and GIN [Xu et al., 2019] asg(A) =A + (1 +)I with> 0. For node classication, it is common to set L = softmax (applied row-wise), specify the size ofW L s.t. H2R ny wherey is number of classes, and optimize cross-entropy objective: . 
min fW j g L j=1 [Y logH (1Y) log(1H)]; whereY is a binary matrix with one-hot rows indicating node labels. In semi-supervised settings where not all nodes are labeled, before measuring the objective, subset of rows can be kept inY andH that correspond to labeled nodes. 7.2.3 TruncatedSingularValueDecomposition(SVD) SVD is an algorithm that approximates any matrixM2R rc as a product of three matrices: SVD k (M), arg min U;S;V jjMUSV > jj F subject to U > U =V > V =I k ; S = diag(s 1 ;:::;s k ): Theorthonormal matricesU2R rk andV2R ck , respectively, are known as the left- and right-singular bases. The values along diagonal matrixS2R kk are known as the singular values. Due to theorem of Eckart and Young [1936], SVD recovers the best rank-k approximation of inputM, as measured by the Frobenius normjj:jj F . Further, ifk rank(M))jj:jj F = 0. Popular SVD implementations follow Random Matrix Theory algorithm of Halko et al. [2009]. The pro- totype algorithm starts with a random matrix and repeatedly multiplies it byM and byM > , interleaving these multiplications with orthonormalization. Our SVD implementation (in Appendix) also follows the prototype of [Halko et al., 2009], but with two modications: (i) we replace the recommended orthonor- malization step from QR decomposition to Cholesky decomposition, giving us signicant computational speedups and (ii) our implementation accepts symbolic representation ofM (§7.4), in lieu of its explicit value (contrast to TensorFlow and PyTorch, requiring explicitM). In §7.3, we derive linear rst-order approximations of models reviewed in §7.2.1 & §7.2.2 and explain how SVD can train them. In Chapter 8, we show how they can be used to initialize deeper models. 89 7.3 Convexrst-orderapproximationsofGRLmodels 7.3.1 ConvexicationofNetworkEmbeddingModels We can interpret objective 7.1 as self-supervised learning, since node labels are absent. Specically, given a nodei2 [n], the task is to predict its neighborhood as weighted by the row vectorE q [T q ] i , representing the subgraph ‡ aroundi. Another interpretation is that Eq. 7.1 is a decomposition objective: multiplying the tall-and-thin matrices, asLR > 2R nn , should give a larger value at (LR > ) ij =L > j R i when nodes i andj are well-connected but a lower value when (i;j) is not an edge. We propose a matrix such that its decomposition can incorporate the above interpretations: c M (NE) =E qjC [T q ](1A) = 2 (C + 1)C C X q=1 Cq + 1 C T q (1A) (7.3) If nodes i;j are nearby, share a lot of connections, and/or in the same community, then entry c M (NE) ij should be positive. If they are far apart, then c M (NE) ij =. To embed the nodes onto a low-rank space that approximates this information, one can decompose c M (NE) into two thin matrices (L;R): LR > c M (NE) () (LR > ) i;j =hL i ;R j i c M (NE) ij for all i;j2 [n]: (7.4) SVD gives low-rank approximations that minimize the Frobenius norm of error (§7.2.3). The remaining challenge is computational burden: the right term (1A), a.k.a, graph compliment, has n 2 non-zero entries and the left term has non-zero at entry (i;j) if nodes i;j are within distance C away, as q has support on [C] – for reference Facebook network has an average distance of 4 [Backstrom et al., 2012] i.e. yieldingT 4 withO(n 2 ) nonzero entries – Nonetheless, Section §7.4 presents a framework for decomposing c M from its symbolic representation, without explicitly computing its entries. Before moving forward, we ‡ Eq[T q ]i is a distribution on[n]: entryj equals prob. 
of walk starting ati ending atj if walk lengthU[C]. 90 note that one can replaceT in Eq. 7.3 by its symmetrically normalized counterpart b A, recovering a basis whereL =R. This symmetric modeling might be emperically preferred for undirected graphs. Learning can be performed via SVD. Specically, the node at thei th row and the node at thej th th column will be embedded, respectively, inL i andR j computed as: U;S;V SVD k ( c M (NE) ); L US 1 2 ; R VS 1 2 (7.5) In thisk-dim space of rows and columns, Euclidean measures are plausible:Inference of nodes’ similarity at rowi and columnj can be modeled asf(i;j) =hL i ;R j i =U > i SV j ,hU i ;V j i S . 7.3.2 Convexicationofmessagepassinggraphnetworks Removing all l ’s from Eq. 7.2 and settingg(A) = b A gives outputs of layers 1, 2, andL, respectively, as: b AXW (1) , b A 2 XW (1) W (2) , and b A L XW (1) W (2) :::W (L) : (7.6) Without non-linearities, adjacent parameter matrices can be absorbed into one another. Further, the model output can concatenate all layers, like JKNets [Xu et al., 2018], giving nal model output of: H (NC) linearized = X b AX b A 2 X ::: b A L X c W , c M (NC) c W; (7.7) where the linearized model implicitly constructs design matrix c M (NC) 2 R nF and multiplies it with parameter c W 2 R Fy – here, F = d + dL. Crafting design matrices is a creative process (§8.2.3). 91 Learning can be performed by minimizing the Frobenius norm:jjH (NC) Yjj F =jj c M (NC) c WYjj F . Moore-Penrose Inverse (a.k.a, the psuedoinverse) provides one such minimizer: c W = argmin c W c M c WY F = c M y YVS + U > Y; (7.8) withU;S;V SVD k ( c M). NotationS + reciprocates non-zero entries of diagonalS [Golub and Loan, 1996]. Multiplications in the right-most term should, for eciency, be executed right-to-left. The pseu- doinverse c M y VS + U > recovers the c W with least norm (§7.5.4.2, Theorem 5). The becomes = whenk rank( c M). In semi-supervised settings, one can take rows subset of either (i)Y andU, or of (ii)Y andM, keep- ing only rows that correspond to labeled nodes. Option (i) is supported by existing frameworks (e.g., tf.gather()) and our symbolic framework (§7.4) supports (ii) by implicit row (or column) gather – i.e., calculating SVD of submatrix ofM without explicitly computingM nor the submatrix. Inference over a (possibly new) graph (A;X) can be calculated by (i) (implicitly) creating the design matrix c M correspond- ing to (A;X) then (ii) multiplying by the explicitly calculated c W . As explained in §7.4, c M need not to be explicitly calculated for computing multiplications. 7.4 Symbolicmatrixrepresentation To compute the SVD of any matrix M using algorithm prototypes presented by Halko et al. [2009], it suces to provide functions that can multiply arbitrary vectors with M and M > , and explicit calculation ofM isnotrequired. Our software framework can symbolically representM as a directed acyclic graph (DAG) of computations. On this DAG, each node can be one of two kinds: 1. Leaf node (no incoming edges) that explicitly holds a matrix. Multiplications against leaf nodes are directly executed via an underlying math framework (we utilize TensorFlow). 92 a = scipy.sparse.csr_matrix(...) 
d = scipy.sparse.diags(a.sum(axis=1)) t = (1/d).dot(a) t, a = F.leaf(t), F.leaf(a) row1 = F.leaf(tf.ones([1, a.shape[0]])) q1, q2, q3 = np.array([3, 2, 1]) / 6.0 M = q1∗ t + q2∗ t@t + q3∗ t@t@t M−= lamda∗ (row1.T @ row1− A) a 1 T t ^ ^ • • + - 3 2 T x - • • a 1 T t x x • • + - T x - • • a 1 T t ^ ^ • • + - 3 2 T x - • • a 1 T t x x • • + - T x - • • Figure 7.1: Symbolic Matrix Representation. Left: code using our framework to implicitly construct the design matrix M = c M (NE) with our framework. Center: DAG corresponding to the code. Right: An equivalent automatically-optimized DAG (via lazy-cache, Fig. 7.2) requiring fewer oating point opera- tions. The rst 3 lines of code create explicit input matrices (that t in memory): adjacencyA, diagonal degreeD, and transitionT . Matrices are imported into our framework with F.leaf (depicted on com- putation DAGs in blue). Our classes overloads standard methods (+, -, *, @, **) to construct computation nodes (intermediate in grey). The output node (in red) needs not be exactly calculated yet can be eciently multiplied by any matrix by recursive downward traversal. 2. Symbolic node that only implicitly represents a matrix as as a function of other DAG nodes. Multiplications are recursively computed, traversing incoming edges, until leaf nodes. For instance, suppose leaf DAG nodesM 1 andM 2 , respectively, explicitly contain row vector2R 1n and column vector2R n1 . Then, their (symbolic) product DAG nodeM = M 2 @M 1 is2R nn . Although storing M explicitly requiresO(n 2 ) space, multiplications against M can remain withinO(n) space if eciently implemented ashM;:i =hM 2 ;hM 1 ;:ii. Figure 7.1 shows code snippet for composing DAG to represent symbolic node c M (NE) (Eq. 7.3), from leaf nodes initialized with in-memory matrices. 7.5 ImplementationofFunctionalSingularValueDecomposition We do not have to explicitly calculate the matrices c M. Rather, we only need to implement product func- tionsf c M (v) =h c M;vi that can multiply c M with arbitrary (appropriately-sized) vectorv. We implement a (TensorFlow)functional version of the randomized SVD algorithm of Halko et al. [2009], that accepts f c M rather than c M. We show that it can train our models quickly and with arbitrarily small approxima- tion error (in linear time of graph size, in practice, with less than 10 passes over the data) and can yield 93 DDI FB AstroPh PPI 0.00 0.25 0.50 0.75 1.00 ( ☓; QR) ( ☓; Cholesky) ( ✓; Cholesky) Figure 7.2: SVD runtime congs of (lazy caching; orthonormalization) as a ratio of SVD’s common default (QR decomposition) l2-regularized solutions for classication (see Appendix). We now need the (straightforwardf c M (WYS) and f c M (JKN) . We leave the second outside this writing. For the rst, the non-edges term, (1A), can be re-written by explicit broadcasting as (11 > A) giving f c M (WYS) (v) = X i (T ) i v |{z} O(im) c i 1 (1 > v) | {z } O(n) + Av |{z} O(m) : (7.9) All matrix-vector products can be eciently computed whenA is sparse. 7.5.1 CalculatingSVD SVD ofM yields its left and right singular orthonormal (basis) vectors, respectively, in columns ofU andV. SinceU andV, respectively, are the eigenvectors ofMM > andM > M, then perhaps the most intuitive algorithms for SVD are variants of thepoweriteration, including Arnoldi iteration [Arnoldi, 1951] and Lanczos algorithm [Lanczos, 1950]. In practice, randomized algorithms for estimating SVD run faster 94 than these variants, including the algorithm of Halko et al. [2009] which is implemented in scikit-learn. 
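Outside the thesis' TensorFlow framework, the same matrix-free interface can be sketched with SciPy's LinearOperator, which likewise only needs matvec/rmatvec routines. The snippet below is a hypothetical illustration in the spirit of Eq. 7.9 (the coefficients q and negative weight lam are placeholder values), not the authors' implementation:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def wys_like_operator(A, q, lam):
    """Matrix-free M = sum_i q[i] * T^(i+1) - lam * (1 1^T - A), where T is the
    row-normalized transition matrix; only products with M and M^T are exposed."""
    n = A.shape[0]
    deg = np.asarray(A.sum(axis=1)).ravel()
    T = sp.diags(1.0 / np.maximum(deg, 1)) @ A
    ones = np.ones(n)

    def matvec(v):
        v = np.asarray(v).ravel()
        out = -lam * (ones * v.sum() - A @ v)   # negative (non-edge) term
        p = v
        for qi in q:                            # context (walk) terms
            p = T @ p
            out = out + qi * p
        return out

    def rmatvec(v):                             # multiplication by M^T
        v = np.asarray(v).ravel()
        out = -lam * (ones * v.sum() - A.T @ v)
        p = v
        for qi in q:
            p = T.T @ p
            out = out + qi * p
        return out

    return LinearOperator((n, n), matvec=matvec, rmatvec=rmatvec)

# Usage sketch: A is a scipy CSR adjacency; q and lam are hypothetical values.
# M_op = wys_like_operator(A, q=[0.5, 0.33, 0.17], lam=1.0)
# U, s, Vt = svds(M_op, k=32)   # SVD without ever materializing M

Design-wise, this mirrors the point made next: SVD routines built on power or randomized iterations only ever touch M through such products.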
None of these methods require individual access toM’s entries, but rather, require two operations: ability to multiply any vector withM and withM > . Therefore, it is only a practical gap that we ll in this section: we open-source a TensorFlow implementation that accept product and transpose operators. 7.5.2 TensorFlowImplementation Since we do not explicitly calculate the c M matrices displayed in Equations 7.3 and 7.7, as doing so con- sumes quadratic memoryO(n 2 ), we implement a functional form SVD of the celebrated randomized SVD algorithm of Halko et al. [2009]. To run our Algorithm 7, one must specifyk2N + (rank of decomposition), as well as functionsf;l;s that the program provider promises they operate as: 1. Product functionf that exactly computesf(v) =hM;vi for anyv2R c (recall:M2R rc ) 2. Transpose § functiont.8v2R r , (tf)(v) =hM > ;vi 3. Shape (constant) functions that knows and returns (r;c). Once transposes, should return (c;r). Algorithm7 Functional Randomized SVD, following prototype of Halko et al. [2009] 1: input: product fnf :R c !R r , transpose fnt : (R c !R r )! (R c !R r ), shape fns, rankk2N + 2: procedure fSVD(f;t;s;k) 3: (r;c) s() 4: QN (0; 1) c2k . IID Gaussian. Shape: (c 2k) 5: fori 1toiterationsdo 6: Q; _ tf.linalg.qr(f(Q)) . (r 2k) 7: Q; _ tf.linalg.qr((tf)(Q)) . (c 2k) 8: Q; _ tf.linalg.qr(f(Q)) . (r 2k) 9: B ((tf)(Q)) > . (2kc) 10: U;s;V > tf.linalg.svd(B) 11: U QU . (r 2k) 12: return U[:; :k];s[:k];V [:; :k] > § An alternative tot can be a left-multiply functionl(u) =hu;Mi, however, in practice, TensorFlow is optimized for CSR matrices and computationally favors sparse-times-dense 95 7.5.3 Timeimprovements We replaced the recommended orthonormalization of [Halko et al., 2009] from QR decomposition, to Cholesky decomposition. Further, we implementedcaching to avoid computing sub-expressions if already calculated. Speed-ups are shown in Fig. 7.2. Details are in the Appendix of [Abu-El-Haija et al., 2021b] 7.5.4 Analysis 7.5.4.1 NormRegularizationofWideModels If c M is too wide, then we neednot to worry much about overtting, due to the following Theorem. Theorem5. (Min. Norm)Ifsystem c M c W =Yisunderdetermined ¶ withrowsof c Mbeinglinearlyindepen- dent, then solution space c W = n c W c M c W =Y o has innitely many solutions. Then, fork rank( c M), matrix c W , recovered by Eq.7.8 satises: c W = argmin c W2 c W jj c Wjj 2 F . Theorem 5 implies that, even though one can design a wide c M (NC) (Eq.7.7), i.e., with many layers, the recovered parameters with least norm should be less prone to overtting. Recall that this is the goal of L2 regularization. Analysis and proofs are in the Appendix. Proof AssumeY = y is a column vector (the proof can be generalized to matrixY by repeated column-wise application ∥ ). SVD( c M;k),k rank( c M), recovers the solution: c W = c M y y = c M > c M c M > 1 y: (7.10) TheGrammatrix c M c M > is nonsingular as the rows of c M are linearly independent. To prove the claim let us rst verify that c W 2 c W : c M c W = c M c M > c M c M > 1 y =y: ¶ E.g., if the number of labeled examples i.e. height ofM andY is smaller than the width ofM. ∥ The minimizer for the Frobenius norm is composed, column-wise, of the minimizersargmin c MW :;j =Y :;j jjW:;jjj 2 2 for allj. 96 Let c W p 2 c W . We must show thatjj c W jj 2 jj c W p jj 2 . Since c M c W p = y and c M c W = y, their subtraction gives: c M( c W p c W ) = 0: (7.11) It follows that ( c W p c W )? 
c W : ( c W p c W ) > c W = ( c W p c W ) > c M > c M c M > 1 y = ( c M( c W p c W )) > | {z } =0 due to Eq. 7.11 c M c M > 1 y = 0 Finally, using Pythagoras Theorem (due to?): jj c W p jj 2 2 =jj c W + c W p c W jj 2 2 =jj c W jj 2 2 +jj c W p c W jj 2 2 jj c W jj 2 2 As a consequence, solution for classication models recovered by SVD follow a strong standard Gaus- sian prior, which may be regarded as a form of regularization. 7.5.4.2 ComputationalComplexityandApproximationError Theorem6. (Linear Time) Functional SVD (Alg. 7) trains our convexied GRL models in time linear in the graph size. Proof of Theorem 6 for our two model families: 1. For rank-k SVD overf c M (WYS) : Let cost of runningf c M (WYS) = T mult . The run-time to compute SVD, as derived in Section 1.4.2 of [Halko et al., 2009], is: O(kT mult + (r +c)k 2 ): (7.12) 97 Sincef c M (WYS) can be dened asC (context window size) multiplications with sparsenn matrixT withm non-zero entries, then running fSVD(f c M (WYS) ;k) costs: O(kmC +nk 2 ) (7.13) 2. For rank-k SVD overf c M (JKN) : Suppose feature matrix containsd-dimensional rows. One can calculate c M (JKN) 2R nLd withL sparse multiplies inO(Lmd). Calculating and running SVD [see Section 1.4.1 of Halko et al., 2009] on c M (JKN) costs total of: O(ndL log(k) + (n +dL)k 2 +Lmd): (7.14) Therefore, training time is linear inn andm. Contrast with methods of WYS [Abu-El-Haija et al., 2018] (Chapter 6) and NetMF [Qiu et al., 2018], which require assembling a densenn matrix requiringO(n 2 ) time to decompose. However, using a linear-time decomposition, how much error are we sacricing compared to the full quadratic SVD? The following bounds the error. Theorem 7. (Exponentially-decaying Approx. Error) Rank-k randomized SVD algorithm of Halko et al. [2009] gives an approximation error that can be brought down, exponentially-fast, to no more than twice of the approximation error of the optimal (true) SVD. Proof is in Theorem 1.2 of Halko et al. [2009]. Consequently, compared to ^ NetMF of [Qiu et al., 2018], which incurs unnecessary estimation error, our estimation error can be brought-down exponentially by increasing theiters parameter of Alg. 7 7.6 Experiments 98 10 0 10 1 10 2 10 3 Train Runtime (seconds) 65 70 75 80 85 90 95 100 Test ROC AUC Datasets: AstroPh Facebook HepTh PPI Methods: fSVD(k=32) fSVD(k=100) NetMF NetMF WYS n2v Figure 7.3: ROC-AUC versus train time of methods on datasets. Each dataset has a distinct shape, with shape size proportional to graph size. Each method uses a dierent color. Our methods are in blue (dark uses SVD rankk = 32, light usesk = 100, trading estimation accuracy for train time). Ideal methods should be placed on top-left corner (i.e., higher test ROC-AUC and faster training). 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Num Layers (=L) 0.5 0.6 0.7 0.8 Test Accuracy Citeseer Cora Pubmed 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 fSVD Rank (=k) 0.75 0.80 0.85 0.90 0.95 Test ROC AUC AstroPh HepTh PPI Facebook Figure 7.4: Sensitivity Analysis. Left: Test Accuracy VS Depth of model dened by c M (JKN) . Right: Test ROC-AUC VS rank of SVD on c M (WYS) . 7.6.1 Datasets We apply our functional SVD on popular datasets that can be trained using our simplied (i.e., convexied) models. Specically either (1) semi-supervised node-classication datasets, where features are present, or (2) link-prediction datasets where features are absent. It is possible to convexify other setups, e.g., link prediction when node features are present, but we leave this as future work. 
We run experiments on seven graph datasets: • Protein-Protein Interactions (PPI): a large graph where every node is a protein and an edge between two nodes indicate that the two proteins interact. Processed version of PPI was downloaded from [Grover and Leskovec, 2016]. 99 Table 7.1: Test Performance. Left: accuracy (training time) for Semi-supervised Node Classication, over citation datasets. Right: ROC-AUC for link prediction when embedding withz=64=2k. Cora Citeseer Pubmed Planetoid 75.7 (13s) 64.7 (26s) 77.2 (25s) GCN 81.5 (4s) 70.3 (7s) 79.0 (83s) GAT 83.2 (1m23s) 72.4 (3m27) 77.7 (5m33s) MixHop 81.9 (26s) 71.4 (31s) 80.8 (1m16s) GCNII 85.5 (2m29s) 73.4 (2m55s) 80.3 (1m42s) f c M (JKN) 82.4 (0.28s) 72.2 (0.13s) 79.7 (0.14s) FB AstroPh HepTh PPI WYS 99.4 97.9 93.6 89.8 n2v 99.0 97.8 92.3 83.1 NetMF 97.6 96.8 90.5 73.6 ^ NetMF 97.0 81.9 85.0 63.6 f c M (WYS) 98.7 92.1 89.2 87.9 (k=100) 98.7 96.0 90.5 86.2 • Three citation networks that are extremely popular: Cora, Citeseer, Pubmed. Each node is an article and each (directed) edge implies that an article cites another. Additionally, each article is accompa- nied with a feature vector (containing NLP-extracted features of the article’s abstract), as well as a label (article type). • Two collaboration datasets: ca-AstroPh and ca-HepTh, where nodes are researchers and an edge between two nodes indicate that the researchers co-published together at least one article, in the areas Astro-Physics and High Energy Physics. • ego-Facebook: an ego-centered social network. For citation networks, we processed node features and labels. For all other datasets, we did not process features during training nor inference. For train/validation/test partitions: we used the splits of Yang et al. [2016a] for Citeseer, Cora, Pubmed; we used the splits of Abu-El-Haija et al. [2018] for PPI, Facebook, ca- AstroPh and ca-HepTh; we used the splits of OGB [Hu et al., 2020] for ogbl-ddi. All datasets and statistics are summarized in Table 7.2. In §7.6.2 and §7.6.3, unless otherwise noted, we download authors’ source code from github, modify ∗∗ it to record wall-clock run-time, and run on GPU NVidia Tesla k80. Thankfully, downloaded code has one script to run each dataset, or hyperparameters are clearly stated in the source paper. ∗∗ Modied les are in our code repo 100 Table 7.2: Dataset Statistics Dataset Nodes Edges Source PPI 3,852 proteins 20,881 chemical interactions http://snap.stanford.edu/node2vec ego-Facebook 4,039 users 88,234 friendships ca-AstroPh 17,903 researchers 197,031 co-authorships http://snap.stanford.edu/data ca-HepTh 8,638 researchers 24,827 co-authorships Cora 2,708 articles 5,429 citations Planetoid [Yang et al., 2016a] Citeseer 3,327 articles 4,732 citations Pubmed 19,717 articles 44,338 citations 7.6.2 Semi-supervisedNodeClassication We consider a transductive setting where a graph is entirely visible (all nodes and edges). Additionally, some of the nodes are labeled. The goal is to recover the labels of unlabeled nodes. All nodes have feature vectors. Baselines: We download code of GAT [Veličković et al., 2018], MixHop [Abu-El-Haija et al., 2019], GCNII [Chen et al., 2020] and re-ran them, with slight modications to record training time. However, for baselines Planetoid [Yang et al., 2016a] and GCN [Kipf and Welling, 2017], we copied them from the GCN paper [Kipf and Welling, 2017]. In these experiments, to train our method, we run our functional SVD twice per graph. 
We take the feature matrixX bundled with the datasets, and concatenate to it two matrices, b L and b R, calculated per Equation 7.5: the calculation itself invokes our functional SVD (the rst time) onf c M (WYS) with rank = 32. Hyperparameters off c M (WYS) are (negative coecient) andC (context window-size). After concatenating b L and b R intoX, we PCA the resulting matrix to 1000 dimensions, which forms our newX. Finally, we express our model as the linearL-layer messaging passing network (Eq. 7.7) and learn its parameters via rankk SVD onf c M (NC) (the second time), as explained in §7.3.2. We use the validation partition to tuneL, k,, andC. 101 Table 7.1 (left) summarizes the performance of our approach (f c M (JKN) ) against aforementioned baselines, showing both test accuracy and training time. While our method is competitive with state-of-the-art, it trains much faster. 7.6.3 ROC-AUCLinkPrediction Given a partial graph: only of a subset of edges are observed. The goal is to recover unobserved edges. This has applications in recommender systems: when a user expresses interest in products, the system wants to predict other products the user is interested in. The task is usually setup by partitioning the edges of the input graph into train and test edges. Further, it is common to samplenegativetestedges e.g. uniformly from the graph compliment. Lastly, a GRL method for link prediction can be trained on the train edges partition, then can be asked to score the test partition edges versus the negative test edges. The quality of the scoring can be quantied by a ranking metric, e.g., ROC-AUC. Baselines: We utilize code of WYS [Abu-El-Haija et al., 2018] and update it to for TensorFlow-2.0. We download code of Qiu et al. [2018] and denote their methods as NetMF and ^ NetMF, where the rst computes complete matrixM before SVD decomposition and the second sampleM entry-wise – the second is faster for larger graphs but sacrices on estimation error and performance. For node2vec (n2v), we use its PyG implementation [Fey and Lenssen, 2019]. Table 7.1 (right) summarizes results test ROC-AUC. For our method (denotedf c M (NE) ), we call our func- tional SVD (Alg. 7) and pass itf c M (NE) as dened in Eq. 7.3. Embeddings are set to the SVD basis (as in, §??) and edge score of nodes (u;v) is/ dot-product of embeddings. The last row of the table shows re- sults when svd rank =100. Lastly, we set the context window hyperparameter (a.k.a, length of walk) as follows. For WYS, we trained with their default context (as WYS learns the context), but for all others (NetMF, n2v, ours) we used context window of lengthC=5 for datasets Facebook and PPI (for us, this sets c = [5; 4; 3; 2; 1]) andC=20 for AstroPh and HepTh. 102 7.6.4 SensitivityAnalysis While in §7.6.2 we tune the number of layers (L) using the performance on the validation partition, in this section, we show impact of varyingL on test accuracy. According to the summary in Figure 7.4 (left), accuracy of classifying a node improves when incorporating information from further nodes. We see little gains beyond L > 6. Note that L = 0 corresponds to ignoring the adjacency matrix altogether when runningf c M (JKN) . Here, we xed =C = 1 and averaged 5 runs. The (tiny) error bars show the standard deviation. Further, while in §7.6.3 we do SVD onf c M (WYS) with rankk = 32 ork = 100, Figure 7.4 (right) shows test accuracy while sweepingk 32. In general, increasing the rank improves estimation accuracy and test performance. 
However, if k is larger than the inherit dimensionality of the data, then this could cause overtting (though perfect memorization of training edges). TheNormRegularization note (§7.5.4.1) applies only to pseudoinversion i.e. our classication models. 103 Chapter8 ConvexGNNsinitializeDeeperModels Chapter 7 presented linearization of strong GNN models, for link prediction and node classication, for the purposes of (1) training faster and (2) removing training hyperparameters while guaranteeing convergence. However, this came at a cost: the linear networks are shallow rather than deep, therefore reducing the representation capacity of the models and causing slightly worse emperical performance on the tasks at hand. This chapter presents how to “deepen” a shallow GNN model i.e. convert it into a deep network. The deeper model can then be ne-tuned by back-propagation. Overall, training a convexied model (per Chapter 7), deepening it, then ne-tuning it, gets the best of both worlds: a model that trains very fast yet acheives state-of-the-art empirical performance. The contents of this chapter can be found standalone in [Abu-El-Haija et al., 2021b] and the source-code for reproducing the results is available onhttps://github. com/samihaija/isvd This chapter shows how SVD can initialize a deeper model, that is architected to be non-linear almost everywhere, though behaves linearly when its parameters reside on a hyperplane, onto which SVD ini- tializes. The deeper model can then be ne-tuned within only a few epochs. Overall, our procedure trains hundreds of times faster than state-of-the-art methods, while competing on empirical test performance. 104 8.1 Intuition Currently, SOTA GRL models are generally graph neural networks trained to optimize cross-entropy ob- jectives. Their inter-layer non-linearities place their (many) parameters onto a non-convex objective sur- face where convergence is rarely veried ∗ . Nonetheless, these models can be convexied (Chapter 7) and trained via SVD, if we remove nonlinearities between layers and swap the cross-entropy objective with Frobenius norm minimization. Undoubtedly, such linearization incurs a drop of accuracy on empirical test performance. Nonetheless, we show that the (convexied) model’s parameters learned by SVD can provide initialization to deeper (non-linear) models, which then can be ne-tuned on cross-entropy ob- jectives. The non-linear models are endowed with our novel Split-ReLu layer, which has twice as many parameters as a ReLu fully-connected layer, and behaves as a linear layer when its parameters reside on some hyperplane (§8.2.2). Training on modest hardware (e.g., laptop with 8 GB RAM) is sucient for this learning pipeline (convexify! SVD! ne-tune) yet it trains much faster than current approaches that are commonly trained on expensive hardware. 8.2 SVDinitializationfordeepermodelsne-tunedviacross-entropy 8.2.1 Edgefunctionfornetworkembeddingasa(1-dimensional)Gaussiankernel SVD provides decent solutions to link prediction tasks. As we saw in the previous chapter (deningM (NE) per Equation 7.3), computing U;S;V SVD(M (NE) ) is much faster than training SOTA models for link prediction, yet, simple edge-scoring functionf(i;j) =hU i ;V j i S yields competitive empirical (test) performance. We proposef with =f;sg: f ;s (i;j) =E xN (;s) hU i ;V j i S x =U > i E x [S x ]V j =U > i Z S x N (xj;s)dx V j ; (8.1) ∗ Practitioners rarely verify thatr J = 0, whereJ is mean train objective and are model parameters. 
105 whereN is the truncated normal distribution (we truncate to = [0:5; 2]). The integral can be approx- imated by discretization and applying softmax (see §8.2.1.1). The parameters 2 R; 2 R >0 can be optimized on cross-entropy objective for link-prediction: min ;s E (i;j)2A [log ((f ;s (i;j)))]k (n) E (i;j)= 2A [log (1(f ;s (i;j)))]; (8.2) where the left- and right-terms, respectively, encouragef to score high for edges, and the low for non- edges. k (n) 2N >0 controls the ratio of negatives to positives per batch (we usek (n) = 10). If the opti- mization sets = 1 ands 0, thenf reduces to no-op. In fact, we initialize it as such, and we observe thatf convergeswithinoneepoch, on graphs we experimented on. If it converges as< 1, VS> 1, respectively, thenf would eectively squash, VS enlarge, the spectral gap. 8.2.1.1 ApproximatingtheintegralofGaussian1dkernel The integral from Equation 8.1 can be approximated using softmax, as: Z S x N (xj;s)dx = R S x exp 1 2 x s 2 dx R exp 1 2 y s 2 dy P x2 S x exp 1 2 x s 2 dx P y2 exp 1 2 y s 2 dy =fS x g x2 softmax x2 (x) 2 exp(s) where the rst equality expands the denition of truncated normal, i.e. divide by partition function, to make the mass within sum to 1. In our experiments, we use = [0:5; 2]. The comes as we use dis- cretized =f0:5; 0:505; 0:51; 0:515;:::; 2:0g (i.e., with 301 entries). Finally, the last expression contains a constant tensor (we create it only once) containingS raised to every power in , stored along (tensor) axis which gets multiplied (via tensor product i.e. broadcasting) against softmax vector (also of 301 entries, corresponding to ). We parameterize with two scalars;s2R i.e. implyings = log 1 2s 2 106 8.2.2 Split-ReLu(deep)graphnetworkfornodeclassication(NC) c W from SVD (Eq.7.8) can initialize anL-layer graph network with input:H (0) =X, with: message passing (MP) H (l+1) = h b AH (l) W (l) (p) i + h b AH (l) W (l) (n) i + ; (8.3) output H = l=L X l=0 h H (l) W (l) (op) i + h H (l) W (l) (on) i + (8.4) initialize MP W (l) (p) I;W (l) (n) I; (8.5) initialize output W (l) (op) c W [dl :d(l+1)] ;W (l) (on) c W [dl :d(l+1)] ; (8.6) Element-wise [:] + = max(0;:). Further,W [i :j] denotes rows from (i) th until (j1) th ofW. The deep network layers (Eq. 8.3&8.4) use our Split-ReLu layer which we formalize as: SplitReLu(X;W (p) ;W (n) ) = XW (p) + XW (n) + ; (8.7) where the subtraction is calculated entry-wise. The layer has twice as many parameters as standard fully- connected (FC) ReLu layer. In fact, learning algorithms can recover FC ReLu from SplitReLu by assigning W (n) = 0. More importantly, the layer behaves as linear inX whenW (p) =W (n) . On this hyperplane, this linear behavior allows us to establish the equivalency: the (non-linear) modelH is equivalent to the linearH linearized at initialization (Eq. 8.5&8.6) due to Theorem 8. Following the initialization, model can be ne-tuned on cross-entropy objective. as in §7.2.2. 8.2.3 CreativeAdd-onsfornodeclassication(NC)models Labelre-use (LR): Let c M (NC) LR , c M (NC) ( b A (D +I) 1 )Y [train] ( b A (D +I) 1 ) 2 Y [train] . This follows the motivation of Wang and Leskovec [2020], Huang et al. [2021b], Wang [2021] and their 107 empirical results on ogbn-arxiv dataset, whereY [train] 2R ny contains one-hot vectors at rows corre- sponding to labeled nodes but contain zero vectors for unlabeled (test) nodes. 
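A minimal sketch of this label re-use construction is given below (assuming a simple row-normalized adjacency; the exact normalization and concatenation in the design matrix above may differ, and the zeroed diagonal that prevents label leakage is discussed next):

import numpy as np
import scipy.sparse as sp

def label_reuse_columns(A, Y_train):
    """Sketch of label re-use (LR): propagate one-hot training labels through a
    normalized adjacency whose diagonal is zeroed, so node i never sees its own
    label; returns 1-hop and 2-hop label features to append to the design matrix."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    P = sp.diags(1.0 / (deg + 1.0)) @ A      # roughly (D + I)^-1 A
    P = P - sp.diags(P.diagonal())           # zero the diagonal (no self-label)
    Y1 = P @ Y_train                         # 1-hop propagated labels
    Y2 = P @ Y1                              # 2-hop propagated labels
    return np.hstack([Y1, Y2])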
Our scheme is similar to concatenatingY [train] intoX, but with care to prevent label leakage from rowi ofY to rowi of c M, as we zero-out the diagonal of the adjacency multiplied byY [train] . Pseudo-Dropout (PD): Dropout [Srivastava et al., 2014] reduces overtting of models. It can be related todataaugmentation, as each example is presented multiple times. At each time, it appears with a dierent set of dropped-out features – input or latent feature values, chosen at random, get replaced with zeros. As such, we can replicate the design matrix as: c M > c M > PD( c M) > . This row-wise concatenation maintains the width of c M and therefore the number of model parameters. In the above add-ons, concatenations, as well as PD, can be implicit or explicit (see §A.3). 8.3 Analysis&Discussion Theorem8. (Non-linear init) The initialization Eq. 8.5&8.6 yieldsH (NC) linearized =H via Eq. 8.5&8.6 . Theorem 8 implies that the deep (nonlinear) model is the same as the linear model, at the initialization of (per Eq. 8.5&8.6, using c W as Eq. 7.8). Cross-entropy objective can then ne-tune. This end-to-end process, of (i) computing SVD bases and (ii) training the networkf on singular values, advances SOTA on competitve benchmarks, with (i) converging (quickly) to a unique solution and (ii) containing merely a few parameters – see §8.4. Proof. of Theorem 8: The layer-to-layer “positive” and “negative” weight matrices are initialized as: W (l) (p) =W (l) (n) = I. Therefore, at initialization: 108 H (l+1) = h b AH (l) W (l) (p) i + h b AH (l) W (l) (n) i + = h b AH (l) i + h b AH (l) i + =1 [ b AH (l) 0] b AH (l) 1 [ b AH (l) 0] b AH (l) =1 [ b AH (l) 0] b AH (l) +1 [ b AH (l) 0] b AH (l) = 1 [ b AH (l) 0] +1 [ b AH (l) 0] b AH (l) = b AH (l) ; where the rst line comes from the initialization; the second line is an alternative denition of relu: the indicator function1 is evaluated element-wise and evaluates to 1 in positions its argument is true and to 0 otherwise; the third line absorbs the two negatives into a positive; the fourth by factorizing; and the last by noticing that exactly one of the two indicator functions evaluates to 1 almost everywhere, except at the boundary condition i.e. at locations where b AH (l) = 0 but there the right-term makes the Hadamard product 0 regardless. It follows that, sinceH (0) = X, thenH (1) = b AX, H (2) = b A 2 X, ..., H (L) = b A L X. The layer-to-output positive and negative matrices are initialized as:W (l) (op) =W (l) (on) = c W [dl :d(l+1)] . Therefore, at initialization, the nal output of the model is: H = l=L X l=0 h H (l) W (l) (op) i + h H (l) W (l) (on) i + = l=L X l=0 h H (l) c W [dl :d(l+1)] i + h H (l) c W [dl :d(l+1)] i + = l=L X l=0 H (l) c W [dl :d(l+1)] ; where the rst line comes from the denition and the initialization. The second line can be arrived by following exactly the same steps as above: expanding the re-writing the ReLu using indicator notation, absorbing the negative, factorizing, then lastly unioning the two indicators that are mutually exclusive 109 Table 8.1: Dataset Statistics Dataset Nodes Edges Source Task ogbn-ArXiv 169,343papers 1,166,243citations OGB SSC ogbl-DDI 4,267drugs 1,334,889interactions OGB LP almost everywhere. 
Finally, the last summation can be expressed as a block-wise multiplication between two (partitioned) matrices: H = H (0) H (1) ::: H (L) 2 6 6 6 6 6 6 6 6 6 6 4 c W [0 :d] c W [d : 2d] ::: c W [dL :d(L+1)] 3 7 7 7 7 7 7 7 7 7 7 5 = X b AX ::: b A L X c W = c M (NC) c W =H (NC) linearized 8.4 Applications&Experiments We experiment on two datasets from Stanford’s Open Graph Benchmark [OGB, Hu et al., 2020], summa- rized in Table 8.1. Both datasets contain >1 million edges. We use the ocial train-test-validation splits and evaluator of OGB. We attempt LP and SSC, respectively, on Drug Drug Interactions (ogbl-DDI) and ArXiv citation network (ogbn-ArXiv). For these larger datasets, we use the SVD basis as an initialization that we netune, as described in §8.2. For time comparisons, we train all models on a Tesla K80 GPU. 8.4.1 ExperimentsonStanford’sOGBdatasets We summarize experiments ogbl-DDI and ogbn-ArXiv, respectively, in Tables 8.2 and 8.3. Forbaselines, we copy numbers from the public leaderboard, where the competition is erce. We then follow links 110 Table 8.2: Test Hits@20 for link prediction over Drug-Drug Interactions Network (ogbl-ddi). Graphdataset: ogbl-DDI Baselines: HITS@20 tr.time DEA+JKNet [Yang et al., 2021] 76.722.65 (60m) LRGA+n2v [Hsu and Chen, 2021] 73.858.71 (41m) MAD [Luo et al., 2021] 67.812.94 (2.6h) LRGA+GCN [Puny et al., 2020] 62.309.12 (10m) GCN+JKNet [Xu et al., 2018] 60.568.69 (21m) Ourmodels: HITS@20 tr.time (a) iSVD 100 ( c M (NE) ) (§7.3.1, Eq. 7.5) 67.860.09 (6s) (b) + netunef ;s (S) (§8.2.1, Eq. 8.1 & 8.2; sets = 1:15) 79.090.18 (24s) (c) + updateU,S,V (on validation, keepsf ;s xed) 84.090.03 (30s) on the leaderboard to download author’s code, that we re-run, to measure the training time. For our models on ogbl-DDI, we (a) rst calculate U;S;V SVD 100 ( c M (NE) ) built only from training edges and score test edge (i;j) usingU > i SV j . Then, we (b) then netunef ;s (per §8.2.1, Eq. 8.1 & 8.2)) for onlyasingleepoch and score usingf ;s (i;j). Then, we (c) update the SVD basis to include edges from the validation partition and also score usingf ;s (i;j). We report the results for the three steps. For the last step, the rules of OGB allows using the validation set for training, but only after the hyperparameters are nalized. The SVD has no hyperparameters (except the rank, which was already determined by the rst step). More importantly, this simulates a realistic situation: it is cheaper to obtain SVD (of an implicit matrix) than back-propagate through a model. For a time-evolving graph, one could run the SVD more often than doing gradient-descent epochs on a model. Forourmodelsonogbn-ArXiv, we (a) compute U;S;V SVD 250 ( c M (NC) LR ) where the implicit matrix is dened in §8.2.3. We (b) repeat this process where we replicate c M (NC) LR once: in the second replica, we replace theY [train] matrix with zeros (as-if, we drop-out the label with 50% probability). We (c) repeat the process where we concatenate two replicas of c M (NC) LR into the design matrix, each with dierent dropout seed. We (d) ne-tune the last model over 15 epochs using stochastic GTTF [Markowitz et al., 2021]. Discussion: Our method competes or sets SOTA, while training much faster. 111 Table 8.3: Test classication accuracy over ogbn-arxiv. 
Graphdataset: ogbn-ArXiv Baselines: accuracy tr.time GAT+LR+KD [Ren, 2020] 74.160.08 (6h) GAT+LR [Niu, 2020] 73.990.12 (3h) AGDN [Sun and Wu, 2020] 73.980.09 (50m) GAT+C&S [Huang et al., 2021b] 73.860.14 (2h) GCNII [Chen et al., 2020] 72.740.16 (3h) Ourmodels: accuracy tr.time (a) iSVD 250 ( c M (NC) LR ) (§7.3.2, Eq. 7.8) 68.900.02 (1s) (b) + dropout(LR) (§8.2.3) 69.340.02 (3s) (c) + dropout( c M (NC) LR ) (§8.2.3) 71.950.03 (6s) (d) + netuneH (§8.2.2, Eq. 8.3-8.6) 74.140.05 (2m) 8.5 Relatedwork Applications: SVD was used to project rows & columns of matrixM onto an embedding space.M can be the Laplacian of a homogenous graph [Belkin and Niyogi, 2003], Adjacency of user-item bipartite graph [Koren et al., 2009], or stats [Deerwester et al., 1990, Levy and Goldberg, 2014] for word-to-document. We dier: ourM is a function of (leaf) matrices – useful whenM is expensive to store (e.g., quadratic). While Qiu et al. [2018] circumvents this by entry-wise sampling the (otherwisen 2 -dense)M, our SVD implementation could decompose exactlyM without calculating it. SymbolicSoftwareFrameworks: including Theano [Al-Rfou et al., 2016], TensorFlow [Abadi et al., 2016] and PyTorch [Paszke et al., 2019], allow chaining operations to compose a computation (directed acyclic) graph (DAG). They can eciently run the DAG upwards by evaluating (all entries of) matrix M2R rc at any DAG node. Our DAG diers: instead of calculatingM, it provides product function u M (:) = M:, a function that calculates matrix multiplication of M and its input – The graph is run downwards (reverse direction of edges). Matrix-freeSVD: For many matrices of interest, multiplying againstM is computationally cheaper than explicitly storingM entry-wise. As such, many researchers implement SVD(u M ), e.g. Calvetti et al. 112 [1994], Wu and Simon [2000], Bose et al. [2019]. We dier in the programming exibility: the earlier methods expect the practitioner to directly implement u M . On the other hand, our framework allows composition ofM via operations native to the practitioner (e.g., @, +, concat), andu M is automatically dened. Fast Graph Learning: We have applied our framework for fast graph learning, but so as sampling- based approaches including [Chen et al., 2018b, Chiang et al., 2019, Zeng et al., 2020, Markowitz et al., 2021]. Our work diers as it can obtain an initial closed-form solution (very quickly) and can be ne- tuned afterwards using any of the aforementioned approaches. Additionally, application of implicit-SVD over GRL shows one general application. Our framework might be useful in other areas utilizing SVD, beyond the GRL domain. 113 PartIV-Scalingtolargegraphs While the previous Parts (II & III) focused their evaluation metrics on smaller datasets, where the graph dataset ts in memory of a single machine, this Part of the thesis enables training of GNNs on very large graphs: ones that can be larger than the memory of a single machine. Chapter 9 (GTTF) describes a general algorithm for repeatedly sampling subgraphs from graphs, and feeding these samples to accumulate & bias functions that can be specialized to recover a variety of graph learning algorithms. Overall, since the algorithm is described using standard tensor operations, its scala- bility is inherited from the underlying compute engine. Specically, GTTF is implemented in TensorFlow, which is scalable (e.g., by using partitioned variables). Chapter 10 re-implements Chapters 7 & 8 in dis- tributed settings. 
Specically, it is assumed that the input graph is partitioned across many les. It is also assumed that multiple worker machines are present, which computation is distributed across. 114 Chapter9 GraphTraversalwithTensorFunctionals: Meta-AlgorithmforScalable Learning Graph Representation Learning (GRL) methods have impacted elds from chemistry to social science. However, their algorithmic implementations are specialized to specic use-cases e.g. message passing methods are run dierently fromnodeembedding ones. Despite their apparent dierences, all these meth- ods utilize the graph structure, and therefore, their learning can be approximated with stochastic graph traversals. This chapter presents Graph Traversal via Tensor Functionals (GTTF), a meta-algorithm frame- work for unifying the implementation of diverse graph algorithms and enabling transparent and ecient scaling to large graphs. The contents of this chapter can be found standalone in [Markowitz et al., 2021] and the source-code for reproducing the results is available on https://github.com/isi-usc-edu/gttf GTTF is built using a data structure (that can be stored as a sparse tensor). It implements a stochastic graph traversal algorithm (that can be described using tensor operations). The algorithm accepts two functions as input (therefore, we refer to it as a functional). It can be specialized to a variety of GRL models and objectives, simply by changing those two functions. We consider popular GNN models and we analyze their learning signal. We show that GTTF can recover the learning of the considered GNN models in an unbiased fashion: in expectation, approximates the learning as if the specialized implementations were run directly. With these capabilities, re-implementing state-of-the-art methods under the GTTF framework 115 allows them to automatically scale to large graph datasets, with only a handful of lines of code for each method specialization. 9.1 GTTF:Intuition Our algorithm (abbreviated GTTF, Section 9.2.2), receives graphs as input, traverses them using ecient tensor ∗ operations, and invokes specializable functions during the traversal. We show function special- izations for recovering popular GRL methods (Section 9.2.3). Moreover, since GTTF is stochastic, these specializations automatically scale to arbitrarily large graphs, without careful derivation per method. Im- portantly, such specializations, in expectation, recover unbiased gradient estimates of the objective w.r.t. model parameters. GTTF uses a data structure b A (Compact Adjacency, Section 9.2.1): a sparse encoding of the adjacency matrix. Nodev contains its neighbors in row b A[v], b A v , notably, in the rstdegree(v) columns of b A[v]. This encoding allows stochastic graph traversals using standard tensor operations. GTTF is a functional, as it accepts functionsAccumulateFn andBiasFn, respectively, to be provided by each GRL specialization to accumulate necessary information for computing the objective, and optionally to parametrize sampling procedurep(v’s neighborsjv). The traversal internally constructs awalkforest as part of the computation graph. Figure 9.1 depicts the data structure and the computation. From a generalization perspective, GTTF shares similarities with Dropout [Srivastava et al., 2014]. We discuss existing alternatives to GTTF on the next page. 
∗ To disambiguate: by tensors, we refer to multi-dimensional arrays, as used in Deep Learning literature; and by operations, we refer to routines such as matrix multiplication, advanced indexing, etc 116 0 1 2 3 4 (a) Example graphG 2 6 6 6 6 4 1 1 1 1 1 1 1 1 1 1 3 7 7 7 7 5 (b) Adjacency matrix for graphG 2 6 6 6 6 4 1 0 2 3 4 1 1 4 1 3 3 7 7 7 7 5 2 6 6 6 6 4 1 4 1 2 2 3 7 7 7 7 5 (c) CompactAdj for G with sparse b A2Z nn and dense d2Z n . We store IDs of ad- jacent nodes in b A 0 1 1 2 4 2 0 4 1 3 0 2 1 4 (d) Walk Forest. GTTF invokes AccumulateFn once per (green) instance. Figure 9.1: Graph data structure (a & b) and the application of our traversal algorithm on it (c & d) Table 9.1: Related work. Method Family Scale Learning Models GCN, GAT MP 7 exact node2vec NE 3 approx WYS NE 7 exact Stochastic Sampling Methods SAGE MP 3 approx FastGCN MP 3 approx LADIES MP 3 approx GraphSAINT MP 3 approx CluterGCN MP 3 heuristic Software Frameworks PyG Both inherits / re- DGL Both implements Algorithmic Abstraction (ours) GTTF Both 3 approx Models for GRL have been proposed, including message passing (MP) algorithms, such as Graph Convolutional Network (GCN) [Kipf and Welling, 2017], Graph Attention (GAT) [Veličković et al., 2018]; as well as node embedding (NE) algorithms, including node2vec [Grover and Leskovec, 2016], WYS [Abu-El-Haija et al., 2018]; among many others [Xu et al., 2018, Wu et al., 2019, Perozzi et al., 2014]. The full- batch GCN of Kipf and Welling [2017], which drew recent attention and has motivated many MP algorithms, was not initially scalable to large graphs, as it processes all graph nodes at every training step. To scale MP methods to large graphs, researchers proposedStochas- tic Sampling Methods that, at each training step, assemble a batch constituting subgraph(s) of the (large) input graph. Some of these sam- pling methods yield unbiased gradient estimates (with some variance) including SAGE [Hamilton et al., 2017], FastGCN [Chen et al., 2018b], LADIES [Zou et al., 2019], and GraphSAINT [Zeng et al., 2020]. On the other hand, ClusterGCN [Chiang et al., 2019] is a heuristic in the sense that, despite its good performance, it provides no guarantee of unbiased gradient estimates of the full-batch learning. Gilmer et al. 117 [2017] and Chami et al. [2022] generalized many GRL models into Message Passing and Auto-Encoder frameworks. These frameworks prompt bundling of GRL methods under Software Libraries, like PyG [Fey and Lenssen, 2019] and DGL [Wang et al., 2019], oering consistent interfaces on data formats. We now position our contribution relative to the above. Unlike generalized message passing [Gilmer et al., 2017], rather than abstracting the model computation, we abstract the learning algorithm. As a result, GTTF can be specialized to recover the learning of MP as well as NE methods. Morever, unlike Software Frameworks, which are re-implementations of many algorithms and therefore inherit the scale and learn- ing of the copied algorithms, we re-write the algorithms themselves, giving them new properties (memory and computation complexity), while maintaining (in expectation) the original algorithm outcomes. Fur- ther, while the listed Stochastic Sampling Methods target MP algorithms (such as GCN, GAT, alike), as their initial construction could not scale to large graphs, our learning algorithm applies to a wider class of GRL methods, additionally encapsulating NE methods. 
Finally, while some NE methods such as node2vec [Grover and Leskovec, 2016] and DeepWalk [Perozzi et al., 2014] are scalable in their original form, their scalability stems from their multi-step process: sample many (short) random walks, save them to desk, and then learn node embeddings using positional embedding methods (e.g., word2vec, Mikolov et al. [2013]) – they are sub-optimal in the sense that their rst step (walk sampling) takes considerable time (before train- ing even starts) and also places an articial limit on the number of training samples (number of simulated walks), whereas our algorithm conducts walks on-the-y whilst training. 9.2 GraphTraversalviaTensorFunctionals(GTTF):Derivation 9.2.1 DataStructure We now describe a data structure which we coin CompactAdj (for “Compact Adjacency”, Figure 9.1c). It consists of two tensors: 118 1. d2Z n , a dense out-degree vector (gure 9.1c, right) 2. b A2Z nn , a sparse edge-list matrix in which the rowu contains left-alignedd u non-zero values. The consecutive entriesf b A[u; 0]; b A[u; 1];:::; b A[u;d u 1]g are set to the node IDs receiving an edge fromu. The remainingnd u are left unset, therefore, b A only occupiesO(m) memory when stored as a sparse matrix (Figure 9.1c, left). The design of CompactAdj allows concise description of stochastic traversals using standard tensor oper- ations. For instance, given a nodeu2V, to choose one of its neighbors uniformly at random, one can drawrU[0::( u 1)], then get the random neighbor ID by indexing: b A[u;r]. This can be generalized to vectorized form – i.e., given node batchBV and access to continuousU[0; 1), we sample neighbors for each node inB as:RU[0; 1) b , whereb =jBj, thenB 0 = b A[B;bRd[B]c] is ab-sized vector, with B 0 u containing a neighbor ofB u , oor operationb:c is applied element-wise, and is Hadamard product. 119 Algorithm8 Stochastic Traverse Functional, parametrized byAccumulateFn andBiasFn. Require: u (current node);T [] (path leading tou, starts empty);F (list of fanouts); AccumulateFn (function: with side-eects and no return. It is model-specic and records information for computing model and/or objective, see text); BiasFn U (function mappingu to distribution onu’s neighbors, defaults to uniform) 1: functionTraverse(T ,u,F ,AccumulateFn,BiasFn) 2: ifF.size() = 0then return # Base case. Traversed up-to requested depth 3: f F.pop() # fanout duplication factor (i.e. breadth) at this depth. 4: sample_bias BiasFn(T, u) 5: if sample_bias.sum() = 0then return # Special case. No sampling from zero mass 6: sample_bias sample_bias / sample_bias.sum() # valid distribution 7: K Sample( b A [u; : u ]];sample_bias;f) # Samplef nodes fromu’s neighbors 8: fork 0tof 1 do 9: T next concatenate(T; [u]) 10: AccumulateFn(T next ;K[k];f) 11: Traverse(T next ;K[k];f;AccumulateFn;BiasFn) # Recursion 12: functionSample(N,W ,f) 13: C tf.cumsum(W ) # Cumulative sum. Last entry must = 1. 14: coin_flips tf.random.uniform((f; ); 0; 1) 15: indices tf.searchsorted(C,coin_flips) 120 9.2.2 LearningbyProvidingFunctionstoaStochasticTraversalAlgorithm GTTF implements a stochastic graph traversal algorithm. It is similar to a random walk: where a walker starts at a graph node, and transitions to a random next node, repeatedly so, for a number of steps. The dierences between GTTF’s stochastic traversal and vanilla random walks are: 1. Given a start node, while vanilla random walks produce a sequence of node IDs, GTTF’s traversal produces a tree. Specically, the walker replicates itself ateverynode. 
We refer to the number of replicas as fanout. Setting the fanout to 1 reduces the tree into a sequence. Nonetheless, we show that the variance in estimating the learning signal can be reduced by setting fanout to> 1. 2. The algorithm is a functional: rather thanreturnthewalks, the algorithm accepts two functions: AccumulateFn andBiasFn that will be called on the tree nodes along the traversal. From an engi- neering perspective, the functional (a.k.a, callbacks) design pattern allows us to reproduce a variety of graph learning algorithms, with just a few lines per method. 3. BiasFn allows customizing the random walk: for instance, it can be used to induce a custom dis- tribution for node sampling (e.g. jump to a node with probability proportional to a precomputed quantity such as the node’s degree degree), and can be used to sample without replacement. Algorithm 8 lists our functional Traverse. The algorithm accepts node † u2V: the starting position of the random walk i.e. root of the walk tree (Figure 9.1d). The depth and width of the walk tree are determined by fanout listF : the length of the list equals the tree depth. Each item in the list must be an integer, which will be the the branching factor at each walk step. For instance, ifF = [3; 5] and given node batchBV, then the functionalTraverse samples 3 neighbors peru2B, then 5 neighbors for each of those. At each step, to determine the list of sampled nodes, the functional will invoke the providedBiasFn. On each sampled node, it invokesAccumulateFn. Next, we dene these two functions concretely. † Our pseudo-code accepts a single node and creates a walk tree. However, for eciency, our implementation is vectorized as it accepts a batch of nodes and creates a walk forest. 121 9.2.2.1 AccumulateFnandBiasFn • AccumulateFn should be provided by the caller as a function to track necessary information for estimating the learning signal. It can either directly estimate the gradients of the objective w.r.t. model parameters, or assemble a subgraph patch upon which the model can be invoked. For in- stance, an implementation of DeepWalk [Perozzi et al., 2014] on top of GTTF (§9.2.3.2), specializes AccumulateFn to estimate the (sampled softmax) likelihood of nodes’ positional distribution, mod- eled as a dot-prodct of node embeddings. On the other hand, GCN [Kipf and Welling, 2017] on top of GTTF (§9.2.3.1) implements itAccumulateFn to accumulate asampleoftheadjacencymatrix, which it passes to the underlying model (e.g. 2-layer GCN) as if this were the full adjacency matrix. • BiasFn should be provided by the caller as a function that to customize the sampling procedure for the stochastic transitions. If not given, it defaults to uniform distribution. If provided, it must yield a probability distribution over nodes, given the current node and the path that lead to it. It can be dened to read edge weights, if they denote importance, or more intricately, used to parameterize a second order Markov Chain like [Grover and Leskovec, 2016], or use neighborhood attention to guide sampling [Veličković et al., 2018]. 9.2.3 SpecializingAccumulateFn&BiasFntorecoverAlgorithmsforGRL This section explains how one can implement the functions to receover the learning algorithms of various GRL methods. A practitioner, wishing to train a graph model (e.g. link prediction or node classication) on a graph can implement a pipeline according as: 1. Construct model & initialize parameters (e.g. to random). DeneAccumulateFn andBiasFn. 2. Repeat (many rounds): i. 
Reset accumulation information (from previous round) and then sample batchBV. 122 ii. InvokeTraverse on (B,AccumulateFn,BiasFn), which invokes theFn’s, allowing the rst toaccumu- late information sucient for running the model and estimating an objective. iii. Use accumulated information to: run model, estimate objective, apply learning rule (e.g. SGD). The following subsections explain how the above pipeline can be utilized to recover various methods. 9.2.3.1 MessagePassing: GraphConvolutionalvariants Graph Convolutional Network (GCN) of [Kipf and Welling, 2017] and its extensions, including [Hamilton et al., 2017, Xu et al., 2018, Wu et al., 2019, Abu-El-Haija et al., 2019], require the an adjacency matrix and a node features matrix torunthemodel. We propose to run their model using asampleoftheadjacency matrix. We show in §9.3 that such adjacency samples provide unbiased estimates of the learning signal. When training GCN-variants with GTTF on node classication tasks, at each iteration, one can sample a batch of labeled nodesBV, then initialize e A2f0; 1g nn to an empty sparse matrix. The sparse matrix e A will be populated by the AccumulateFn function. Traverse (Algorithm 8) can be invoked withu =B; F to list of fanouts with sizeh;AccumulateFn andBiasFn set to: def RootedAdjAcc(T;u;f): e A[u;T 1 ] 1; (9.1) def NoRevisitBias(T;u): return1[ e A[u].sum() = 0] ~ 1 du d u ; (9.2) where ~ 1 n is ann-dimensional all-ones vector, and negative indexingT k is thek th last entry ofT . If a node has been visited through the stochastic traversal, then it already has fanout number of neighbors andNoRevisitBias ensures it does not get revisited for eciency, per line 5 of Algorithm 8. Afterwards, 123 the accumulated stochastic e A will be fed ‡ into the underlying model e.g. for a 2-layer GCN of Kipf and Welling [2017]: GCN( e A;X;W 1 ;W 2 ) = softmax( A ReLu( AXW 1 )W 2 ); (9.3) with A =D 01=2 e D 01 e D 0 D 01=2 ; D 0 = diag(d 0 ); d 0 = ~ 1 > n A 0 ; renorm trick z }| { e A 0 =I n + e A Lastly, the length of the fanout listF should be set to the receptive eld required by the model. For example, if the model is a 2-layer GCN model, then settingF with 2 entries is appropriate. However, for example, running 2-layer MixHop model (Abu-El-Haija et al. [2019], Chapter 5) where each layer considers up-to the 3 rd power of the adjacency matrix, then the fanout should be set to 6 entries. 9.2.3.2 NodeEmbeddings Given a batch of nodesBV, DeepWalk can be implemented in GTTF by rst initializing lossL to the contrastive term estimating the partition function of log-softmax: L X u2B log E vPn(B) [exp(hZ u ;Z v i)]; (9.4) whereh:;:i is dot-product notation, Z2 R nd is the trainable embedding matrix with Z u 2 R d a d- dimensional embedding for node u2V. In our experiments, we estimate the expectation by taking 5 samples and we set the negative distributionP n (v)/d 3 4 v , following Mikolov et al. [2013]. The functional is invoked with noBiasFn andAccumulateFn = def DeepWalkAcc(T;u;f): L L * Z i ; C T X k=1 [T k ] Ck+1 C Z T k + ; [u] [T 1 ] f ; (9.5) ‡ Before feeding the batch to model, in practice, we nd nodes not reached by traversal and remove their corresponding rows (and also columns) fromX (andA). 
124 where hyperparameterC indicates maximum window size [inherited from word2vec, Mikolov et al., 2013], constantC T , min(C;T .size) keeps summation variablek valid entries ofT , the scalar fraction Ck+1 C is inherited from context sampling of word2vec [Section 3.1 in Levy et al., 2015], and rederived for graph context by Abu-El-Haija et al. [2018] (Chapter 6), and [u] stores a scalar per node on the traversal Walk Forest, which defaults to 1 for non-initialized entries, and is used as a correction term. DeepWalk con- ducts random walks (visualized as a straight line) whereas our walk tree has a branching factor off, when F = [f;f;f;::: ]. Setting fanoutf = 1 recovers DeepWalk’s simulation, though we foundf > 1 outper- forms DeepWalk within fewer iterations e.g. f = 5, within 1 epoch, outperforms DeepWalk’s published implementation. AfterTraverse concludes (which invokesAccumulateFn per node on the walk forest), a learning step can be taken on the accumulatedL as:Z Zr Z L; with suitable step size> 0. 9.3 TheoreticalAnalysis 9.3.1 Estimatingk th poweroftransitionmatrix We show that it is possible with GTTF to accumulate an estimate of transition matrixT to powerk. Let denote the walk forest generated by GTTF, (u;k;i) denotes thei th node in the vector of nodes at depth k of the walk tree rooted atu2B, andt u;v;k i denotes the indicator random variable1[ (u;k;i) =v]. Let the estimate of thek th power of the transition matrix be denoted b T k . Entry b T k u;v should be an unbiased estimate ofT k u;v foru2B, with controllable variance. We dene: b T k u;v = P f k i=1 t u;v;k i f k (9.6) The fraction in Equation 9.6 counts the number of times the walker starting atu visitsv in , divided by the total number of nodes visited at thek th step fromu. Proposition1. (UnbiasedTk) b T k u;v as dened in Equation 9.6, is an unbiased estimator ofT k u;v 125 Proposition2. (VarianceTk) Variance of our estimate is upper-bounded: Var[ b T k u;v ] 1 4f k Naive computation of k th powers of the transition matrix can be eciently computed via repeated sparse matrix-vector multiplication. Specically, each column ofT k can be computed inO(mk), wherem is the number of edges in the graph. Thus, computingT k in its entirety can be accomplished inO(nmk). However, this can still become prohibitively expensive if the graph grows beyond a certain size. GTTF on the other hand can estimateT k in time complexity independent of the size of the graph. (Prop. 8), with low variance. Transition matrix, raised to a power, is useful for many GRL methods [Qiu et al., 2018, Abu-El-Haija et al., 2018] 9.3.2 UnbiasedLearning As a consequence of Propositions 1 and 2, GTTF enables unbiased learning with variance control for classes of node embedding methods, and provides a convergence guarantee for graph convolution models under certain simplifying assumptions. We start by analyzing node embedding methods. Specically, we cover two general types. The rst is based on matrix factorization of the power-series of transition matrix. and the second is based on cross- entropy objectives, e.g., like DeepWalk [Perozzi et al., 2014], or node2vec [Grover and Leskovec, 2016], These two are shown in Proposations 3 and 4 Proposition 3. (UnbiasedTFactorization) SupposeL =jjLR P k c k T k jj 2 F , i.e. factorization objective that can be optimized by gradient descent by calculatingr L;R L, wherec k ’s are scalar coecients. Let its estimate b L =jjLR P k c k b T k jj 2 F ,where b T isobtainedbyGTTFaccordingtoEquation9.6. ThenE[r L;R b L] = r L;R L. 
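Propositions 1 and 2 (and hence Proposition 3) hinge on the estimator of Equation 9.6. As a concrete illustration, the following minimal NumPy sketch (a toy stand-in, not GTTF's vectorized implementation) grows walk trees of fanout f and depth k rooted at a node u, averages the single-forest estimator of Equation 9.6 over many forests, and compares the result against the exact row of T^k:

import numpy as np

# Toy graph (assumption: small and dense only for clarity; GTTF itself uses CompactAdj).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
T = A / A.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
n = T.shape[0]
rng = np.random.default_rng(0)

def estimate_Tk_row(u, k, f, num_forests=2000):
    """Average of the single-forest estimator in Eq. 9.6 for row u of T^k."""
    counts = np.zeros(n)
    for _ in range(num_forests):
        frontier = [u]
        for _ in range(k):                     # grow the walk tree depth-by-depth
            nxt = []
            for v in frontier:
                nxt.extend(rng.choice(n, size=f, p=T[v]))   # f children per node
            frontier = nxt
        for v in frontier:                     # nodes reached at depth k
            counts[v] += 1
    return counts / (num_forests * f ** k)     # divide by total number of nodes at depth k

u, k, f = 0, 2, 3
print(np.round(estimate_Tk_row(u, k, f), 3))           # stochastic estimate of row u of T^k
print(np.round(np.linalg.matrix_power(T, k)[u], 3))    # exact row u of T^k

Increasing the fanout f shrinks the gap between the two printed rows for a fixed number of forests, in line with the variance bound of Proposition 2.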
Proposition 4. (UnbiasedLearnNE) Learning node embeddingsZ2R nd with objective functionL, de- composableasL(Z) = P u2V L 1 (Z;u) P u;v2V P k L 2 (T k ;u;v)L 3 (Z;u;v),whereL 2 islinearoverT k , then using b T k yields an unbiased estimate ofr Z L. 126 Generally,L 1 (andL 3 ) score the similarity between disconnected (and connected) nodesu andv. The above form ofL covers a family of contrastive learning objectives that use cross-entropy loss and assume a logistic or (sampled-)softmax distributions. In Table 5 of our full manuscript, Markowitz et al. [2021], we provide the decompositions for the objectives of DeepWalk [Perozzi et al., 2014], node2vec [Grover and Leskovec, 2016] and WYS [Abu-El-Haija et al., 2018]. Proposition5. (UnbiasedMP)Giveninputactivations,H (l1) ,graphconvlayer (l)canuserootedadjacency e AaccumulatedbyRootedAdjAcc(9.1),toprovideunbiasedpre-activationoutput,i.e.E h A k H (l1) W (l) i = D 0 1=2 A 0 D 0 1=2 k H (l1) W (l) , withA 0 andD 0 dened in (9.3). Proposition6. (UnbiasedLearnMP) If objective to a graph convolution model is convex and Lipschitz con- tinous, with minimizer , then utilizing GTTF for graph convolution converges to . 9.3.3 ComplexityAnalysis Proposition7. Storage complexity of GTTF isO(m +n). Proposition8. Time complexity of GTTF isO(bf h ) for batch sizeb, fanoutf, and depthh. Proposition 8 implies the speed of computation is irrespective of graph size. Methods implemented in GTTF inherit this advantage. For instance, the node embedding algorithm WYS [Abu-El-Haija et al., 2018] isO(n 3 ), however, we apply its GTTF implementation on large graphs. 9.4 Experiments We evaluate our GTTF functional, Traverse, Algorithm 8, if it is able to re-implement graph learning algorithms while maintaining performance. Here, a learning algorithm refers to (1) the model and (2) the training procedure. We show that Implementing in GTTF, model & its training, (e.g., of WYS [Abu-El-Haija 127 et al., 2018], Chapter 6), by choosing appropriate AccumulateFn and BiasFn, (e.g., yieldingF(WYS)), should • produce a trained model that matches empirical performance. • train faster than baselines • automatically scaling to large graphs. 9.4.1 NodeEmbeddingsforLinkPrediction We run baselines of WYS [Abu-El-Haija et al., 2018, Chapter 6] and DeepWalk [Perozzi et al., 2014] using code of author’s, and re-implement them in GTTF (by choosingAccumulateFn per §9.2.3.2), abbreviated asF(WYS) andF(DeepWalk). Table 9.3 summarizes link prediction test performance. Note that the origial implentation of WYS [Chapter 6 Abu-El-Haija et al., 2018] does not scale to large datasets of LiveJournal and Reddit. However, scalable implementationF(WYS) sets new state-of-the-art on these datasets. 9.4.2 MessagePassingforNodeClassication This section compares baseline models GCN [Kipf and Welling, 2017], SAGE [Hamilton et al., 2017], Mix- Hop [Abu-El-Haija et al., 2019], SimpleGCN [Wu et al., 2019], GAT [Veličković et al., 2018] and GCNII [Chen et al., 2020], against their GTTF implementations, respectively as,F(GCN),F(SAGE),F(MixHop), F(SimpleGCN),F(GAT), andF(GCNII). Per results in Tables 9.4 (left and middle), GTTF implementations matches the baselines performance. 9.4.3 ExperimentscomparingagainstSamplingmethodsforMessagePassing Some methods proposed sampling graph to draw stochastic samples for the training algorithm, where each sample is a subgraph, e.g. like GraphSAINT and ClusterGCN. This section compares the computational 128 Table 9.2: Dataset summary. 
Tasks are LP, SSC, FSC, for link prediction, semi- and fully-supervised clas- sication. Split indicates the train/validate/test paritioning, with (a) = [Abu-El-Haija et al., 2018], (b) = to be released, (c) = [Hamilton et al., 2017], (d) = [Yang et al., 2016a]; (e) = [Hu et al., 2020]. Dataset Split #Nodes #Edges #Classes Nodes Edges Tasks PPI (a) 3,852 20,881 N/A proteins interaction LP ca-HepTh (a) 80,638 24,827 N/A researchers co-authorship LP ca-AstroPh (a) 17,903 197,031 N/A researchers co-authorship LP LiveJournal (b) 4.85M 68.99M N/A users friendship LP Reddit (c) 233,965 11.60M 41 posts user co-comment LP/FSC Amazon (b) 2.6M 48.33M 31 products co-purchased FSC Cora (d) 2,708 5,429 7 articles citation SSC CiteSeer (d) 3,327 4,732 6 articles citation SSC PubMed (d) 19,717 44,338 3 articles citation SSC Products (e) 2.45M 61.86M 47 products co-purchased SSC Table 9.3: Results of node embeddings on Link Prediction. Left: Test ROC-AUC scores. Right: Mean Rank on the right for consistency with Lerer et al. [2019]. *OOM = Out of Memory. PPI HepTh Reddit DeepWalk 70.6 91.8 93.5 F(DeepWalk) 87.9 89.9 95.5 WYS 89.8 93.6 OOM F(WYS) 90.5 93.5 98.6 LiveJournal DeepWalk 234.6 PBG 245.9 WYS OOM* F(WYS) 185.6 Table 9.4: Node classication tasks. Left: test accuracy scores on semi-supervised classication (SSC) of citation networks. Middle: test micro-F1 scores for large fully-supervised classication. Right: test accuracy on an SSC task, showing only scalable baselines. Highest value per column is bolded. Cora Citeseer Pubmeb GCN 81.5 70.3 79.0 F(GCN) 81.9 69.8 79.4 MixHop 81.9 71.4 80.8 F(MixHop) 83.1 71.8 80.9 GAT* 83.2 72.4 77.7 F(GAT) 83.3 72.5 77.8 GCNII 85.5 73.4 80.3 F(GCNII) 85.3 74.4 80.2 Reddit Amazon SAGE 95.0 88.3 F(SAGE) 95.9 88.5 SimpGCN 94.9 83.4 F(SimpGCN) 94.8 83.8 Products node2vec 72.1 ClusterGCN 75.2 GraphSAINT 77.3 F(SAGE) 77.0 requirement for gTTF and these sampling-based methods. Table 9.4 (right) shows test performance on node classication accuracy on a large dataset: Products. 129 Table 9.5: Performance of GTTF against frameworks DGL and PyG.Left: Speed is the per epoch time in seconds when training GraphSAGE. Memory is the memory in GB used when training GCN. All experi- ments conducted using an AMD Ryzen 3 1200 Quad-Core CPU and an Nvidia GTX 1080Ti GPU. Right: Training curve for GTTF and PyG implementations of Node2Vec. Speed(s) Memory(GB) Reddit Products Reddit Cora Citeseer Pubmed DGL 17.3 13.4 OOM 1.1 1.1 1.1 PyG 5.8 9.2 OOM 1.2 1.3 1.6 GTTF 1.8 1.4 2.44 0.32 0.40 0.43 0 250 500 750 Time (s) 0.6 0.7 0.8 ROC-AUC framework (node2vec) PyG(node2vec) 9.4.4 RuntimeandMemorycomparisonagainstoptimizedSoftwareFrameworks In addition to the accuracy metrics discussed above, we also care about computational performance. This section compares the run-time and memory of established python frameworks, DGL [Wang et al., 2019] and PyG [Fey and Lenssen, 2019], against GTTF implementation of the baseline models. Table 9.5 sum- marizes the results as follows. First (left), shows time-per-epoch on large graphs of their implementation of GraphSAGE, compared with GTTF’s, where we make all hyper parameters to be the same (of model ar- chitecture, and number of neighbors at message passing layers). Second (middle), contains run their GCN implementation of DGL and PyG on small datasets (Cora, Citeseer, Pubmed) to show peak memory usage. 
While the aforementioned two comparisons are on popular message passing methods, the third (right) chart shows a popular node embedding method: node2vec’s link prediction test ROC-AUC in relation to its training runtime. 9.4.5 Hyperparameters Hyperparameters for all experiments are listed in our main paper [Markowitz et al., 2021] and accompa- nying code at https://github.com/isi-usc-edu/gttf. 130 9.5 Proofs All proofs are in the manuscript of [Markowitz et al., 2021] ProofofProposition3 Proof. Consider ad-dimensional factorization of P k c k T k , wherec k ’s are scalar coecients: L = LR X k c k T k 2 F ; (9.7) parametrized byL;R > 2R nd . The gradients ofL w.r.t. parameters are: r L L = LR > X k c k T k ! R > and r R L =L > LR > X k c k T k ! : (9.8) Given estimate objectiveL (replacing b T with using GTTF-estimated b T ): b L = LR X k c k b T k 2 F : (9.9) It follows that: E h r L b L i =E " LR > X k c k b T k ! R > # =E " LR > X k c k b T k !# R > Scaling property of expectation = LR > X k c k E h b T k i ! R > Linearity of expectation = LR > X k c k T k ! R > Proof of Proposition 1 =r L L 131 The above steps can similarly be used to showE h r L b L i =r L L ProofofProposition6 GTTF can be seen as a way of applying dropout [Srivastava et al., 2014], and our proof is contingent on the convergence of dropout, which is shown in Baldi and Sadowski [2014]. Our dropout is on the adjacency, rather than the features. Denote the output of a graph convolution network § withH: H = GCN X (A;W ) =TXW We restrict our analysis to GCNs with linear activations. We are interested in quantifying the change ofH asA changes, and therefore the xed (always visible) featuresX is placed on the subscript. Let e A denote adjacency accumulated by GTTF’sRootedAdjAcc (Eq. 9.1). e H c = GCN X ( e A c ): LetA =f e A c g jAj c=1 denote the (countable) set of all adjacency matrices realizable by GTTF. For the analysis, assume the graph is -regular: the assumption eases the notation though it is not needed. Therefore, degree u = for allu2 V . Our analysis depends ¶ on 1 jAj P e A2A e A/ A. i.e. the average realizable matrix by GTTF is proportional (entry-wise) to the full adjacency. This is can be shown when considering one-row at a time: given node u with u = outgoing neighbors, each of its neighbors has the same appearance probability = 1 u . Summing over all combinations u f , makes each edge appear the same frequency = 1 u jAj, noting thatjAj evenly divides u f for allu2V . We dene a dropout module: d A = jAj X c z c e A c with z Categorical jAj of them z }| { 1 jAj ; 1 jAj ;:::; 1 jAj ! ; (9.10) § The following denition averages the node features (uses non-symmetric normalization) and appears in multiple GCN’s including Hamilton et al. [2017]. ¶ If not-regular, it would be 1 jAj P e A2A e A/D 1 A 132 wherez c acts as Multinoulli selector over the elements ofA, with one of its entries set to 1 and all others to zero. With this denitions, GCNs can be seen in the droupout framework as: e H = GCN X ( d A). Nonetheless, in order to inherit the analysis of [Baldi and Sadowski, 2014, see their equations 140 & 141], we need to satisfy two conditions which their analysis is founded upon: (i) E[GCN X ( d A)] = GCN X (A): in the usual (feature-wise) dropout, such condition is easily veried. (ii) Backpropagated error signal does not vary too much around around the mean, across all realizations of d A. Condition (i) is satised due to proof of Proposition 5. To analyze the error signal, i.e. the gradient of the error w.r.t. 
the network, assume loss functionL(H), outputs scalar loss, is-Lipschitz continuous. The Liptchitz continuity allows us to bound the dierence in error signal betweenL(H) andL( e H): jjr H L(H)r H L( e H)jj 2 2 (a) (r H L(H)r H L( e H)) > (H e H) (9.11) (b) jjr H L(H)r H L( e H)jj 2 jjH e Hjj 2 (9.12) w.p.1 1 Q 2 jjr H L(H)r H L( e H)jj 2 W > X > Q p Var[T ]XW (9.13) = Q 2 p f jjr H L(H)r H L( e H)jj 2 jjWjj 2 1 jjXjj 2 1 (9.14) jjr H L(H)r H L( e H)jj 2 Q 2 p f jjWjj 2 1 jjXjj 2 1 (9.15) where (a) is by Lipschitz continuity, (b) is by Cauchy–Schwarz inequality, “w.p.” means with probability and uses Chebyshev’s inequality, with the following equality because the variance ofT is shown element- wise in proof for Prop. 2. Finally, we get the last line by dividing both sides over the common term. This shows that one can make the error signal for the dierent realizations arbitrarily small, for example, by choosing a larger fanout value or putting (convex) norm constraints onW andX e.g. through batchnorm and/or weightnorm. Since we can haver H L(H)r H L( e H 1 )r H L( e H 2 )r H L( e H jAj ) with high probability, then the analysis of Baldi and Sadowski [2014] applies. Eectively, it can be thought of as 133 an online learning algorithm where the elements ofA are the stochastic training examples and analyzed per [Bottou, 1998, 2004], as explained by Baldi and Sadowski [2014]. 9.6 AdditionalGTTFImplementations 9.6.1 MessagePassingImplementations 9.6.1.1 GraphAttentionNetworks[GAT,Veličkovićetal.,2018] One can implement GAT by following the previous subsection, utilizing AccumulateFn and BiasFn de- ned in (9.1) and (9.2), but just replacing the model (9.3) by GAT’s: GAT( e A;X;A;W 1 ;W 2 ) = softmax((A A) ReLu((A A)XW 1 )W 2 ); (9.16) where is hadamard product andA is annn matrix placing a positive scalar (an attention value) on each edge, parametrized by multi-headed attention described in [Veličković et al., 2018]. However, for some high-degree nodes that put most of the attention weight on a small subset of their neighbors, sampling uniformly (with BiasFn=NoRevisitBias) might mostly sample neighbors with entries inA with value 0, and could require more epochs for convergence. However, our exible functional allows us to propose a sample-ecient alternative, that is in expectation, equivalent to the above: GAT( e A;X;A;W 1 ;W 2 ) = softmax(( p A A) ReLu(( p A A)XW 1 )W 2 ); (9.17) def GatBias(T;u): returnNoRevisitBias(T;u) q A[u; b A[u]]; (9.18) 9.6.1.2 DeepGraphInfomax[DGI,Veličkovićetal.,2019] DGI implementation on GTTF can use AccumulateFn=RootedAdjAcc, dened in (9.1). To create the positive graph: it can sample some nodesBV. It would pass to GTTF’s Traverse B, and utilize the 134 accumulated adjacency b A for running: GCN( b A;X B ) and GCN( b A;X permute ), where the second run ran- domly permutes the order of nodes inX. Finally, the output of those GCNs can then be fed into a readout function which outputs to a descriminator trying to classify if the readout latent vector correspond to the real, or the permuted features. 
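As a sketch of how these pieces could fit together, the following minimal NumPy example mimics the DGI objective on an adjacency sample that stands in for the output of RootedAdjAcc (here a random toy matrix), using a one-layer GCN, a mean read-out, and a bilinear discriminator; names such as gcn and W_disc and the toy sizes are illustrative and not part of GTTF:

import numpy as np

rng = np.random.default_rng(0)
n, d, h = 8, 16, 32                                    # toy sizes (illustrative)
A_hat = (rng.random((n, n)) < 0.3).astype(float)       # stand-in for the adjacency accumulated by RootedAdjAcc
A_hat /= np.maximum(A_hat.sum(1, keepdims=True), 1.0)  # simple row-normalization
X = rng.normal(size=(n, d))                            # node features of the sampled batch
W = rng.normal(scale=0.1, size=(d, h))                 # one-layer GCN weights
W_disc = rng.normal(scale=0.1, size=(h, h))            # bilinear discriminator weights

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
gcn = lambda A, X: np.maximum(A @ X @ W, 0.0)          # single propagation layer with ReLU

H_pos = gcn(A_hat, X)                                  # embeddings of the real patch
H_neg = gcn(A_hat, X[rng.permutation(n)])              # corrupted patch: row-permuted features
s = sigmoid(H_pos.mean(axis=0))                        # read-out summary vector

pos = sigmoid(H_pos @ W_disc @ s)                      # discriminator scores: real vs. summary
neg = sigmoid(H_neg @ W_disc @ s)                      # discriminator scores: corrupted vs. summary
loss = -(np.log(pos + 1e-9).mean() + np.log(1.0 - neg + 1e-9).mean())
print(loss)

In the actual pipeline, the toy A_hat and X would be replaced by the batch-restricted matrices accumulated during Traverse, and the loss would be minimized with the usual gradient-based learning rule.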
9.6.2 NodeEmbeddingImplementations 9.6.2.1 Node2Vec[GroverandLeskovec,2016] A simple implementation follows from above: N2vAcc,DeepWalkAcc; but overrideBiasFn = defN2vBias(T;u): returnp 1[i=T 2 ] q 1[hA[T 2 ];A[u]i>0] ; (9.19) where1 denotes indicator function,p;q > 0 are hyperparameters of node2vec assigning (unnormalized) probabilities for transitioning back to the previous node or to node connected to it.hA[T 2 ];A[u]i counts mutual neighbors between considered nodeu and previousT 2 . An alternative implementation is tonotoverrideBiasFn but rather fold it intoAccumulateFn, as: def N2vAcc(T;u;f): DeepWalkAcc(T;u;f); [u] [u] N2vBias(T;u); (9.20) Both alternatives are equivalent in expectation. However, the latter directly exposes the parametersp and q to the objectiveL: allowing them to be dierentiable w.r.t.L and therefore trainable via gradient descent, rather than by grid-search. Nonetheless, parameterizingp &q is beyond our scope. 135 9.6.2.2 WatchYourStep[WYS,Abu-El-Haijaetal.,2018] First, embedding dictionariesR;L2R n d 2 can be initialized to random. Then repeatedly over batches BV, the lossL can be initialized to estimate the negative part of the objective: L X u2B log(E v2V [hR u ;L v i +hR v ;L u i]); Then call GTTF’s traverse passing the followingAccumulateFn= def WysAcc(T;u): ifT:size()6=Q:size(): return; t T [0]; U T [1 :][ [u]; ctx_weighted_L X j Q j L U j ; ctx_weighted_R X j Q j R U j ; L L log((hR t ;ctx_weighted_Li +hL t ;ctx_weighted_Ri)); 136 Chapter10 di: PyFrameworkforDistributedComputingofMatricesandtheir ImplicitDecompositions This Chapter discusses a python framework, di (distributed implicit): a distributed matrix computation engine. di can encode matrix arithmetic symbolically as a computation directed acyclic graph (DAG). di adopts common python nomenclature API syntax, e.g., @, +, -, di.diag(), di.concat(), etc. Like other frameworks,di can calculate matrices (by propagatingup the DAG). However,di is unique in its ability to multiply a matrix without necessarily calculating it entry-by-entry. Rather, multiplications are propagated down through the DAG, which immediately enables ecient algorithms for matrix decomposition, such as Singular Value Decomposition (SVD). We experimentally show the power of di in two ways. First, it allows us to obtain state-of-the-art on the largest link prediction task signicantly faster than any other alternative (that we are aware of). Second, it allows us to calculate matrix-free SVD of dense but structured matrices, signicantly faster than scipy’s matrix-free SVD. The contents of this chapter can be found standalone in [Abu-El-Haija et al., 2022] and the source-code for reproducing the results is available on http://tinyurl. com/z44tj2ep 137 10.1 Introduction distributed implicit, abbreviated as di, is a python programming framework for computing matrices in a distributed fashion, and we apply it for Machine Learning tasks over large graphs. Similar to TensorFlow [Abadi et al., 2016] and PyTorch [Paszke et al., 2019], di represents computations using an computation directed acyclic graph (DAG), where each DAG node corresponds to a matrix. Similar to dask [Dask Dev Team, 2016],di can also divide large computation jobs across one or more workers. As expected of a com- putation engine, di can evaluate a matrixM at any DAG node using graph traversal: evaluate all DAG nodes, in topological order, starting from leaf nodes (reachable byM) and propagating computation up- ward towardsM. 
However, in contrast to these existing popular frameworks,di can also multiply against M without calculatingM entry-by-entry, by propagating multiplicands downward fromM to leaf nodes. We refer to this asimplicitmultiplication, which immediately enables matrix-free decompositions, such as Singular Value Decomposition (SVD). Beyond our application domain, SVD has broad utility as summa- rized in §10.1.1. In fact, matrix-free SVD of di is central to our application, as we train link prediction models over large graphs (millions of entities). The matrices we decompose are too large (>50 million rows) or have (n 2 ) non-zeros. Many graph datasets only get larger with time. For instance, new products on Amazon and new pa- pers published on ArXiv grow on co-purchase and citation graphs, respectively. To model the increasing complexity of information on graphs, such as neighborhood patterns, researchers have designed Graph Neural Networks (GNNs), which are ML models that incorporate certain (transformation) layers, such as Graph Convolution [Bruna et al., 2014, Deerrard et al., 2016] or Graph Convolutional Network (GCN) of Kipf and Welling [2017]. Various methods further expand model capacity by incorporating neural archi- tectures such as attention [Veličković et al., 2018], neighborhood-mixing [Abu-El-Haija et al., 2019, Sun et al., 2021], residual skip-connections [Chen et al., 2020], distance-encoding [Li et al., 2020], and many more. Although GNNs exhibit strong empirical performance on node classication and link prediction 138 tasks, this comes with a signicant cost in computation time and memory. Therefore, methods and im- plementations have been proposed that scale learning more eciently using techniques such as: sampling [Hamilton et al., 2017, Ying et al., 2018, Markowitz et al., 2021]; solving sub-problems e.g. by graph parti- tioning [Chiang et al., 2019, Salha et al., 2021] or hierarchically [Chen et al., 2018a]; or relying on standard networks e.g., MLP, rather than GNNs therefore ignoring the graph during training, but use the graph for pre- or post-processing [Wu et al., 2019, Klicpera et al., 2019, Bojchevski et al., 2020, Huang et al., 2021a] Per the work described in Chapters 7 & 8 [Abu-El-Haija et al., 2021a,b], matrix-free SVD provides another alternative for speeding-up learning on graphs. In particular, given GNN modelh, one can come- up with a design matrix M such that, SVD(M) can estimate the parameters of e h, which is a version ofh that is linearized by convexication. While linear model e h can be trained very quickly via matrix-free SVD, it has slightly-worse empirical performance compared to its deeper counter-part,h. We show in [Abu-El- Haija et al., 2021b] that e h can initializeh especially if they are co-designed, such that,h is non-linear (i.e., deep) almost everywhere though behaves linearly when its parameters reside on some hyperplane that is activated upon initialization from e h (§10.3.4). Unfortunately, the algorithms and implementations discussed in Chapters 7 & 8 [Abu-El-Haija et al., 2021a,b], assume that the input matrices, e.g., adjacency and features, are small enough to t in memory, rendering the earlier methods as impractical for large graphs. As such, di is a scalable implementation that can run as a single-worker or as a distributed system. 
In addition to engineering features expected of a distributed system, including scheduling, communication, and exception propagation, di also incorporates algorithmic enhancements, including a modification of the popular randomized SVD routine outlined in [Halko et al., 2009], making it recursive and replacing the orthonormalization step with a randomized version (§10.3.6).

10.1.1 Utility of SVD

Since there are several approaches to speeding up or scaling (graph) neural networks, we motivate our contribution by highlighting some merits of SVD:
1. Applicability. SVD recovers the pseudo-inverse of a matrix, efficiently solving a linear system of equations. It has also been used for low-rank embeddings and matrix completion for tasks such as recommendation [Koren et al., 2009]. Further, many Machine Learning (ML) problems can be cast as the SVD of an application-specific design matrix, such as for vision [Turk and Pentland, 1991], language [Deerwester et al., 1990, Levy and Goldberg, 2014], graphs [Qiu et al., 2018], and many others.
2. Global Convergence. SVD can find the unique global optimum∗ of convex problems minimizing Euclidean-distance objectives. This obviates the need for hyper-parameters of training algorithms, such as the update rule and its coefficients (learning rate, momentum, etc).
3. Fast Graph Learning. SVD finds solutions without calculating gradients (no back-propagation). Further, implicit (matrix-free) SVD can turn computationally-quadratic (all node-pairs) objective functions into computationally-linear problems. It estimates GNN parameters orders of magnitude faster than other state-of-the-art alternatives [Abu-El-Haija et al., 2021b].
4. Initializes (non-linear) deep GNNs that are architected to be non-linear (deep) almost everywhere, yet behave linearly when their parameters reside on a hyperplane, onto which SVD initializes (Theorem 9, §10.3.4).
5. Scalability. Scalable SVD algorithms rely on matrix multiplications and orthonormalization [Halko et al., 2009]. The first can scale linearly with the number of workers [Gupta and Kumar, 1995], with some overhead cost. We propose approximations for scalable distributed orthonormalization.
∗ If multiple optima exist, SVD yields the one with least norm.

10.1.2 Chapter's Contributions

1. This chapter discusses di: a Python framework for defining arbitrarily-large matrices using the usual matrix operations, e.g., arithmetic, concatenation, etc. di can compute matrices, one submatrix at a time, using one or many worker machines, and can decompose a matrix implicitly (without computing the matrix entry-wise). di natively integrates with NumPy, TensorFlow, and PyTorch.
2. This chapter uses di to learn GNNs on the largest link-prediction task of Stanford's Open Graph Benchmark [OGB, Hu et al., 2020], speeding up training from days to hours. To our knowledge, di enables by far the fastest SOTA learning algorithms on large graphs (§10.3.4). Empirical performance was aided by a novel design matrix, as well as a new non-linear layer (SplitRes) with a linear region.
3. Algorithmic contributions, including approximations to SVD: we modify the randomized algorithm of [Halko et al., 2009] to invoke itself, and approximate orthonormalization of large matrices by adding random noise and then invoking a Cholesky decomposition (§10.3.6). In addition, we designed a simple heuristic for graph partitioning that is chronologically-aware (§10.3.3).

10.2 Related Work

Others also distribute SVD and/or training of GNNs.
ScalableSVD: Mahout and Spark [2014] can calculate the SVD of a large matrices, stored distributed lesystem [e.g., HDFS, Shvachko et al., 2010], utilizing a distributed computation engine [e.g., MapReduce, Dean and Ghemawat, 2004]. However, these distributed SVD frameworks require the matrix undergoing SVD to be explicitly (entry-wise) saved on disk. In our cases of interest, this requirement is prohibitive: the matrix might be too large and often quadratic in the input size (§??). On the other hand,di allows exible denition of matrices, and can parallel-compute their SVD, without computing the matrices themselves. Tables 10.1 & 10.2 compare memory requirements of SVD implementations. 141 Table 10.1: Size limit on matrix undergoing SVD when using dierent frameworks, against main memory (RAM) of one machine. NumPy is not implicit, while others are. Since di is sharded and distributed, it needs not to t the input in RAM. 1 worker must t numpy SciPy [2008] Abu-El-Haija et al. [2021b] di input Y Y Y N M Y N N N Distributed training of neural networks: TensorFlow and PyTorch both have modules for dis- tributed model training. There are two modes of operation: (i) Synchronous SGD and (ii) Asynchronous SGD (a.k.a, downpour SGD). Each has merits. (i) is equivalent to standard (single-worker) SGD: the worker machines synchronize and average their gradients, at each mini-batch. This equivalency comes at a cost: all workers must wait for the slowest one, and the communication is frequent (incurring communication over- head). On the other hand, (ii) relaxes the synchronization requirement at every mini-batch: Each worker updates its local copy of model parameters and only communicates its (stale) gradients based on some congurable frequency [Dean et al., 2012, Zhang et al., 2015]. Distributing SVD has dierent advantages, as it calculates no gradients. In fact, it gets the best of both worlds: matrix multiplication is parallelizable [Gupta and Kumar, 1995], yet distribution maintains exactness to the one-worker calculation. Scalable GNN training: Beyond arbitrary neural nets, many methods scale training of GNNs by stochastic sampling, such as, GraphSAGE [Hamilton et al., 2017], FastGCN [Chen et al., 2018b], PinSAGE [Ying et al., 2018], ClusterGCN [Chiang et al., 2019], GTTF [Markowitz et al., 2021]. Our method diers. Rather than sampling, our method is equivalent to a full-batch method, as it computes the SVD of an arbitrarily large design matrix encoding information from the entire graph, however, to models that must be restricted as linear. Nonetheless, any of the aforementioned methods can be used to ne-tune a deeper version of the linear model parameters that were obtained via full-batch SVD training. We use GTTF for ne-tuning (§10.3.4). Contributionssinceourproof-of-concept[Abu-El-Haijaetal.,2021b]: distributed implementa- tion, which requires approximation tricks, such as orthonormalization and a randomized SVD algorithm 142 Table 10.2: Size limit on matrix undergoing SVD when using distributed frameworks, against lesystem size (of all available machines). Out of those, only di oers implicit multiplication and therefore matrix- free SVD. Filesystem must t Spark Mahout Dask di input Y Y Y Y M Y Y Y N that invokes itself (§10.3.6); novel design matrix (for incorporating node features); new SplitRes layer – equipping Split-ReLu layer of Abu-El-Haija et al. 
[2021b] with skip-connections (while maintaining the non-linear behavior though linear on a hyperplane); as well as gradient analysis (in Appendix), formally showing that linearity lasts only until the rst gradient update. 10.3 DistribitedImplicit(di) di can read matrices (e.g., adjacency or features), transform them (e.g., via matrix multiplication, con- catenation), to compute desired outputs (e.g., trained GNN paramters). It couples implicit representation [Chapter 7, Abu-El-Haija et al., 2021a,b] with distributed computation (e.g., of dask). To demonstratedi, we rst start with a set of high-level requirements highlighting who would use di, followed by a front-to- back explanation of how to use it. 10.3.1 High-levelRequirements To usedi, you probably should have: • One or more worker computers (e.g., laptop). Each worker will execute python code. If there are multiple workers, they must be part of a private network, e.g., LAN, as they will communicate via Remote Procedure Calls [using PyTorch RPC, 2019]. Workers must also share a lesystem, e.g., viassh mount. • Data matrices (e.g., adjacency and features): either directly available on disk, or by implementing func- tions that can construct them. If they are too large, they must be sharded, with each shard small to t 143 on RAM (e.g.,<2GB), over which, you aim to train a (GNN) modelh for some task, e.g., link prediction or node classication. • To initialize deep GNNs using matrix-free SVD, you can also: – Design linear model e h, resulting from convexication of deeper GNN h (e.g., by removing non- linearities, see §10.3.4 and Abu-El-Haija et al. [2021b]), and derivation of design matrix M, such that, parameters of e h can be estimated from the basesU;S;V SVD(f M ). – Re-architecth to behave as a deep non-linear GNN, almost everywhere, but behaves linearly when its parameters reside on some hyperplane that is activated when initializing from e h. Although dimensions ofM can be arbitrarily large, the decomposition rankk must not be too large, such that, matrix of sizekk must t on memory of a single machine (e.g.,k can be up-to thousands). 10.3.2 Interfacefordeningandcomputingmatrices 10.3.2.1 Inputmatrices Suppose you want to refer to the adjacency matrix using python variable A, then you must provide a function (load_adj()) that can be queried for submatrices ofA. 1 shards = [0, 10, 20, 30, 40] 2 A = di.sparse_lambda( 3 load_adj, splits=(shards, shards)) load_adj is expected to return submatrix containing values of A[r0:r1, c0:c1] , when called with data_range = ((r0, r1), (c0, c1)) . The possible values of ((r0, r1), (c0, c1)) are deter- mined by shards. The splits argument species how rows and columns are sharded. As a toy im- plementation, let us load the full matrix (pickle toscipy.csr_matrix), then index into the data-range, as: 1 @di.register_fn 2 def load_adj(common_args, data_range): 144 3 row_range, col_range = data_range 4 row_start, row_end = row_range 5 col_start, col_end = col_range 6 7 a = pickle.load(open("adj.pkl", "b")) 8 a_shard = a[row_start:row_end] 9 return a_shard[:, col_start:col_end] The 5 entries in shards yield 4-way sharding on both the row and colum axes, therefore load_adj is invoked 16 times, among available workers. For example, load_adj(data_range=((30, 40), (0, 10 ))) will be invoked by a worker to recover the bottom-left submatrix. 
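For illustration only (plain Python, not a di API), the splits above induce a 4-by-4 sharding, so load_adj is queried with 16 distinct data_range values; which worker handles which range is up to di's scheduler:

import itertools

shards = [0, 10, 20, 30, 40]
ranges = list(zip(shards[:-1], shards[1:]))           # [(0, 10), (10, 20), (20, 30), (30, 40)]
data_ranges = list(itertools.product(ranges, ranges))

print(len(data_ranges))                     # 16 submatrices, one load_adj(data_range=...) call each
print(((30, 40), (0, 10)) in data_ranges)   # the bottom-left submatrix is among them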
Note that whiledi.sparse_lambda represents sparse matrices, its sister method,di.dense_lambda, allows for dense matrix construction, e.g.: 1 X = di.dense_lambda( 2 load_feats, splits=(shards, [0, 128])) Here, we assume thatload_feats() returns a submatrix containing 10 examples, each with 128-d feature vector. Note that the column axis ofX is not sharded. 10.3.2.2 Transformingmatrices Arithmetics overA andX can now be programmed, e.g., 1 A_sl = A + di.eye_like(A) #self-loops. 2 d = A_sl.sum(axis=0) #degree vector. 3 D_sqrtinv = di.diag(d ** -0.5) #normalizer. 4 Ahat = D_inv_sqrt @ A @ D_inv_sqrt 5 M = Ahat @ Ahat @ X #SimpleGCN’s preprocess So far,M encodes the matrix b A 2 X. The simple (yet powerful) model of [SimpleGCN Wu et al., 2019] advo- cates for GNN modelh SGCN W = b A 2 XW, where parametersW are trained to minimize (softmax) cross- entropy objectiveL XE asL XE ((h SGCN W ) [L] ;Y [L] ), and subscript notation [L] selects rows corresponding 145 to labeled examples. If one replacesL XE with Euclidean distance objective, then per Eq.??, the minimizer arg min W jj(h SGCN W ) [L] Y [L] jj 2 F = b A 2 X + [L] Y [L] , can be estimated by SVD as: 1 M_tr = di.gather(M, L) 2 Y_tr = di.gather(Y, L) 3 U, s, V = di.svd(M_tr, rank=500) 4 5 W = V @ di.diag(1/s) @ (U.T @ Y_tr) Empirically we found that, when only a small fraction of nodes are labeled, i.e.,jLj n, then it works better to calculate SVD of entireM then select rows fromU corresponding to labeled examples, as: 1 U, s, V = di.svd(M, rank=500) 2 U_tr = di.gather(U, L) 3 Y_tr = di.gather(Y, L) 4 5 W = V @ di.diag(1/s) @ (U_tr.T @ Y_tr) By default, svd() is calculated implicitly i.e. the matrix undergoing SVD will not be calculated entry-by- entry. To force an explicit SVD, you can supplyimplicit=False 10.3.2.3 Materializing: writingsubmatricesondisk The practitioner likely needs the value ofW, which can be stored on disk as sharded les: 1 w_s = W.write("/fs/w-<rows>-<cols>.npz") Substrings "<rows>" and "<cols>" get replaced by strings encoding ranges: "row.%i-%i"%(r0, r1), and the path /fs/ is assumed mounted on all workers. The code snippets, thus far, all take0 seconds. Not suprising, as no matrices have yet been com- puted. Previously-used di functions only constructs the DAG. Figure 10.1 shows DAG ofM (NE) (Eq. ??). Figure 10.2 shows three DAG’s, each corresponding to a single operation, but specically show the shard- ing for distributing the computation (§10.3.3) 146 @ + @ x x ½ ⅓ - 1 T @ - x A Figure 10.1: DAG ofM (NE) =T + 1 2 T 2 + 1 3 T 3 (1A) with grey, yellow, and red colors, respectively, corresponding to input, transformation, and output. @ = @ @ @ + = T .T = cat _diag( , ) 0 0 Figure 10.2: Three work graphs (left: matrix-multiply; center: transpose; right: concatenate along diag- onal), with input matrices in blue and orange, and output matrices in red. Each matrix consists of sub- matrices (sharding can be non-uniform, e.g., right). For each, we hightlight on the work graph one node corresponding to a submatrix on the output matrix, and all its incoming edges until the input submatri- ces. All-zero submatrices are marked to increase eciency (e.g., if right-most matrix was further used to produce other matrices). 10.3.2.4 Computeengine To actually compute (i.e. materialize) a matrix, you can run the compute engine, e.g., by calling: 1 w_s = di.compute(w_s) di.compute() accepts either one or a list matrices, all must be dened on the di DAG (i.e. output of a di. function). Further, di.compute() is a blocking call. 
It creates a work graph, to execute on available workers (§10.3.3). 147 For instance, if you conguredi to run on 8 workers, then your terminal will display 8 progress bars, one per worker, as: Although di.compute returns its input matrices, it is more ecient to point the reference of w_s to the return rather than keep it to the argument of di.compute(). The return reference causes subsequent invocations ofdi.compute() involving expressions ofw_s to re-use the previously computedw_s. 10.3.3 Distributingtheworkgraphamongworkers The work graph gets initialized from the DAG as follows. First, the target matrices, i.e. the argument of di.compute(), get traversed towards the DAG leaf nodes, in this case, until reaching X, and A. Each traversed node (matrix) then becomes multiple nodes on the work graph: one per submatrix. The sub- matrices are connected via directed edges indicating the dataow. Figure 10.2 shows three sample work- graphs, each for a single operation (e.g., matrix-multiply, using@). For each, we show one target submatrix node (in red), its source submatrix nodes (grey), and intermediate computation nodes along the path from source to target. Figure 10.3 depicts a more-realistic work graph, corresponding to the SVD of transition matrix (to reduce clutter, we drop the rectangular matrices and only depict submatrices as circle nodes). Now we discuss how the work graph could be partitioned onto multiple workers: we would like to approximately equally divide the work among workers. This problem is an NP-hard problem, and is known as balanced graph partitioning, or normalized cuts [Shi and Malik, 2000, Andreev and Racke, 2004, Feldmann, 2012], with many fast approximation algorithms, including recent method of Dhulipala et al. [2021] 148 @ 1 1 1 2 @ @ @ A 22 A 11 + + d 1 d 2 D -1 11 @ @ @ @ D -1 22 ~ ~ @ @ @ @ ... ... ... ... U 1 V 1 s U 2 V 2 A 21 A 12 diag 1/ . 1/ . diag # load_A() reads sharded adjacency A = di.sparse_lambda(load_A, …) # Transition Matrix T d = di.sum(A, axis=1) Dinv = di.diag(1 / d) T = Dinv @ A # Calculate SVD U, S, V = di.svd(T, rank=k) di.compute([U, S, V]) #blocking-call Figure 10.3: di code (left) and work graph (right), with brace-brackets marking their correspondence. The work graph can be divided into two workers. Existing graph-cut algorithms may prefer the purple cut (as it removes only 4 edges) whereas our heuristic prefers the green cut (as it keeps both workers busy, until computation nishes) Our problem is similar: we would like to equally divide the work graph into sub-graphs, each sub- graph to be executed by one worker, such that, edges crossing subgraphs be as minimal as possible, so as to reduce inter-worker communication of sub-matrices. However, our problem is at least as dicult. Not only we would like the workers to do (approximately) the same amount of work, we also wish that they are busy at all times. Suppose we have two available workers to execute the work graph depicted in Figure. 10.3. Which cut divides the work better, the dashed purple or dashed green cut? The purple cut only removes 4 edges, but keeps one worker idle until the computation reaches the cut. On the other hand, the green cut removes more edges (6 depicted), though keeps the two workers busy until the computation nishes. While the aforementioned algorithms may choose the purple cut, as they ignore the chronology of nodes, we would like to use a heuristic that returns the green cut. 
We develop a chronologically-aware balanced graph cuts algorithm, that executes in the following steps: 1. Let A WG denote the adjacency matrix of the work graph, which we make undirected i.e. A WG = A > WG . Further, let b A WG be its symmetric normalization per §??. We calculate low-rank estimate for the Personalized PageRank matrix I (1) b A WG 1 using SVD as: U;S;V SVD(I (1 ) b A WG ) 149 2. Suppose there are L leaf nodes on the work graph, i.e., submatrices with no incoming edges, e.g., constructed programatically withdi.dense_lambda. Suppose these occupy the rstL rows & columns inA WG . We construct a dense graph asA L =V [1:L] S + U > [1:L] , which we pass to the normalized cut algorithm by Shi and Malik [2000], producing subgraphs, one per worker. At this point, only the leaf nodes has been divided among workers. 3. We compute the longest distance from all graph nodes to the target nodes (e.g., depicted in red in Figures 10.2 and 10.3). We group nodes by distance. We pick the group that is furthest from target nodes, then we round-robin distribute its nodes among the workers, growing each worker graph by one node. Specically, the worker will choose one node that has the largest number of edges incident on other nodes already assigned to the worker. 10.3.4 Application: SOTALinkPredictionGNNs We now alternate the tone and describe howdi can be incoporated in training higher-capacity GNNs, e.g., for achieving competitive SOTA performance, as measured on largeest link prediction task of Stanford Open Graph Benchmark [OGB Hu et al., 2020]. In these models, we follow the process of [Abu-El-Haija et al., 2021b] where we (i) pick a GNN modelh, (ii) linearize into e h to train via SVD, then (iii) use e h to initialize non-linear GNNh to ne-tune deep onL XE . Here, our contribution is in introducing novel new choices forh and e h, than the ones outlined in [Abu-El-Haija et al., 2021b]. LinkPrediction: Given a graph, with its node featuresX2R nd and partially-observed adjacency A2R nn , the goal is to predict (test) edges inA that are not observed by the training algorithm. This has many applications, including in social networks and recommendation systems. Our model combines inspirations from models of SEAL [Zhang and Chen, 2018] and DeepWalk [Perozzi et al., 2014]. Like SEAL, for modeling score of edge (i;j), we would like to run GNN model oni and also on j, extracting neighborhood features summarizing both nodes,butwhilecarefullyassumingthat (i;j) 150 is not an edge, even if it exists. This assumption resembles the test scenario, where by denition, test edges are unobserved. Further, SEAL supplements node features with distance information: the shortest- distance between all nodes in batch [n] to both offi;jg is measured amended toX. We utilize another distance scheme, utilizing DeepWalk: we encode structural information by computing (implicit) SVD on the link-prediction design matrixM (NE) (dened in Eq.??). For reference, other distance-encoding schemes are viable, such as choosing anchor nodes per [Li et al., 2020]. We now formalize our model ingredients, show-casing the utility of di. We dene (implicit) matrix M (pw) 2R RD , where pw stands for pairwise, where rowr2 [R] encodes information from node pair (i;j). 
Coded indi as: 1 gc_X = di.dense_lambda(make_gc_feats) 2 h_X = di.dense_lambda(make_heuristic_feats) 3 dw_X = di.dense_lambda(make_dw_feats) 4 5 M_lp = di.concat([gc_X, dw_X, h_X], axis=1) The height of the matrix can be arbitrarily large, even larger than the number of positive edges – we useR = 60; 000; 000 for dataset ogbl-citation2. The make_*_feats() functions are implemented to be pseudorandom and the random seeding is consistent across all three. Each invocation receives row-range (e.g., [r0:r1] ), and should extract information for a total ofb =r1 - r0 edges. The rst b Kn rows VS the lastb b Kn rows, respectively, correspond to positive VS (sampled) negative edges. We useK n = 4 implying 4 negative edges for each positive. Allm positive edges are pre-shued. Then, an invocation to anymake_*_feats instantiates: 1 # .. consistently in def make_*_feats() .. 2 # Pos edges (deterministic given r0, r1) 3 pos_ids = np.range(r0, r0 + (r1-r0) / Kn) 4 pos_ids = pos_ids % m #mod to round-robin 5 pos_edges = train_edges[pos_ids] 151 6 7 # Neg edges (seed random with r0, r1) 8 np.random.seed((r0, r1)) 9 n_negs = (r1-r0) - (r1-r0) / K 10 neg_edges = np.random.randint( 11 low=0, high=n, size=(n_negs, 2)) 12 13 # Concat pos&neg to extract edge feats 14 all_edges = np.concatenate( 15 [pos_edges, neg_edges], axis=0) We now detail themake_*_feats functions: • Heuristic features. For each candidate edge (i;j), funtionmake_heuristic_feats will create a vector containing statistics like: [CommonNeighbors(i;j), Jaccard(i;j), AdamicAdar(i;j)]. • Structural DeepWalk features. Given edge (i;j), functionmake_dw_feats returns vector of Hadamard product: (U [i] +V [i] ) (U [j] +V [j] ). The choice of Hadamard allows the the full model (next) to recover symmetric version of score() function of Eq. ??. U andV are computed using SVD(f M (NE)), including up-to the 4 th powerT and using = 0:001. • Graph-convolution neighborhood features. Given edge (i;j), function make_gc_feats utilizes GTTF [Markowitz et al., 2021] to conduct two traversals, one from nodei andj, each of the two traversals yields a vector, giving: " X [i] E h \ A n(i;j) X i [i] E \ A n(i;j) 2 X [i] # and " X [j] E h \ A n(i;j) X i [j] E \ A n(i;j) 2 X [j] # ; 152 where the \ A n(i;j) is the adjacency matrix but excluding edge (i;j) if it exists, and normalized ac- cording to §??. E comes from GTTF’s sampling process, yielding unbiased estimates. Both feature vectors contain 3 sub-vectors (separated by dashed line), with each sub-vector beingd-dimensional. Denote rst line with [Z (0) [i] . . .Z (1) [i] . . .Z (2) [i] ], where matrixZ (p) =E[ \ A n(i;j) p X]2R Rd . The nal output of make_gc_feats per edge (i;j) is an all-pair hadamard product, combining every sub-vector from the i th with every subvector from thej ’th vector, totaling to 9d-dimensions. The individual vectors should seem familiar to graph practitioners. Specically, it combines elements from SimpleGCN [Wu et al., 2019] and JKNet [Xu et al., 2018], where the rst implies propagation (repeated multiplications byA) followed by learning a model (next), and the second implies that all propagation layers get combined together (we use concatenation). Altogether, the three make_*_feats functions produce matrixM (pw) 2R RD whose rowr can be depicted as: M (pw) [r] = h Z (0) [i] Z (0) [j] Z (1) [i] Z (0) [j] .. 
Z (2) [i] Z (2) [j] (U +V) [i] (U +V) [j] graphStats(i;j) i (10.1) where the rst line contributes 9d dimensions and is extracted usingmake_gc_feats, and the second line is extracted via the other twomake_*_feats functions. We wish to learn a parameter column vectorw (out) 2R D1 such that the pairwise model e h, e h(i;j) =M (pw) [r] w (out) 2R (10.2) scores high if (i;j) is a positive edge and low otherwise. LearningviaSVD Suppose we have column vectorY with rowY [r] = 1 ifr corresponds to a positive (training edge) andY [r] =1 if it corresponds to a negative (sampled) edge. Then, as discussed in Eq.??, 153 the optimumw (out) that minimizes the Euclidean distance of e h fromY is given by M (pw) + Y. We rst dene implicitY, asY indi: 1 @di.register_fn 2 def make_target_y(common_args, data_range): 3 r0, r1 = data_range[0] 4 num_edges = r1 - r0 5 num_pos = num_edges / Kn 6 num_neg = num_edges - num_pos 7 return np.concatenate([ 8 np.ones([num_pos, 1]), 9 -1 * np.ones([num_neg, 1]), 10 ], axis=0) 11 12 Y = di.dense_lambda(make_target_y) then obtain and save the global minimizer, asw indi: 1 U, s, V = di.svd(M, rank=500) 2 w = V @ di.diag(1/s) @ (U.T @ Y) 3 w = di.compute(w,"/fs/w-<rows>-<cols>.npz") The second argument todi.compute instructsdi to savew, obviating the need to invokew.write() † . Finetunedeepermodel. To obtain SOTA performance, we use learned e h to initialize deeper model h; Deepend as: 1. Classication parameters,w (out) of Eq. 10.2, will be converted into a 3-layer MLP with residual connec- tions. 2. Graph convolution features,Z in Eq. 10.1, will be replaced by GNN with residual connections. † write() always writes the matrix, but passing the lename to compute() will compute-and-write only if les do not exist. If les exist, thencompute() would be non-blocking and will return an DAG matrix that would read the data from disk. 154 Both of these deepening operations utilize split-ReLu layer introduced in [Abu-El-Haija et al., 2021b], such that, at the initialiation point,h(i;j) = e h(i;j) for alli;j. We extend the split-ReLu layer into a residual one. We now proceed to replace everyZ in Eq. 10.1 withH, dened as: H (2) =g 2 (Z (2) ); H (1) =g 1 (g 2 (Z (1) )) (10.3) H (0) =g 0 (g 1 (g 2 (Z (0) ))) (10.4) Similarly, we deepen the classier from:xw (out) intog out (x)w (out) . Everyg() contains 3-layers as: g(Z) = SplitRes(SplitRes(SplitRes(Z;Z);Z);Z) (10.5) No SplitRes layers share parameters. Each computes: SplitRes(H;Z) = [HW p +Z] + [HW n Z] + ; (10.6) where ReLu [:] + = max(:; 0) is applied element-wise and the layer has two parameter matrices,W p and W n – which are2R dd forg 0 ;g 1 ;g 2 and are2R DD forg out – associated with thepositive andnegative terms. The two terms are combined with element-wise subtraction. It is crucial that, while the SplitRes layer is generally non-linear, and therefore its stacking is considered “deep”, however, when its parameter values reside on the hyperplaneW p =W n , then SplitRes behaves as linear in both of its arguments. Theorem9. OnhyperplaneW p =W n ,SplitRes(H;Z)islinearinHandZ. Further,adeepnet,stacking SplitReswillbehaveasalinearnetwork,specically,allowingformodelequivalenceatinitializationh(i;j) = e h(i;j). Theorem 9 extends the linear region of the Split ReLu layer, introduced in [Abu-El-Haija et al., 2021b], for incorporating residual connections. The signicance of linear regions enable initializations from the 155 Table 10.3: Implicit Operations (rst column) and sample code (second column) for composing the DAG. 
Every implicit DAG nodeM must be able to propagate-down the multiplications to its constituents i.e. implement functionsf M andf > M (last 2 columns). operator description code:M = Mvf M (v) v > Mf > M (v) matrix plus matrix B 1 +B 2 f B 1 (v) +f B 2 (v) f > B 1 (v) +f > B 2 (v) matrix times matrix B 1 @B 2 f B 1 (f B 2 (v)) f > B 2 (f > B 1 (v)) matrix times scalar aBBa af B (v) af > B (v) matrix plus scalar a +BB +a f B (v) +a1(1 > v) f > B (v) +a1(1 > v) column-wise concat cat([B 1 ;B 2 ];axis=1) f B 1 (v[:c 1 ]) +f B 2 (v[c 1 :]) cat([f > B 1 (v);f > B 2 (v)]; 1) row-wise concat cat([B 1 ;B 2 ];axis=0) cat([f B 1 (v);f B 2 (v)]; 0) f > B 1 (v[:c 1 ]) +f > B 1 (v[c 1 :]) transpose B.T f > B (v) > f B (v) > Table 10.4: Wall-clock time on our hardware for training SOTA models on link prediction task ogbl- citation2 versus Mean Reciprocal Rank (MRR, high is better). Below the middle line are our methods. Model Implementation Device Run-time MRR SEAL (1 epoch) FAIR GitHub repo RTX 2080 Ti 26 hours 85.6 SEAL (50 epochs) FAIR GitHub repo RTX 2080 Ti 54 days 87.7 GraphSAGE (1 epoch) OGB ocial github repo Intel Xeon Gold 2.2 GHz 21 hours 34.7 GraphSAGE (50 epochs) OGB ocial GitHub repo Intel Xeon Gold 2.2 GHz 44 days 82.6 GCN (1 epoch) OGB ocial github repo Intel Xeon Gold 2.2 GHz 27 hours 50.7 GCN (50 epochs) OGB ocial github repo Intel Xeon Gold 2.2 GHz 56 days 84.7 Ours: SVD(f M (pw)) Attached 8 workers using Intel Xeon 1 hour 81.5 + Finetune Attached RTX 2080 Ti 15 hours 87.5 solution obtained (in closed-form) via SVD, recovering the (decent) test accuracy due to model equivalence. In practice, netuningh for only a handful of epochs, will suce to obtain SOTA performance. Proof of Theorem 9 is in the appendix, with some further analysis on the gradients obtained during ne-tuning, showing the parameters of SplitRes will deviate from the hyperplane (i.e. it will be come deep), on the rst gradient update. The ne-tuning code does not use di though it reuses functions make_*_feats. It minimizes the cross-entropy objective using TensorFlow, which should be native to ML practitioners. Details are in the accompanied code. 156 10.3.5 ImplicitMultiplication The distributive property of matrix multiplication allows us to recursively push multiplications down the computation DAG towards leaf nodes. This is similar to removing paranthesis, e.g, (B 1 +B 2 )v =B 1 v + B 2 v. However, distribution is only possible into linear operations, e.g., tanh(B 1 +B 2 )v6= tanh(B 1 v + B 2 v). As such, each operation on the DAG can either be: explicit, i.e., non-linear operations, such as, di.element_wise, di.diag, which cannot propagate multiplication towards constituents; or implicit, with a subset of them listed in Table 10.3, while additional ones can be implemented in terms of other implicit operations (e.g.,di.gather is implemented by multiplication against sparse matrix with one-hot selector rows). Every implicit operation, computing matrixM2R RC must implement two functions:f > M :R C ! R R andf > M :R R !R C that can multiplyM with any properly-sized vector, respectively, from the right and from the left. These functions are implemented in terms ofM’s constituents. 10.3.6 Approximationstospeed-upSVD Randomized SVD algorithim of Halko et al. [2009] accepts a largeM2R RC matrix and decomposition rank k 2 . Assume R < C (if not, M will be transposed then resulting U and V can be swapped). The algorithm rst samples random matrixQ2R Rk , from standard normal, then repeats four steps: 1. Q MQ; 2. 
10.3.6 Approximations to speed up SVD

The randomized SVD algorithm of Halko et al. [2009] accepts a large matrix M ∈ R^{R×C} and a decomposition rank k. Assume R < C (if not, M is transposed, and the resulting U and V are swapped). The algorithm first samples a random matrix Q ∈ R^{R×k} from the standard normal, then repeats four steps: 1. Q ← M Q; 2. Q ← orthonormalize(Q); 3. Q ← M^T Q; 4. Q ← orthonormalize(Q); then, finally, it computes a non-randomized SVD on the (smaller) R×k matrix Q. In our case, we cannot conduct the non-randomized SVD, as we do not assume that R×k is small enough to fit on one machine; we allow R (and C) to be arbitrarily large. Therefore, di's randomized SVD function invokes itself (just once). In the inner invocation, our Q is k×k, which indeed fits on one machine, and over it we invoke the SVD of PyTorch.

Further, a naive orthonormalization implementation requires access to entire matrix columns at once. However, our Q matrix is sharded along the rows (R) dimension (but not along k). Therefore, we use an alternative orthonormalization based on the Cholesky decomposition, specifically:

    orthonormalize(Q) = Q (cholesky(Q^T Q)^{-1})^T        (10.7)

The good news: Eq. 10.7 never looks at entire columns at once. Specifically, the product Q^T Q multiplies each submatrix of Q with its transpose (on the worker it lives on), and then the k×k matrices are accumulated onto one worker and summed to produce Q^T Q. The bad news: Q must have full column rank, which is not guaranteed after multiplications with M. To overcome this, we make it full column rank (w.h.p.) by adding small noise to Q, sampled entry-wise, i.e., independently by the workers.
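For reference, here is a single-machine NumPy sketch of Eq. 10.7 (in di, the k×k Gram matrix Q^T Q is assembled by summing per-worker shard contributions as described above; the noise argument below only illustrates the full-column-rank trick):

    import numpy as np

    def orthonormalize(Q, noise=0.0):
        if noise:
            Q = Q + noise * np.random.randn(*Q.shape)   # keep Q full column rank (w.h.p.)
        G = Q.T @ Q                           # k x k Gram matrix
        L = np.linalg.cholesky(G)             # G = L L^T, with L lower-triangular
        return Q @ np.linalg.inv(L).T         # Eq. 10.7: the columns become orthonormal

    Q = np.random.randn(10000, 32)
    U = orthonormalize(Q)
    assert np.allclose(U.T @ U, np.eye(32), atol=1e-6)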
10.3.7 Engineering Highlights

We highlight some engineering aspects of di. First, di requires no installation. It only assumes that your controller (e.g., your laptop) can SSH to the master machine without a password (e.g., your public key is added to the master), and that the master machine can SSH to the worker machines without a password. Upon di.compute, di copies itself, together with all files containing registered functions (via @di.register_fn) that are used in the work graph to be computed. Second, exceptions (e.g., stack traces) are propagated from the workers to the master to your controller. Third, di's computation framework automatically finds an unused TCP port for the distributed RPC communication (we use PyTorch RPC). Further details are in the reference manual.

10.4 Experiments & Results

10.4.1 Runtime of implicit SVD of SciPy vs. di

[Figure 10.4: SVD run-time of the implicit SVD implementations of scipy versus di (using 4 worker machines); x-axis: percent of kept edges (%), y-axis: time for implicit SVD (minutes).]

We measure the run-time of our implicit SVD implementation (running on 4 workers) versus scipy's implicit SVD implementation (we supply a LinearOperator instance). We compute the SVD of the matrix sum_{q=1}^{4} (1/q) T^q, where the transition matrix T comes from the OGB graph dataset ogbn-products (with 2.5 million nodes). We take 5 different versions of the graph, keeping all (62 million) edges, 80% of the edges, ..., down to 20% of the edges. The run-times are summarized in Figure 10.4. We see that the speed-up is sub-linear; e.g., for the full graph, scipy's implementation requires 31 minutes (using one worker) versus 8 minutes for ours (using 4 workers). Note that it is impractical to do the SVD over this matrix using distributed dask, as it requires the matrix entry-wise (needing 25 terabytes of disk space).

10.4.2 Link Prediction: ogbl-citation2

Table 10.4 shows that di can speed up training over the largest link prediction graph dataset offered by OGB (ogbl-citation2; n = 2.9 million, m = 30.6 million, d = 128). We run the baseline methods SEAL [Zhang and Chen, 2018], GraphSAGE [Hamilton et al., 2017] and GCN [Kipf and Welling, 2017] using their official GitHub codebases. Since the last two models consume more memory than our GPU can fit (we do not have an expensive Tesla V100, unlike the authors), we had to train them on CPU.

10.5 Analysis

10.5.1 Proof of Theorem 9

We want to show that, on the hyperplane W_p = W_n,

    SplitRes(H, Z) = [H W_p + Z]_+ - [-H W_n - Z]_+        (10.8)

is linear in both H and Z. Assume the parameters are on the hyperplane; then simplify as:

    [H W_p + Z]_+ - [-H W_p - Z]_+ = [H W_p + Z]_+ - [-(H W_p + Z)]_+ = H W_p + Z,

where the first equality comes from the hyperplane assumption (we can replace W_n of Eq. 10.8 with W_p), the second just factors a negative out of the ReLU input, and the final one follows from element-wise analysis: the input of the left ReLU equals the negative of the input of the right ReLU at every matrix element. Let v_rc = (H W_p + Z)_[rc] and consider the two disjoint cases that partition the space:

1. If v_rc ≥ 0, then the left ReLU outputs the positive value v_rc at position rc, the right ReLU outputs 0, and their element-wise subtraction (the final output) is v_rc.
2. If v_rc < 0, then the left ReLU outputs zero, the right ReLU outputs -v_rc, and their element-wise subtraction (the final output) is -(-v_rc) = v_rc.

In both cases, the output is v_rc = (H W_p + Z)_[rc].

Now we show the remainder of the theorem. We want an initialization scheme that yields h̃ = h. This reduces to finding an initialization such that:

    g(Z) = SplitRes(SplitRes(SplitRes(Z, Z), Z), Z) = Z.

In fact, we can just initialize all W_p = W_n = 0.

10.5.2 Deviation from the linearity hyperplane upon fine-tuning

While we initialize the non-linear layer in its linear region, it is important to verify that the linearity does not last: the layer actually becomes non-linear (and therefore the network becomes deep) at the first gradient update. To analyze this, we would like to check that ∂L/∂W_p ≠ ∂L/∂W_n, where L is the objective function (e.g., cross-entropy). At the initialization, for every location rc, one of the two ReLUs outputs zero; at that location, the partial derivative of L with respect to the input of that ReLU is also zero. This implies that the Hadamard product

    ∂L/∂(H W_p + Z)  ⊙  ∂L/∂(-H W_n - Z)        (10.9)

is an all-zero matrix, implying that any submatrix of the left derivative, if rasterized, is orthogonal to the corresponding submatrix of the right derivative (taken at the same indices). Applying the sum rule of derivatives (on the denominators) implies:

    ∂L/∂(H W_p)_[:, c]  ⊥  ∂L/∂(H W_n)_[:, c]        (10.10)

for any column c. Since ∂L/∂W_p = H^T ∂L/∂(H W_p) and ∂L/∂W_n = H^T ∂L/∂(H W_n), it is easy to see that the gradient updates ∂L/∂W_p and ∂L/∂W_n cannot be parallel. Therefore, they send W_p and W_n in directions that deviate from the hyperplane.

10.5.3 Trading off exact equivalence for model capacity

Finally, if all g's are initialized with W_p = W_n = 0, then it is straightforward to show exact equality, i.e., h(i, j) = h̃(i, j) at the initialization. However, an analysis of the learning gradients, ∂L_XE/∂W_p, shows that every g would then be limited in capacity to a recurrent layer. As such, we loosen the exactness of the equality and seek a model with h(i, j) ≈ h̃(i, j) at the initialization point. We proceed with:

    W_p ∼ N(0, σ), entries drawn i.i.d.        (10.11)
    W_n ← W_p                                   (10.12)

The extent to which the equality holds depends on σ; we use σ = 0.001.
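To make the approximate equivalence tangible, the following is a small NumPy sketch of the initialization in Eqs. 10.11-10.12, checking numerically that the stacked residual layers stay close to the identity for a small σ (the names and shapes are illustrative, not the accompanying code):

    import numpy as np

    def split_res(H, Z, W_p, W_n):
        relu = lambda X: np.maximum(X, 0.0)
        return relu(H @ W_p + Z) - relu(-H @ W_n - Z)

    def g(Z, params):
        # g(Z) = SplitRes(SplitRes(SplitRes(Z, Z), Z), Z), as in Eq. 10.5 (no parameter sharing).
        H = Z
        for W_p, W_n in params:
            H = split_res(H, Z, W_p, W_n)
        return H

    d, sigma = 16, 1e-3
    params = []
    for _ in range(3):
        W_p = sigma * np.random.randn(d, d)   # Eq. 10.11
        params.append((W_p, W_p.copy()))      # Eq. 10.12: W_n <- W_p
    Z = np.random.randn(32, d)
    print(np.abs(g(Z, params) - Z).max())     # small for small sigma, so h(i, j) ≈ h̃(i, j) at init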
Chapter 11

Conclusions

The research area of Machine Learning on Graphs, a.k.a. Graph Representation Learning (GRL), has been attracting much attention from various communities, causing rapid advancement that can be explained by the following factors:

(i) Graphs are applicable to many domains. They are general data structures. For example, algorithms for learning friendship recommendation on a social network can be utilized for a different application, such as predicting whether two drugs chemically interact given a partially-observed drug-drug interaction network. In fact, these two are examples of link prediction: a general task on graph data.

(ii) ML on Graphs borrows and generalizes contributions from other ML fields. Neural architectures including, e.g., BatchNorm [Ioffe and Szegedy, 2015], DropOut [Srivastava et al., 2014], Residual Networks [He et al., 2016a], Convolution [LeCun et al., 1998], and many others, inspire recent graph neural networks.

(iii) Unprecedented growth of available graph datasets. Over the last decade, the amount of digitized data shaped as graphs has been growing massively, and the Internet facilitates sharing of data. For example, the Stanford Open Graph Benchmark (OGB) [Hu et al., 2020] and Stanford SNAP [Leskovec and Krevl, 2014a] bundle graph datasets from various domains, including biological (protein- or drug-interaction graphs), social, and citation networks. Other examples are MoleculeNet [Wu et al., 2018], a dataset of chemical molecules stored as graphs, and ConceptNet [Speer et al., 2017], a knowledge graph. While larger data generally implies more information, it is coupled with the challenge of how to train fast and efficiently scale to larger graphs.

11.1 High-level Summary

The content of this thesis is intertwined with the above three factors. Specifically, (i), the contributions of all chapters have been experimentally verified on data that stems from different domains, including social networks (e.g., the Facebook dataset), citation networks (e.g., Cora, PubMed, Citeseer, ArXiv), biological networks (e.g., drug-drug interactions, protein-protein interactions), and others. This shows that, in general, an advancement can simultaneously impact multiple domains. Moreover, (ii), the thesis brings several neural architectural and modeling improvements. To mention some instances: N-GCN (Chapter 4) proposes to instantiate multiple graph neural networks (GNNs), each instance on a different scale of the graph, and then combine all of them via an output layer, extending vanilla GNNs to consider further nodes; MixHop (Chapter 5) proposes a novel graph convolution layer, which can learn a generalized feature detector that can capture Gabor filters; finally, Chapter 8 (deepening shallow models) introduces the SplitReLu layer, which is hand-crafted to encode a standard fully-connected ReLU layer in some region of its parameter space, yet also encode linear layers in another region of its parameter space. These two regions enable fast training yet strong modeling abilities. The last sentence is a segue towards (iii), as a couple of chapters focused on faster training of neural networks. Chapter 7 shows how one can train linearized models in closed form, which trains models very fast but unfortunately with some sacrifice in the model's empirical performance. However, Chapter 8 shows how the parameters obtained for the linear model can be used to initialize a deeper model (architected using the SplitReLu layer).
In addition, Chapter 9 makes training faster by sampling the graph, via an algorithm that is programmed using tensor operations, i.e., one that can be parallelized on GPU. Finally, while GTTF (Chapter 9) shows how one can scale training of gradient-based graph learning, di (Chapter 10) shows how one can scale learning of convexified GNNs (i.e., it scales the learning described in Chapters 7 & 8).

11.2 Towards Data-efficient Graph Representation Learning

A number of chapters were dedicated to models that improve the empirical performance of graph neural networks, especially in situations where only a few labels are given, including Chapters 4, 5 and 6.

Chapter 4 discusses N-GCN, which is a meta-model that can wrap arbitrary graph convolution models, such as GCN [Kipf and Welling, 2017] and SAGE [Hamilton et al., 2017], on the output of random walks. Traditional graph convolution models operate on the normalized adjacency matrix. N-GCN makes multiple instantiations of such models, feeding each instantiation a different power of the adjacency matrix, and then feeds the output of all instances into a classification sub-network. Each instantiation therefore operates on a different scale of the graph. The N-GCN model, Network of GCNs (and similarly, Network of SAGE), is end-to-end trainable, and is able to directly learn information across near or distant neighbors. Experiments in the chapter show that the Network of a model has superior empirical test performance compared to the underlying model on its own. Inspecting the distribution of parameter weights in the classification sub-network reveals that the Network model effectively circumvents adversarial perturbations on the input by shifting weights towards the model instances consuming higher powers of the adjacency matrix.

Chapter 5 presents the MixHop model. The chapter analyzes the expressive power of popular methods for semi-supervised node classification using Graph Neural Networks, and proves that vanilla GNNs cannot learn general neighborhood mixing functions. MixHop employs a graph convolutional layer that utilizes multiple powers of the adjacency matrix. Repeated application of this layer allows a model to learn general mixing of neighborhood information, including averaging and delta operators in the feature space, without additional memory or computational complexity. Experimental results show that higher-order graph convolution can achieve state-of-the-art performance on several node classification tasks. Analysis of the experimental results showed that neighborhood difference operators are especially useful in graphs which do not have high homophily (correlation between edges and labels). While the chapter focused on applying our higher-order convolution to the most popular models for graph convolution, it is possible to implement it on top of more sophisticated models, such as GAT [Veličković et al., 2018]. Finally, while it is not the main purpose of the chapter, Watch Your Step (Chapter 6) analyzes the objective function of DeepWalk and swaps it with the Negative Log Graph Likelihood (NLGL, [Abu-El-Haija et al., 2017]). We find that the NLGL objective is superior to the hierarchical softmax employed by DeepWalk, as suggested by experimental results on the attempted datasets.

11.3 Summary for Human Efficiency

A few chapters presented ideas for saving the time of human practitioners, including Chapters 5, 6, and 7. Specifically, they present ways of arriving at state-of-the-art graph models with fewer experiments.
Roughly, this corresponds to having fewer hyper-parameters, so as to reduce the grid search.

Chapter 5 (MixHop) employs a (differentiable) form of architecture search. In particular, since every layer has various powers of the adjacency matrix, it could be a burden to sweep the latent dimensions devoted to every power of every layer. The latent dimensions get determined by applying L2 group lasso regularization on the layer parameters. According to our experiments, the automatically-found values for the layer dimensions outperform manual choices, and yield a unique architecture per dataset.

Chapter 6 (Watch Your Step, WYS) automatically learns the context distribution, which is used to control the random walk and node sampling in methods like DeepWalk [Perozzi et al., 2014] and node2vec [Grover and Leskovec, 2016]. A closed-form expectation of DeepWalk's co-occurrence statistics matrix was derived, showing an equivalence between the context distribution hyper-parameters and the coefficients of the power series of the graph transition matrix. The chapter proposes to replace the context hyper-parameters with trainable models that are jointly learned with the embeddings, on an objective that preserves the graph structure (the Negative Log Graph Likelihood, NLGL). The chapter uses a graph attention model to learn a free-form context distribution with a parameter for each type of context similarity (e.g., distance in a random walk). Learning the context distribution significantly improves the performance of link prediction and node classification tasks over state-of-the-art baselines (that use a fixed-form context distribution), reducing error on link prediction and classification by up to 40% and 10%, respectively. In addition to improved performance (by learning distributions of arbitrary forms), our method can obviate the manual grid search over hyper-parameters, namely the walk length and the form of the context distribution, which can drastically change the quality of the learned embeddings and differ for every graph. On the considered datasets, the method is robust to its hyper-parameters, as described in Section 6.3.2. Visualizations of the learned context distributions convey that some graphs (e.g., voting graphs) are better preserved by using longer walks, while other graphs (e.g., protein-protein interaction graphs) contain more information in short dependencies and require shorter walks.

We take this opportunity to motivate the next section. Even though the WYS model yields state-of-the-art embedding performance, showing significant improvements especially when learning the context distribution parameters (q_1, ..., q_w), there is a drawback: learning requires significant computation. In particular, it is cubic in the number of nodes, i.e., O(|V|^3), making the method of Chapter 6 prohibitive to execute on any graph with more than 20,000 nodes and therefore unfit for many real-world situations. Even though A, and therefore T, contain O(|E|) non-zero entries, T^j would have exactly |V|^2 non-zero entries if j is set to the diameter of the graph (i.e., the distance between the two furthest nodes). The principle of six degrees of separation states that, for most real-world graphs, any two nodes are at most 6 hops apart. Raising T to a power beyond the diameter thus entails cubic time complexity.
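To illustrate the equivalence mentioned above, here is a schematic NumPy sketch of the expected co-occurrence statistics as a q-weighted power series of the transition matrix T (it omits the degree scaling and sampling details of Chapter 6, and the function name is illustrative):

    import numpy as np

    def expected_cooccurrence(T, q):
        # E[D] expressed as a power series of T, weighted by context parameters q_1, ..., q_w.
        E = np.zeros_like(T)
        T_power = np.eye(T.shape[0])
        for q_j in q:                  # walk positions j = 1, ..., w
            T_power = T_power @ T      # dense powers of T are what make this O(|V|^3)
            E += q_j * T_power
        return E

    A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)   # toy adjacency matrix
    T = A / A.sum(axis=1, keepdims=True)                           # row-stochastic transition matrix
    print(expected_cooccurrence(T, q=[0.5, 0.3, 0.2]))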
Moreover, Chapter 7 removes hyper-parameters concerning the training procedure, including the choice of the optimization algorithm, e.g., ADAM [Ba and Kingma, 2015], and optimization hyper-parameters such as the step size, momentum, and regularization coefficients. The chapter proposes to find the optimal parameters of a linearized model in closed form. In addition to saving computational resources, this also obviates the need to run many experiments sweeping the aforementioned hyper-parameters, as well as from various random initializations, since convergence is guaranteed.

11.4 Machine Efficiency

It is always of benefit to train models faster while using less memory, especially if, while doing so, one can guarantee that the model will perform well on the test task. While many methods, described in this thesis and outside of it, consume more hardware resources than necessary, this thesis dedicates a couple of chapters to reducing the requirements on computational resources, specifically, to training faster and to using less memory while training. Chapter 9 (GTTF: Graph Traversal with Tensor Functionals) uses a sampling-based approach, describing a functional algorithm that, when specialized, can provide unbiased estimates of the learning signals, recovering a wide class of graph learning algorithms. On the other hand, Chapters 7 & 8 take a linear-algebraic approach to speed up training and use less memory.

Chapter 9 presents an algorithm, Graph Traversal via Tensor Functionals (GTTF), that can be specialized to re-implement the algorithms of various Graph Representation Learning methods. The specialization takes little effort per method, making it straightforward to port existing methods or introduce new ones. Methods implemented in GTTF run efficiently, as GTTF uses tensor operations to traverse graphs. In addition, the traversal is stochastic (i.e., it samples subgraphs) and therefore automatically makes the implementations scalable to large graphs. The chapter provides theoretical guarantees that the learning outcome due to the stochastic traversal is, in expectation, equivalent to the baseline in which the graph is observed at once, for the popular GRL methods we analyze. Experimental evaluation confirms that methods implemented in GTTF maintain their empirical performance, and can be trained faster and using less memory even compared to software frameworks that have been thoroughly optimized.

Chapter 7 (Fast Graph Learning) offers an alternative way to train Graph Neural Networks (GNNs). Rather than the usual backpropagation, Chapter 7 poses the model as a linear function whose optimal parameters can be found in closed form, i.e., without calculating gradients, and with guaranteed model convergence. For each considered task type (link prediction or node classification), the linearization of the model is achieved by designing a matrix such that its singular value decomposition (SVD) produces the trained model parameters. The chapter has a technical contribution: a Python framework that allows users to describe matrices, and can compute the SVD of the described matrices without computing the matrices themselves. In other words, the framework provides syntactic sugar for matrix-free SVD. While this yields significant speed-ups and memory savings, it suffers from two things: (1) the linearization of the model, although it speeds up training, yields trained models that are worse in empirical performance than state-of-the-art (SOTA) deep GNNs, and (2) it assumes that the graph matrices (e.g., adjacency and features) are small enough to fit in the memory of one machine. Both of these weaknesses are resolved by subsequent chapters.
Specifically, Chapter 8 shows how the linear model can be made deeper: the parameters of the deeper model, once initialized from the linear model, can be trained within a handful of iterations, and the overall training procedure is hundreds of times faster than the baselines. Moreover, Chapter 10 shows how these methods can be distributed onto multiple machines, as summarized in Section 11.5.

11.5 Scaling to Larger Graphs

Two of the chapters discussed methods for scaling learning to large graphs, specifically, Chapters 9 & 10.

Chapter 9 introduced GTTF. The presented functional algorithm can be used to approximate many graph learning algorithms. The algorithm utilizes the Compact Adjacency data structure and traverses the graph via tensor operations. The size of the Compact Adjacency equals that of the graph, requiring storage that is linear in the number of edges. Luckily, frameworks like TensorFlow and PyTorch allow for sharding a single tensor onto many machines. Practitioners wishing to apply GTTF to very large graphs should follow tutorials for sharding tensors. Since GTTF is described entirely using tensor operations, it suffices to distribute the underlying tensor engine (TensorFlow or PyTorch).

Chapter 10 describes a Python framework, di, for representing matrices implicitly and calculating products against those matrices using a distributed system, without necessarily calculating the matrices themselves. This can enable many applications, including matrix-free SVD, which is crucial when the matrix undergoing the decomposition is too expensive to compute, i.e., too large and/or too dense. In addition to popular use-cases of SVD, this chapter focused on scaling learning for the graph task of link prediction. In that sense, Chapter 10 can scale the methods introduced in Chapters 7 and 8 to arbitrarily large graphs. di achieves SOTA on the largest OGB link prediction task, much faster than the present SOTA alternatives.

Bibliography

Martín Abadi, Ashish Agarwal, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. In USENIX Symposium on Operating Systems Design and Implementation, OSDI, pages 265–283, 2016.

Sami Abu-El-Haija. Proportionate gradient updates with percentdelta. In arXiv, 2017.

Sami Abu-El-Haija, Bryan Perozzi, and Rami Al-Rfou. Learning edge representations via low-rank asymmetric projections. In ACM on Conference on Information and Knowledge Management, 2017.

Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems, NeurIPS, 2018.

Sami Abu-El-Haija, Amol Kapoor, Bryan Perozzi, and Joonseok Lee. N-gcn: Multi-scale graph convolution for semi-supervised node classification. In Uncertainty in Artificial Intelligence, 2019. URL http://auai.org/uai2019/proceedings/papers/310.pdf.

Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, ICML, 2019.

Sami Abu-El-Haija, Valentino Crespi, Greg Ver Steeg, and Aram Galstyan. Fast graph learning with unique optimal solutions. In ICLR 2021 Workshop on Geometrical and Topological Representation Learning, 2021a.

Sami Abu-El-Haija, Hesham Mostafa, Marcel Nassar, Valentino Crespi, Greg Ver Steeg, and Aram Galstyan. Implicit svd for graph representation learning.
In Advances in Neural Information Processing Systems, 2021b. Sami Abu-El-Haija, Hesham Mostafa, Marcel Nassar, Somdeb Majumdar, Greg Ver Steeg, Valentino Crespi, Wes Hardaker, and Aram Galstyan. di: Python framework for distributed computing of matrices and their implicit decompositions. In In submission, 2022. Rami Al-Rfou, Guillaume Alain, and others. Theano: A Python framework for fast computation of math- ematical expressions. arXiv e-prints, abs/1605.02688, 2016. URL http://arxiv.org/abs/1605.02688. Konstantin Andreev and Harald Racke. Balanced graph partitioning. InACMSymposiumonParallelismin Algorithms and Architectures, pages 120–124, 2004. W. E. Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem. In Quarterly of Applied Mathematics, pages 17–29, 1951. James Atwood and Don Towsley. Diusion-convolutional neural networks. In Advances in Neural Infor- mation Processing Systems, 2016. 171 Jimmy Ba and Diederik Kingma. Adam: A method for stochastic optimization. InInternationalConference on Learning Representations, 2015. L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S Vigna. Four degrees of separation. In ACM Web Science Conference, pages 33–42, 2012. Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR), 2015. Pierre Baldi and Peter Sadowski. The dropout learning algorithm. In Articial Intelligence, 2014. Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data represen- tation. In Neural Computation, 2003. Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research, 7(11):2399–2434, 2006. Aleksandar Bojchevski, Johannes Klicpera, Bryan Perozzi, Amol Kapoor, Martin Blais, Benedek Rózem- berczki, Michal Lukasik, and Stephan Günnemann. Scaling graph neural networks with approximate pagerank. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2464–2473, 2020. Aritra Bose, Vassilis Kalantzis, Eugenia-Maria Kontopoulou, Mai Elkady, Peristera Paschou, and Petros Drineas. Terapca: a fast and scalable software package to study genetic variation in tera-scale genotypes. In Bioinformatics, pages 3679–3683, 2019. Leon Bottou. Online algorithms and stochastic approximations. In Online Learning and Neural Networks, 1998. Leon Bottou. Stochastic learning. In Advanced Lectures on Machine Learning, Lecture Notes in Articial Intelligence, vol. 3176, Springer Verlag, 2004. Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations, 2014. Thang D. Bui, Sujith Ravi, and Vivek Ramavajjala. Neural graph learning: Training neural networks using graphs. In Proceedings of 11th ACM International Conference on Web Search and Data Mining (WSDM), 2018. D. Calvetti, L. Reichel, and D.C. Sorensen. An implicitly restarted lanczos method for large symmetric eigenvalue problems. In Electronic Transactions on Numerical Analysis, pages 1—-21, 1994. Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning graph representations. In AAAI Conference on Articial Intelligence, 2016. Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and Kevin Murphy. Machine learning on graphs: A model and comprehensive taxonomy. 
In Journal on Machine Learning Research, 2022. Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. Harp: hierarchical representation learning for networks. In AAAI Conference on Articial Intelligence, 2018a. Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via im- portance sampling. In International Conference on Learning Representations, 2018b. 172 Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolu- tional networks. In International Conference on Machine Learning, ICML, 2020. Siheng Chen, Sufeng Niu, Leman Akoglu, Jelena Kovacevic, and Christos Faloutsos. Fast, warped graph embedding: Unifying framework and one-click algorithm. arxiv:1702.05764, 2017. Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. Cluster-gcn: An ecient algorithm for training deep and large graph convolutional networks. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019. Dask Dev Team. Dask: Library for dynamic task scheduling, 2016. URL https://dask.org. John Daugman. Two-dimensional spectral analysis of cortical receptive eld proles. In Vision Research, 1980. John Daugman. Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical lters. In Journal of the Optical Society of America, 1985. Jerey Dean and Sanjay Ghemawat. Mapreduce: Simplied data processing on large clusters. In Sympo- sium on Operating System Design and Implementation, pages 137–150, 2004. Jerey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, 2012. Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. In- dexing by latent semantic analysis. In Journal of the American Society for Information Science, pages 391–407, 1990. Michaël Deerrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral ltering. In Advances in Neural Information Processing Systems, 2016. Laxman Dhulipala, David Eisenstat, Jakub Lacki, Vahab Mirrokni, and Jessica Shi. Hierarchical agglomer- ative graph clustering in nearly-linear time. In International Conference on Machine Learning, 2021. C. Eckart and G. Young. The approximation of one matrix by another of lower rank. In Psychometrika, pages 211–218, 1936. Andreas Emil Feldmann. Fast balanced partitioning is hard, even on grids and trees. In International Symposium on Mathematical Foundations of Computer Science, 2012. Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 2017. C. Goller and A. Kuchler. Learning task-dependent distributed representations by backpropagation through structure. In International Conference on Neural Networks, 1996. G. Golub and W. Kahan. Calculating the Singular Values and Pseudo-Inverse of a Matrix. SIAMJournalon Numerical Analysis, 1965. 173 Gene H. Golub and Charles F. Van Loan. Matrix Computations, pages 257–258. John Hopkins University Press, Baltimore, 3rd edition, 1996. 
Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. Anshul Gupta and Vipin Kumar. Scalability of parallel algorithms for matrix multiplication. InInternational Conference on Parallel Processing, 1995. doi: 10.1109/ICPP.1993.160. N Halko, Martinsson P. G., and J. A Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. In ACM Technical Reports, 2009. N. Halko, P.G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. In SIAM Review, 2011. W. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, 2017. William L. Hamilton. Graph representation learning. Synthesis Lectures on Articial Intelligence and Ma- chine Learning, 14(3):1–159. David K Hammond, Pierre Vandergheynst, and Rémi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016a. Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. Fast matrix factorization for online recommendation with implicit feedback. In International ACM SIGIR Conference on Research and Devel- opment in Information Retrieval, SIGIR ’16, 2016b. Georey Hinton. Neural network tutorials. https://www.cs.toronto.edu/~hinton/nntut.html, NN-tutorials. Olivia Hsu and Chuanqi Chen. Cs224w project: The study of drug-drug interaction learning through var- ious graph learning methods, 2021. URL https://github.com/chuanqichen/cs224w/blob/main/the_study_ of_drug_drug_interaction_learning_through_various_graph_learning_methods.pdf. Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In arXiv, 2020. Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin Benson. Combining label propagation and simple models out-performs graph neural networks. InInternationalConferenceonLearningRepre- sentations, 2021a. Qian Huang, Horace He, Abhay Singh, Ser-Nam Lim, and Austin Benson. Combining label propagation and simple models out-performs graph neural networks. InInternationalConferenceonLearningRepre- sentations, 2021b. URL https://openreview.net/forum?id=8E1-f3VhX1o. Sergey Ioe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In arXiv:1502.03167, 2015. Fariba Karimi, Mathieu Genois, Claudia Wagner, Philipp Singer, and Markus Strohmaier. Visibility of minorities in social networks. In arxiv/1702.00150, 2017. 174 Thomas Kipf and Max Welling. Semi-supervised classication with graph convolutional networks. In International Conference on Learning Representations, 2017. Johannes Klicpera, Aleksandar Bojchevski, and Stephan Gunnemann. Combining neural networks with personalized pagerank for classication on graphs. In International Conference on Learning Representa- tions, 2019. Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. In IEEE Computer, page 30–37, 2009. C Lanczos. 
An iteration method for the solution of the eigenvalue problem of linear dierential and integral operators. In Journal of Research of the National Bureau of Standards, pages 255–282, 1950. Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haner, et al. Gradient-based learning applied to doc- ument recognition. Proc. of the IEEE, 86(11):2278–2324, 1998. Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. Pytorch-biggraph: A large-scale graph embedding system. In The Conference on Sys- tems and Machine Learning, 2019. Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap. stanford.edu/data, 2014a. Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap. stanford.edu/data, 2014b. Jure Leskovec, Anand Rajaraman, and Jerey D. Ullman.MiningofMassiveDatasets. Cambridge University Press, 2014. Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, NeurIPS’14, pages 2177–2185, 2014. Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. In Transactions of the Association for Computational Linguistics, 2015. Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. Distance encoding: Design provably more powerful neural networks for graph representation learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, pages 4465–4478, 2020. D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. InJournalofAmerican Society for Information Science and Technology, 2007. Qing Lu and Lise Getoor. Link-based classication. In Proc. of the International Conference on Machine Learning (ICML), 2003. Y. Luo, Q. Wang, B. Wang, and L. Guo. Context-dependent knowledge graph embedding. InConferenceon Emperical Methods in Natural Language Processing (EMNLP), 2015. Yi Luo, Aiguo Chen, Bei Hui, and Ke Yan. Memory-associated dierential learning, 2021. Mahout. Distributed stochastic singular value decomposition. URL https://mahout.apache.org/docs/ latest/algorithms/linear-algebra/d-ssvd.html. 175 Elan Sopher Markowitz, Keshav Balasubramanian, Mehrnoosh Mirtaheri, Sami Abu-El-Haija, Bryan Per- ozzi, Greg Ver Steeg, and Aram Galstyan. Graph traversal with tensor functionals: A meta-algorithm for scalable learning. In International Conference on Learning Representations, ICLR, 2021. Selin Merdan, Christine L. Barnett, and Brian T. Denton. Data analytics for optimal detection of metastatic prostate cancer. Technical report, University of Michigan, 2017. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Je Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 2013. Volodymyr Mnih, Nicolas Heess, Alex Graves, and koray kavukcuoglu. Recurrent models of visual atten- tion. In Advances in Neural Information Processing Systems (NIPS), 2014. F. Molteni, R. Buizza, T.N. Palmer, and T. Petroliagis. The ecmwf ensemble prediction system: Methodology and validation. In Quarterly Journal of the Royal Meteorological Society, pages 73–119, 1996. Mengyang Niu. Gat+label+reuse+topo loss ogb submission. In GitHub, 2020. URL https://github.com/ mengyangniu/dgl/tree/master/examples/pytorch/ogb/ogbn-arxiv. Antonio Ortega, Pascal Frossard, Jelena Kovačević, José M. 
F. Moura, and Pierre Vandergheynst. Graph signal processing: Overview, challenges, and applications. ProceedingsoftheIEEE, 106(5):808–828, 2018. doi: 10.1109/JPROC.2018.2820126. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bring- ing order to the web. Technical report, Stanford InfoLab, 1999. URL http://ilpubs.stanford.edu: 8090/422/. Adam Paszke, Sam Gross, Francisco Massa, and Others. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. Jerey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word represen- tation. In Empirical Methods in Natural Language Processing, EMNLP, pages 1532–1543, 2014. Bryan Perozzi and Leman Akoglu. Discovering communities and anomalies in attributed graphs: Inter- active visual exploration and summarization. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 24:1–24:40, 2018. Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In ACM SIGKDD international conference on Knowledge discovery and data mining, KDD, pages 701–710, 2014. Omri Puny, Heli Ben-Hamu, and Yaron Lipman. Global attention improves graph networks generalization, 2020. PyTorch RPC. Pytorch: Distributed rpc framework, 2019. URL https://pytorch.org/docs/stable/rpc.html. Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. InInternationalConferenceonWebSearchand Data Mining, WSDM, pages 459–467, 2018. 176 Vignesh Ramanathan, Jonathan Huang, Sami Abu-El-Haija, Alexander Gorban, Kevin Murphy, and Li Fei- Fei. Detecting events and key actors in multi-person videos. InIEEEConferenceonComputerVisionand Pattern Recognition (CVPR), 2016. Shunli Ren. Gat(norm.adj.) + label reuse + self kd for ogbn-arxiv. In GitHub, 2020. URL https://github. com/ShunliRen/dgl/tree/master/examples/pytorch/ogb/ogbn-arxiv. Guillaume Salha, Romain Hennequin, Jean-Baptiste Remy, Manuel Moussallam, and Michalis Vazirgiannis. Fastgae: Scalable graph autoencoders with stochastic subgraph decoding. NeuralNetworks, pages 1–19, 2021. SciPy. scipy.sparse.linalg.svds - scipy v1.7.1 manual, 2008. URL https://docs.scipy.org/doc/scipy/ reference/generated/scipy.sparse.linalg.svds.html. Jianbo Shi and J. Malik. Normalized cuts and image segmentation. InIEEETransactionsonPatternAnalysis and Machine Intelligence, pages 888–905, 2000. K. Shvachko, Hairong Kuang, S. Radia, and R. Chansler. The hadoop distributed le system. In IEEE Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, 2010. Steven Skiena. The Algorithm Design Manual: Second Edition. Springer, 2008. Spark. Dimensionality reduction - rdd-based api, 2014. URL https://spark.apache.org/docs/2.2.0/ mllib-dimensionality-reduction.html. Robyn Speer, Joshua Chin, and Catherine Havasi. Conceptnet 5.5: An open multilingual graph of general knowledge. In AAAI Conference on Articial Intelligence, 2017. Nitish Srivastava, Georey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 
Dropout: a simple way to prevent neural networks from overtting. In Journal of Machine Learning Research, pages 1929–1958, 2014. Chris Stark, Bobby-Joe Breitkreutz, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, and Mike Tyers. Biogrid: a general repository for interaction datasets. In Nucleic Acids Research, 2006. Chuxiong Sun and Guoshi Wu. Adaptive graph diusion networks with hop-wise attention, 2020. Chuxiong Sun, Hongming Gu, and Jie Hu. Scalable and adaptive graph neural networks with self-label- enhanced training. In ArXiv/2104.09376, 2021. Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In International Conference on Data Mining (ICDM). IEEE, 2006. Jerey Travers and Stanley Milgram. An experimental study of the small world problem. In Sociometry, pages 425–443, 1969. M. Turk and A. Pentland. Eigenfaces for recognition. In Journal of Cognitive Neuroscience, pages 71—-86, 1991. L.J.P. van der Maaten and G.E Hinton. Visualizing high-dimensional data using t-sne. InJournalofMachine Learning Research, 2008. Stefan van der Walt, S. Chris Colbert, and Gael Varoquaux. The numpy array: A structure for ecient numerical computation. In Computing in Science and Engineering, 2011. 177 Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018. Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. In International Conference on Learning Representations, 2019. Hongwei Wang and Jure Leskovec. Unifying graph convolutional neural networks and label propagation, 2020. Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, Jinjing Zhou, Chao Ma, Lingfan Yu, Yu Gai, Tianjun Xiao, Tong He, George Karypis, Jinyang Li, and Zheng Zhang. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315, 2019. Yangkun Wang. Bag of tricks of semi-supervised classication with graph neural networks, 2021. Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pages 639–655. Springer, 2012. Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International Conference on Machine Learning, 2019. Kesheng Wu and Horst Simon. Thick-restart lanczos method for large symmetric eigenvalue problems. In SIAM Journal on Matrix Analysis and Applications, pages 602–616, 2000. Zhenqin Wu, Bharath Ramsundar, Evan Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh Pappu, Karl Leswingd, and Vijay Pande. Moleculenet: a benchmark for molecular machine learning. In Journal of Chemical Science, 2018. Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, ICML, pages 5453–5462, 2018. Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, ICLR, 2019. Yichen Yang, Lingjue Xie, and Fangchen Li. Global and local context-aware graph convolution networks, 2021. URL https://github.com/JeffJeffy/CS224W-OGB-DEA-JK/blob/main/CS224w_final_report.pdf. 
Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, 2016a. Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classication. InConferenceoftheNorthAmericanChapteroftheAssociationfor Computational Linguistics (NAACL), 2016b. Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In ACM SIGKDD international conference on Knowledge discovery and data mining, 2018. Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. GraphSAINT: Graph sampling based inductive learning method. In International Conference on Learning Representa- tions, 2020. 178 Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, 2018. Sixin Zhang, Anna Choromanska, and Yann LeCun. Deep learning with elastic averaging sgd. InAdvances in Neural Information Processing Systems, NeurIPS’15, page 685–693, 2015. Xiaojin Zhu, Zoubin Ghahramani, and John Laerty. Semi-supervised learning using gaussian elds and harmonic functions. In Proc. of the International Conference on Machine Learning (ICML), 2003. Difan Zou, Ziniu Hu, Yewen Wang, Song Jiang, Yizhou Sun, and Quanquan Gu. Few-shot representation learning for out-of-vocabulary words. In Advances in Neural Information Processing Systems, 2019. 179