Graph Embedding Algorithms for Attributed and Temporal Graphs

by

Palash Goyal

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Science)

December 2019

Copyright 2020 Palash Goyal

Acknowledgements

I would like to thank my advisor, Professor Emilio Ferrara, for constantly guiding me through the research and providing me his valuable insights. Without his excellent supervision and persistent mentoring, I would not have been able to complete this thesis. I would also like to thank my Ph.D. committee, Professors Gaurav Sukhatme and Cauligi Raghavendra, for finding time in their hectic schedules to provide supervision, support and insightful comments on the thesis. My sincere thanks go to Dr. Arquimedes Canedo for providing me an opportunity to join his intern team and widening the scope of my research. I thank Dr. Jure Leskovec and Dr. Steven Skiena for engaging and inspiring me to get new ideas for the research. I also thank my colleagues Anna Sapienza, Ashok Deb, Nitin Kamra, Ayush Jaiswal, Chung Ming Cheung, Homa Hosseinmardi, Sahil Garg, Rob Brekelmans and Shuyang Gao for collaborating in various research endeavors. I also thank Sujit Chhetri for the insightful discussions we had during our internship together. I thank my fellow lab-mates for engaging me in stimulating discussions. Last but not least, I would like to thank my family and my friends, especially my father, who helped me to carry on through my Ph.D.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
  1.1 Graph Analysis
  1.2 Graph Embedding
  1.3 Challenges
  1.4 Thesis Contribution

Chapter 2: Survey of Techniques
  2.1 Graph Embedding Research Context and Evolution
  2.2 Definitions and Preliminaries
  2.3 A Taxonomy of Graph Embedding Methods
    2.3.1 Factorization based Methods
      2.3.1.1 Locally Linear Embedding (LLE)
      2.3.1.2 Laplacian Eigenmaps
      2.3.1.3 Cauchy Graph Embedding
      2.3.1.4 Structure Preserving Embedding (SPE)
      2.3.1.5 Graph Factorization (GF)
      2.3.1.6 GraRep
      2.3.1.7 HOPE
      2.3.1.8 Additional Variants
    2.3.2 Random Walk based Methods
      2.3.2.1 DeepWalk
      2.3.2.2 node2vec
      2.3.2.3 Hierarchical Representation Learning for Networks (HARP)
      2.3.2.4 Walklets
      2.3.2.5 Additional Variants
    2.3.3 Deep Learning based Methods
      2.3.3.1 Structural Deep Network Embedding (SDNE)
      2.3.3.2 Deep Neural Networks for Learning Graph Representations (DNGR)
      2.3.3.3 Graph Convolutional Networks (GCN)
      2.3.3.4 Variational Graph Auto-Encoders (VGAE)
    2.3.4 Other Methods
      2.3.4.1 LINE
    2.3.5 Discussion
  2.4 Applications
    2.4.1 Network Compression
    2.4.2 Visualization
    2.4.3 Clustering
    2.4.4 Link Prediction
    2.4.5 Node Classification

Chapter 3: Universality of Graph Embedding
  3.1 Introduction
  3.2 Experimental Setup
    3.2.1 Datasets
    3.2.2 Evaluation Metrics
  3.3 Experiments and Analysis
    3.3.1 Graph Reconstruction
    3.3.2 Visualization
    3.3.3 Link Prediction
    3.3.4 Node Classification
    3.3.5 Hyper-parameter Sensitivity
    3.3.6 Discussion
  3.4 A Python Library for Graph Embedding

Chapter 4: Benchmarks for Graph Embedding Evaluation
  4.1 Introduction
    4.1.1 Challenges
    4.1.2 Contributions
    4.1.3 Organization
  4.2 Notations and Background
    4.2.1 Notations
    4.2.2 Graph Embedding Methods
      4.2.2.1 Factorization based approaches
      4.2.2.2 Random walk approaches
      4.2.2.3 Neural network approaches
  4.3 GEM-BEN: Graph Embedding Methods Benchmark
    4.3.1 Real Graphs
    4.3.2 Evaluation Metrics
    4.3.3 GFS-score
    4.3.4 Link Prediction Baselines
    4.3.5 Embedding Approaches
  4.4 Experiments and Analysis
    4.4.1 Domain Performance
    4.4.2 Sensitivity to Graph Size
    4.4.3 Sensitivity to Graph Density
    4.4.4 Sensitivity to Embedding Dimension
  4.5 Python Library for GEM-BEN
  4.6 Results on Synthetic Graphs
    4.6.1 Synthetic Graphs
    4.6.2 Synthetic Graph Dataset
    4.6.3 Results
      4.6.3.1 Domain Performance
      4.6.3.2 Sensitivity to Graph Size
      4.6.3.3 Sensitivity to Average Node Degree
      4.6.3.4 Sensitivity to Embedding Dimension
  4.7 Conclusion and Future Work

Chapter 5: Embedding Networks with Edge Attributes
  5.1 Introduction
  5.2 Related Work
    5.2.1 Vanilla Network Embedding
    5.2.2 Attributed Network Embedding
  5.3 Problem Statement
  5.4 ELAINE
    5.4.1 Variational Autoencoder
    5.4.2 Higher Order Proximity and Role Preservations
      5.4.2.1 Random Walks
      5.4.2.2 Role preserving features
    5.4.3 Incorporating edge labels
      5.4.3.1 Neighborhood and social role reconstruction
      5.4.3.2 Edge label/attributes reconstruction
      5.4.3.3 Regularization
    5.4.4 Optimization
    5.4.5 Complexity Analysis
  5.5 Experiments
    5.5.1 Datasets
    5.5.2 Baselines
    5.5.3 Evaluation Metrics
    5.5.4 Parameter settings
  5.6 Results and Analysis
    5.6.1 Link Prediction
    5.6.2 Node Classification
    5.6.3 Effect of Each Component
    5.6.4 Hyperparameter Sensitivity
    5.6.5 Discussion

Chapter 6: Updating Embedding in Dynamic Networks
  6.1 Introduction
  6.2 Definitions and Preliminaries
  6.3 DynGEM: Dynamic Graph Embedding Model
    6.3.1 Handling growing graphs
    6.3.2 Loss function and training
    6.3.3 Stability by reusing previous step embedding
    6.3.4 Techniques for scalability
  6.4 Experiments
    6.4.1 Datasets
    6.4.2 Algorithms and Evaluation Metrics
  6.5 Results and Analysis
    6.5.1 Graph Reconstruction
    6.5.2 Link Prediction
    6.5.3 Stability of Embedding Methods
    6.5.4 Visualization
    6.5.5 Application to Anomaly Detection
    6.5.6 Effect of Layer Expansion
    6.5.7 Scalability
  6.6 Conclusion

Chapter 7: Capturing Dynamics of Networks using Graph Embedding
  7.1 Introduction
  7.2 Related Work
    7.2.1 Static Graph Embedding
    7.2.2 Dynamic Graph Embedding
    7.2.3 Dynamic Link Prediction
  7.3 Motivating Example
  7.4 Methodology
    7.4.1 Problem Statement
    7.4.2 dyngraph2vec
    7.4.3 Optimization
  7.5 Experiments
    7.5.1 Datasets
    7.5.2 Baselines
    7.5.3 Evaluation Metrics
  7.6 Results and Analysis
    7.6.1 SBM Dataset
    7.6.2 Hep-th Dataset
    7.6.3 AS Dataset
    7.6.4 MAP exploration
    7.6.5 Hyper-parameter Sensitivity: Lookback
    7.6.6 Length of training sequence versus MAP value
  7.7 Discussion
  7.8 Future Work
  7.9 Conclusion

Chapter 8: Graph Embedding for Optimal Team Composition
  8.1 Introduction
  8.2 Data Collection and Preprocessing
  8.3 Skill Inference
  8.4 Network generation
    8.4.1 Short-term Performance Network
    8.4.2 Long-term Performance Network
    8.4.3 LCC and network properties
  8.5 Performance Prediction
    8.5.1 Problem Formulation
    8.5.2 Network Modeling
      8.5.2.1 Factorization
      8.5.2.2 Traditional Autoencoder
      8.5.2.3 Teammate Autoencoder
    8.5.3 Evaluation Framework
      8.5.3.1 Experimental Setting
      8.5.3.2 Evaluation Metrics
    8.5.4 Results and Analysis
  8.6 Related Work
  8.7 Conclusions

Chapter 9: Conclusion

Reference List

List of Tables

2.1 Summary of notation
3.1 Dataset Statistics
3.2 Summary of strengths and weaknesses of evaluated methods
4.1 Summary of notation
4.2 Average and standard deviation of GFS-score
5.1 Summary of notation
5.2 Dataset Statistics
5.3 Common interests of authors in Hep-th
5.4 Effect of each component on link prediction for Hep-th
6.1 Notations for deep autoencoder
6.2 Statistics of datasets. |V|, |E| and T denote the number of nodes, edges and length of the time series, respectively
6.3 Average MAP of graph reconstruction
6.4 Average MAP of link prediction
6.5 Stability constants K_S(F) of embeddings on dynamic graphs
6.6 Computation time of embedding methods for the first forty time steps on each dataset. Speedup_exp is the expected speedup based on model parameters
6.7 Computation time of embedding methods on SYN while varying the length of the time series T
7.1 Dataset Statistics
7.2 Average MAP values over different embedding sizes
8.1 Comparison of the overall performance networks' characteristics and their LCC. Note that the number of nodes and links are the same for both the Short-term Performance Network (SPN) and the Long-term Performance Network (LPN), while the range of weights varies from one case to the other
8.2 Average and standard deviation of player performance prediction (MSE) and teammate recommendation (MANE) for d = 1,024 in both SPN and LPN

List of Figures

2.1 Examples illustrating the effect of the type of similarity preserved. Here, CPE and SPE stand for Community Preserving Embedding and Structural-equivalence Preserving Embedding, respectively
3.1 Precision@k of graph reconstruction for different data sets (dimension of embedding is 128)
3.2 MAP of graph reconstruction for different data sets with varying dimensions
3.3 Visualization of SBM using t-SNE (original dimension of embedding is 128). Each point corresponds to a node in the graph. The color of a node denotes its community
3.4 Visualization of the Karate club graph. Each point corresponds to a node in the graph. Each node is embedded in a 2-dimensional space using the corresponding embedding method
3.5 Precision@k of link prediction for different data sets (dimension of embedding is 128)
3.6 MAP of link prediction for different data sets with varying dimensions
3.7 Micro-F1 and Macro-F1 of node classification for different data sets varying the train-test split ratio (dimension of embedding is 128)
3.8 Micro-F1 and Macro-F1 of node classification for different data sets varying the number of dimensions. The train-test split is 50%
3.9 Effect of the regularization coefficient in Graph Factorization on various tasks
3.10 Effect of the attenuation factor in HOPE on various tasks
3.11 Effect of the observed link reconstruction weight in SDNE on various tasks
3.12 Effect of random walk bias weights in node2vec on various tasks
3.13 Effect of random walk bias weights in node2vec on SBM
4.1 Real graph properties
4.2 Performance evaluation of different methods varying the attributes of graphs. The x axis denotes the dimension of embedding, whereas the y axis denotes the MAP scores
4.3 Performance evaluation of different methods varying the attributes of graphs. The x axis denotes the dimension of embedding, whereas the y axis denotes the P@100 scores
4.4 Benchmark Synthetic plot
4.5 Benchmark Synthetic plot
4.6 Benchmark plot for individual synthetic graphs
4.7 Benchmark plot for individual synthetic graphs
5.1 Users i and j are both engaged in work, family and sport topics. Aggregation of their topics over different interactions will cause loss of valuable information
5.2 Traditional deep autoencoder model
5.3 Edge label aware embedding model. ELAINE extracts higher order relations between nodes using random walks and social role based features. The coupled autoencoder jointly optimizes these features and edge attributes to obtain a unified representation
5.4 Importance of capturing higher order proximity and social roles. (a) Nodes i and j are more similar in (ii) compared to (i), but first order proximity fails to capture this. (b) Nodes i and k have similar roles, but they are far apart in the network. Using social role indicative statistical features can capture the similarity of these nodes
5.5 Precision@k and MAP of link prediction for different data sets
5.6 Node classification results for different data sets
5.7 Effect of the coefficient of edge label reconstruction on link prediction MAP
5.8 Effect of embedding dimensions on link prediction MAP. It shows that link prediction performance peaks at 128
6.1 DynGEM: Dynamic Graph Embedding Model. The figure shows two snapshots of a dynamic graph and the corresponding deep autoencoder at each step
6.2 2D visualization of 100-dimensional embeddings for the SYN dataset, when nodes change their communities over a time step. A point in any plot represents the embedding of a node in the graph, with the color indicating the node community. Small (big) points are nodes which didn't (did) change communities. Each big point is colored according to its final community color
6.3 Results of anomaly detection on Enron and visualization of embeddings for weeks 93, 94 and 101 on the Enron dataset
7.1 User A breaks ties with his friend at each time step and befriends a friend of a friend. Such temporal patterns require knowledge across multiple time steps for accurate prediction
7.2 Motivating example of network evolution - community shift
7.3 Motivating example of network evolution - community shift (for clarity, only showing 50 of 500 nodes and 2 out of 10 migrating nodes)
7.4 dyngraph2vec architecture variations for dynamic graph embedding
7.5 MAP values for the SBM dataset
7.6 MAP values for the Hep-th dataset
7.7 MAP values for the AS dataset
7.8 Mean MAP values for various lookback numbers for the Hep-th dataset
7.9 Mean MAP values for various lookback numbers for the AS dataset
7.10 MAP value with increasing amount of temporal graphs added in the training data for the Hep-th dataset (lookback = 8)
7.11 MAP value with increasing amount of temporal graphs added in the training data for the AS dataset (lookback = 8)
8.1 Distribution of the number of matches per player in the Dota 2 dataset
8.2 TrueSkill timelines of players in the top, bottom, and median decile. Lines show the mean of TrueSkill values at each match index, while shades indicate the related standard deviations
8.3 Kendall's tau coefficient distribution computed by comparing each player's ranking in the short-term and long-term performance networks
8.4 Distribution of the number of occurrences per link, i.e. the number of times a couple of teammates play together
8.5 An example of a deep autoencoder model
8.6 Evaluation Framework: The co-play network is divided into training and test networks. The parameters of the models are learned using the training network. We obtain multiple test subnetworks by using a random walk sampling with random restart and input the nodes of these subnetworks to the models for prediction. The predicted weights are then evaluated against the test link weights to obtain various metrics
8.7 Distribution of the weights of the network sampled by using random walk
8.8 Short-term Performance Network. (a) Mean Squared Error (MSE) gain of models over average prediction. (b) Mean Absolute Normalized Error (MANE) gain of models over average prediction. (c) AvgRec@k of models
8.9 Long-term Performance Network. (a) Mean Squared Error (MSE) gain of models over average prediction. (b) Mean Absolute Normalized Error (MANE) gain of models over average prediction. (c) AvgRec@k of models
Abstract

Learning low-dimensional representations of nodes in a graph has recently gained attention due to its wide applicability in network tasks such as graph visualization, link prediction, node classification and clustering. The models proposed often preserve certain characteristic properties of the graph and are tested on these tasks. In this thesis, I propose to extend the graph embedding work in three directions. Firstly, I yield insights into the existing models and study the universality of the learned embeddings and embedding methods. Specifically, I study the dependence of task performance on the model hyperparameters of the embedding method. I also analyze the characteristics of models required for each network task. Further, I propose a benchmark to evaluate any graph embedding approach and draw insights from it. Secondly, I propose an extension of the graph embedding approach which can capture edge attributes of a graph. I show that capturing such attributes can be useful in link prediction, and propose a model to learn edge attributes along with higher order proximity and social roles. Thirdly, I build models which can update embeddings efficiently for streaming graphs and can capture temporal patterns in sequential graphs.

Chapter 1: Introduction

1.1 Graph Analysis

Graph analysis has been attracting increasing attention in recent years due to the ubiquity of networks in the real world. Graphs (a.k.a. networks) have been used to denote information in various areas including biology (protein-protein interaction networks) [203], social sciences (friendship networks) [64] and linguistics (word co-occurrence networks) [97]. Modeling the interactions between entities as graphs has enabled researchers to understand the various network systems in a systematic manner [128]. For example, social networks have been used for applications like friendship or content recommendation, as well as for advertisement [135].

Graph analytic tasks can be broadly abstracted into the following four categories: (a) node classification [17], (b) link prediction [135], (c) clustering [50], and (d) visualization [142]. Node classification aims at determining the label of nodes (a.k.a. vertices) based on other labeled nodes and the topology of the network. Link prediction refers to the task of predicting missing links or links that are likely to occur in the future. Clustering is used to find subsets of similar nodes and group them together; finally, visualization helps in providing insights into the structure of the network.

1.2 Graph Embedding

In the past few decades, many methods have been proposed for the tasks defined above. For node classification, there are broadly two categories of approaches: methods which use random walks to propagate the labels [8, 10], and methods which extract features from nodes and apply classifiers on them [18, 139]. Approaches for link prediction include similarity based methods [99, 1], maximum likelihood models [41, 219], and probabilistic models [65, 83]. Clustering methods include attribute based models [242] and methods which directly maximize (resp., minimize) the inter-cluster (resp., intra-cluster) distances [50, 189]. This survey will provide a taxonomy that captures these application domains and the existing strategies. Typically, a model defined to solve graph-based problems either operates on the original graph adjacency matrix or on a derived vector space.
Recently, methods based on representing networks in vector space, while preserving their properties, have become widely popular [3, 198, 213]. Obtaining such an embedding is useful in the tasks defined above.[1] The embeddings are input as features to a model and the parameters are learned based on the training data. This obviates the need for complex classification models which are applied directly on the graph.

[1] The term graph embedding has been used in the literature in two ways: to represent an entire graph in vector space [225, 229], or to represent each individual node in vector space [140, 171]. In this thesis, we use the latter definition, since such representations can be used for tasks like node classification, differently from the former representation. The former definition, embedding the entire graph, has many interesting applications including pattern recognition [175], computer vision [221], and document analysis [24]. Interested readers are referred to Yan et al. [225] for further reading.

1.3 Challenges

The major challenges in the graph embedding domain are as follows:

Incomplete understanding of the learned embeddings: Over the past decade, several embedding methods have been proposed which aim to preserve varying properties. However, understanding the primary differences between them and recommending embedding methods for a new graph is still challenging.

Oversimplification of real world data: Real world data is often multi-modal and contains more information than is captured by the vertices and edges of a graph. Recent methods have also focused on capturing node attributes in network embeddings, but edge attributes, which may have an impact on understanding the network, remain widely unstudied.

Lack of embedding models for dynamic graphs: Most recent methods focus on a snapshot of a graph. Real world networks often evolve with time as new nodes and edges are added. Capturing these dynamics is non-trivial and challenging.

1.4 Thesis Contribution

The major contributions of this thesis are as follows:

Graph embedding survey and benchmark: I yield insights into the existing models and study the universality of the learned embeddings and embedding methods. Chapters 2, 3 and 4 cover this.

Model capturing edge attributes: I propose an extension of the graph embedding approach which can capture edge attributes of a graph. Chapter 5 covers this.

Dynamic graph embedding models: I build models which can update embeddings efficiently for streaming graphs and can capture temporal patterns in sequential graphs. Chapters 6 and 7 introduce these models.

Application to teammate recommendation: I further show an application to teammate recommendation to illustrate the utility of embedding in this domain. Chapter 8 covers this.

Chapter 2: Survey of Techniques

In the past decade, there has been a lot of research in the field of graph embedding, with a focus on designing new embedding algorithms. More recently, researchers have pushed forward scalable embedding algorithms that can be applied on graphs with millions of nodes and edges. In the following, we provide historical context about the research progress in this domain (§2.1), then propose a taxonomy of graph embedding techniques (§2.3) covering (i) factorization methods (§2.3.1), (ii) random walk techniques (§2.3.2), (iii) deep learning (§2.3.3), and (iv) other miscellaneous strategies (§2.3.4).

2.1 Graph Embedding Research Context and Evolution

In the early 2000s, researchers developed graph embedding algorithms as part of dimensionality reduction techniques.
They would construct a similarity graph for a set of $n$ $D$-dimensional points based on neighborhood, and then embed the nodes of the graph in a $d$-dimensional vector space, where $d \ll D$. The idea for embedding was to keep connected nodes closer to each other in the vector space. Laplacian Eigenmaps [14] and Locally Linear Embedding (LLE) [178] are examples of algorithms based on this rationale. However, scalability is a major issue in this approach, whose time complexity is $O(|V|^2)$.

Since 2010, research on graph embedding has shifted to obtaining scalable graph embedding techniques which leverage the sparsity of real-world networks. For example, Graph Factorization [3] uses an approximate factorization of the adjacency matrix as the embedding. LINE [198] extends this approach and attempts to preserve both first-order and second-order proximities. HOPE [158] extends LINE and attempts to preserve high-order proximity by decomposing a similarity matrix rather than the adjacency matrix, using a generalized Singular Value Decomposition (SVD). SDNE [213] uses autoencoders to embed graph nodes and capture highly non-linear dependencies. These new scalable approaches have a time complexity of $O(|E|)$.

2.2 Definitions and Preliminaries

We represent the set $\{1, \ldots, n\}$ by $[n]$ in the rest of the thesis. We start by formally defining several preliminaries, which are defined similarly to Wang et al. [213].

Definition 1 (Graph). A graph $G(V, E)$ is a collection of $V = \{v_1, \ldots, v_n\}$ vertices (a.k.a. nodes) and $E = \{e_{ij}\}_{i,j=1}^{n}$ edges. The adjacency matrix $S$ of graph $G$ contains non-negative weights associated with each edge: $s_{ij} \geq 0$. If $v_i$ and $v_j$ are not connected to each other, then $s_{ij} = 0$. For undirected weighted graphs, $s_{ij} = s_{ji}$ $\forall i, j \in [n]$.

The edge weight $s_{ij}$ is generally treated as a measure of similarity between the nodes $v_i$ and $v_j$. The higher the edge weight, the more similar the two nodes are expected to be.

Definition 2 (First-order proximity). Edge weights $s_{ij}$ are also called first-order proximities between nodes $v_i$ and $v_j$, since they are the first and foremost measures of similarity between two nodes.

We can similarly define higher-order proximities between nodes. For instance,

Definition 3 (Second-order proximity). The second-order proximity between a pair of nodes describes the proximity of the pair's neighborhood structure. Let $s_i = [s_{i1}, \ldots, s_{in}]$ denote the first-order proximity between $v_i$ and other nodes. Then, the second-order proximity between $v_i$ and $v_j$ is determined by the similarity of $s_i$ and $s_j$.

Second-order proximity compares the neighborhoods of two nodes and treats them as similar if they have a similar neighborhood. It is possible to define higher-order proximities using other metrics, e.g. Katz Index, Rooted PageRank, Common Neighbors, Adamic-Adar, etc. (for detailed definitions, omitted here in the interest of space, see Ou et al. [158]). Next, we define a graph embedding:

Definition 4 (Graph embedding). Given a graph $G = (V, E)$, a graph embedding is a mapping $f: v_i \to y_i \in \mathbb{R}^d$ $\forall i \in [n]$, such that $d \ll |V|$ and the function $f$ preserves some proximity measure defined on graph $G$.

An embedding therefore maps each node to a low-dimensional feature vector and tries to preserve the connection strengths between vertices. For instance, an embedding preserving first-order proximity might be obtained by minimizing $\sum_{i,j} s_{ij} \| y_i - y_j \|_2^2$. Let two node pairs $(v_i, v_j)$ and $(v_i, v_k)$ be associated with connection strengths such that $s_{ij} > s_{ik}$. In this case, $v_i$ and $v_j$ will be mapped to points in the embedding space that are closer to each other than the mapping of $v_i$ and $v_k$.
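To make the first-order objective above concrete, the following minimal sketch (an editorial illustration, not a method from the thesis) embeds a small graph by minimizing $\sum_{i,j} s_{ij} \| y_i - y_j \|_2^2$ under an orthogonality constraint. The solution is the classic spectral one, i.e., the bottom eigenvectors of the unnormalized graph Laplacian; the function name and the toy path graph are assumptions made here for illustration.

```python
import numpy as np

def first_order_embedding(S, d):
    """Minimize sum_ij s_ij * ||y_i - y_j||^2 subject to Y^T Y = I.
    The minimizers are eigenvectors of the Laplacian L = D - S with the
    smallest eigenvalues; the constant eigenvector is trivial and discarded."""
    D = np.diag(S.sum(axis=1))           # degree matrix
    L = D - S                            # unnormalized graph Laplacian
    _, eigvecs = np.linalg.eigh(L)       # eigenvalues returned in ascending order
    return eigvecs[:, 1:d + 1]           # skip the constant eigenvector

# Toy 4-node path graph 0-1-2-3: adjacent nodes end up close in R^2.
S = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = first_order_embedding(S, d=2)
```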
2.3 A Taxonomy of Graph Embedding Methods

We propose a taxonomy of embedding approaches. We categorize the embedding methods into three broad categories: (1) factorization based, (2) random walk based, and (3) deep learning based. Below we explain the characteristics of each of these categories and provide a summary of a few representative approaches for each category, using the notation presented in Table 2.1.

Table 2.1: Summary of notation

  G            Graphical representation of the data
  V            Set of vertices in the graph
  E            Set of edges in the graph
  d            Number of dimensions
  Y            Embedding of the graph, |V| x d
  Y_i          Embedding of node v_i, 1 x d (also the i-th row of Y)
  Y_s          Source embedding of a directed graph, |V| x d
  Y_t          Target embedding of a directed graph, |V| x d
  W            Adjacency matrix of the graph, |V| x |V|
  D            Diagonal matrix of the degree of each vertex, |V| x |V|
  L            Graph Laplacian (L = D - W), |V| x |V|
  <Y_i, Y_j>   Inner product of Y_i and Y_j, i.e. Y_i Y_j^T
  S            Similarity matrix of the graph, |V| x |V|

2.3.1 Factorization based Methods

Factorization based algorithms represent the connections between nodes in the form of a matrix and factorize this matrix to obtain the embedding. The matrices used to represent the connections include the node adjacency matrix, Laplacian matrix, node transition probability matrix, and Katz similarity matrix, among others. Approaches to factorize the representative matrix vary based on the matrix properties. If the obtained matrix is positive semidefinite, e.g. the Laplacian matrix, one can use eigenvalue decomposition. For unstructured matrices, one can use gradient descent methods to obtain the embedding in linear time.

2.3.1.1 Locally Linear Embedding (LLE)

LLE [178] assumes that every node is a linear combination of its neighbors in the embedding space. If we assume that the adjacency matrix element $W_{ij}$ of graph $G$ represents the weight of node $j$ in the representation of node $i$, we define

$Y_i \approx \sum_j W_{ij} Y_j \quad \forall i \in V.$

Hence, we can obtain the embedding $Y^{N \times d}$ by minimizing

$\phi(Y) = \sum_i \big| Y_i - \sum_j W_{ij} Y_j \big|^2.$

To remove degenerate solutions, the variance of the embedding is constrained as $\frac{1}{N} Y^T Y = I$. To further remove translational invariance, the embedding is centered around zero: $\sum_i Y_i = 0$. The above constrained optimization problem can be reduced to an eigenvalue problem, whose solution is to take the bottom $d + 1$ eigenvectors of the sparse matrix $(I - W)^T (I - W)$ and discard the eigenvector corresponding to the smallest eigenvalue.

2.3.1.2 Laplacian Eigenmaps

Laplacian Eigenmaps [14] aims to keep the embeddings of two nodes close when the weight $W_{ij}$ is high. Specifically, it minimizes the following objective function

$\phi(Y) = \frac{1}{2} \sum_{i,j} | Y_i - Y_j |^2 W_{ij} = tr(Y^T L Y),$

where $L$ is the Laplacian of graph $G$. The objective function is subjected to the constraint $Y^T D Y = I$ to eliminate the trivial solution. The solution can be obtained by taking the eigenvectors corresponding to the $d$ smallest eigenvalues of the normalized Laplacian, $L_{norm} = D^{-1/2} L D^{-1/2}$.

2.3.1.3 Cauchy Graph Embedding

Laplacian Eigenmaps uses a quadratic penalty function on the distance between embeddings. The objective function thus emphasizes preservation of dissimilarity between nodes more than their similarity.
This may yield embeddings which do not preserve local topology, which can be defined as the equality between the relative order of edge weights ($W_{ij}$) and the inverse order of distances in the embedded space ($|Y_i - Y_j|^2$). Cauchy Graph Embedding [140] tackles this problem by replacing the quadratic function $|Y_i - Y_j|^2$ with $\frac{|Y_i - Y_j|^2}{|Y_i - Y_j|^2 + \sigma^2}$. Upon rearrangement, the objective function to be maximized becomes

$\phi(Y) = \sum_{i,j} \frac{W_{ij}}{|Y_i - Y_j|^2 + \sigma^2},$

with constraints $Y^T Y = I$ and $\sum_i Y_i = 0$ for each $i$. The new objective is an inverse function of distance and thus puts emphasis on similar nodes rather than dissimilar nodes. The authors propose several variants, including Gaussian, Exponential and Linear embeddings, with varying relative emphasis on the distance between nodes.

2.3.1.4 Structure Preserving Embedding (SPE)

Structure Preserving Embedding [188] is another approach which extends Laplacian Eigenmaps. SPE aims to reconstruct the input graph exactly. The embedding is stored as a positive semidefinite kernel matrix $K$, and a connectivity algorithm $\mathcal{G}$ is defined which reconstructs the graph from $K$. The kernel $K$ is chosen such that it maximizes $tr(KW)$, which attempts to recover a rank-1 spectral embedding. The choice of the connectivity algorithm $\mathcal{G}$ induces constraints on this objective function. For example, if the connectivity scheme is to connect each node to the neighbors which lie within a ball of radius $\epsilon$, the constraint $(K_{ii} + K_{jj} - 2K_{ij})(W_{ij} - 1/2) \leq \epsilon (W_{ij} - 1/2)$ produces a kernel which can perfectly reconstruct the original graph. To handle noise in the graph, a slack variable is added. For $\epsilon$-connectivity, the optimization thus becomes $\max\ tr(KW) - C\xi$, subject to the above constraint relaxed by the slack variable $\xi$, where $C$ controls the slackness.

2.3.1.5 Graph Factorization (GF)

To the best of our knowledge, Graph Factorization [3] was the first method to obtain a graph embedding in $O(|E|)$ time. To obtain the embedding, GF factorizes the adjacency matrix of the graph, minimizing the following loss function

$\phi(Y, \lambda) = \frac{1}{2} \sum_{(i,j) \in E} (W_{ij} - \langle Y_i, Y_j \rangle)^2 + \frac{\lambda}{2} \sum_i \| Y_i \|^2,$

where $\lambda$ is a regularization coefficient. Note that the summation is over the observed edges as opposed to all possible edges. This is an approximation in the interest of scalability, and as such it may introduce noise in the solution. Note that as the adjacency matrix is often not positive semidefinite, the minimum of the loss function is greater than 0 even if the dimensionality of the embedding is $|V|$.

2.3.1.6 GraRep

GraRep [28] defines the node transition probability as $T = D^{-1} W$ and preserves $k$-order proximity by minimizing $\| X^k - Y_s^k {Y_t^k}^T \|_F^2$, where $X^k$ is derived from $T^k$ (refer to [28] for a detailed derivation). It then concatenates $Y_s^k$ for all $k$ to form $Y_s$. Note that this is similar to HOPE [158], which minimizes $\| S - Y_s Y_t^T \|_F^2$ where $S$ is an appropriate similarity matrix. The drawback of GraRep is scalability, since $T^k$ can have $O(|V|^2)$ non-zero entries.

2.3.1.7 HOPE

HOPE [158] preserves higher order proximity by minimizing $\| S - Y_s Y_t^T \|_F^2$, where $S$ is the similarity matrix. The authors experimented with different similarity measures, including the Katz Index, Rooted PageRank, Common Neighbors, and Adamic-Adar score. They represented each similarity measure as $S = M_g^{-1} M_l$, where both $M_g$ and $M_l$ are sparse. This enables HOPE to use a generalized Singular Value Decomposition (SVD) [211] to obtain the embedding efficiently.
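As a rough illustration of the factorization paradigm, the sketch below computes a HOPE-style embedding with the Katz similarity $S = (I - \beta W)^{-1} \beta W$ and a plain truncated SVD on a small dense matrix. It is an assumption-laden simplification: the published method avoids materializing $S$ by exploiting the sparse $M_g^{-1} M_l$ form with a generalized SVD, and the decay parameter beta here is an arbitrary illustrative value (it must be smaller than the reciprocal of the spectral radius of $W$).

```python
import numpy as np

def katz_hope_embedding(W, d, beta=0.01):
    """Sketch: factorize the Katz proximity matrix so that S ~= Y_s @ Y_t.T.
    W is a dense (n x n) adjacency matrix; beta < 1 / spectral_radius(W)."""
    n = W.shape[0]
    S = np.linalg.inv(np.eye(n) - beta * W) @ (beta * W)   # Katz index
    U, sigma, Vt = np.linalg.svd(S)
    scale = np.sqrt(sigma[:d])
    Y_s = U[:, :d] * scale            # source embedding
    Y_t = Vt[:d, :].T * scale         # target embedding
    return Y_s, Y_t
```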
2.3.1.8 Additional Variants

For the purpose of dimensionality reduction of high dimensional data, several other methods have been developed that are capable of performing graph embedding. Yan et al. [225] survey a list of such methods including Principal Component Analysis (PCA) [106], Linear Discriminant Analysis (LDA) [144], ISOMAP [202], Multidimensional Scaling (MDS) [119], Locality Preserving Projections (LPP) [82] and Kernel Eigenmaps [20]. Martinez et al. [145] proposed a general framework, non-negative graph embedding, which yields non-negative embeddings for these algorithms.

A number of recent techniques have focused on jointly learning the network structure and additional node attribute information available for the network. Augmented Relation Embedding (ARE) [136] augments the network with content based features for images and modifies the graph Laplacian to capture such information. Text-associated DeepWalk (TADW) [226] performs matrix factorization on the node similarity matrix, disentangling the representation using a text feature matrix. Heterogeneous Network Embedding (HNE) [33] learns a representation for each modality of the network and then unifies them into a common space using linear transformations. Other works [207, 238, 95] perform similar transformations between various node attributes and learn a joint embedding.

2.3.2 Random Walk based Methods

Random walks have been used to approximate many properties of a graph, including node centrality [155] and similarity [62]. They are especially useful when one can only partially observe the graph, or the graph is too large to measure in its entirety. Embedding techniques using random walks on graphs to obtain node representations have been proposed: DeepWalk and node2vec are two examples.

2.3.2.1 DeepWalk

DeepWalk [169] preserves higher-order proximity between nodes by maximizing the probability of observing the last $k$ nodes and the next $k$ nodes in the random walk centered at $v_i$, i.e. maximizing $\log \Pr(v_{i-k}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+k} \mid Y_i)$, where $2k + 1$ is the length of the random walk. The model generates multiple random walks, each of length $2k + 1$, and performs the optimization over the sum of log-likelihoods for each random walk. A dot-product based decoder is used to reconstruct the edges from the node embeddings.

2.3.2.2 node2vec

Similar to DeepWalk [169], node2vec [78] preserves higher-order proximity between nodes by maximizing the probability of occurrence of subsequent nodes in fixed length random walks. The crucial difference from DeepWalk is that node2vec employs biased random walks that provide a trade-off between breadth-first (BFS) and depth-first (DFS) graph searches, and hence produces higher-quality and more informative embeddings than DeepWalk. Choosing the right balance enables node2vec to preserve community structure as well as structural equivalence between nodes.
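To illustrate the random-walk paradigm just described, here is a minimal DeepWalk-style sketch (an editorial example, not the reference implementation): uniform random walks are generated with networkx and fed to gensim's skip-gram Word2Vec. The walk count, walk length, window size and dimensionality are arbitrary illustrative values, and the parameter names assume the gensim 4.x API.

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_length=40):
    """Generate DeepWalk-style uniform random walks, one list of node ids per walk."""
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])   # gensim expects token strings
    return walks

G = nx.karate_club_graph()
walks = random_walks(G)
# Skip-gram (sg=1) over the walks plays the role of DeepWalk's objective.
model = Word2Vec(walks, vector_size=128, window=5, sg=1, min_count=0, workers=2)
embedding = {n: model.wv[str(n)] for n in G.nodes()}
```

Replacing the uniform `random.choice` step with p/q-biased transition probabilities would turn this walker into a node2vec-style one.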
2.3.2.3 Hierarchical Representation Learning for Networks (HARP)

DeepWalk and node2vec initialize the node embeddings randomly for training the models. As their objective function is non-convex, such initializations can get stuck in local optima. HARP [34] introduces a strategy to improve the solution and avoid local optima by better weight initialization. To this purpose, HARP creates a hierarchy of nodes by aggregating the nodes of the previous layer of the hierarchy using graph coarsening. It then generates an embedding of the coarsest graph and initializes the node embeddings of the refined graph (one level up in the hierarchy) with the learned embedding. It propagates such embeddings through the hierarchy to obtain the embeddings of the original graph. Thus HARP can be used in conjunction with random walk based methods like DeepWalk and node2vec to obtain better solutions to the optimization function.

2.3.2.4 Walklets

DeepWalk and node2vec implicitly preserve higher order proximity between nodes by generating multiple random walks which, due to their stochastic nature, connect nodes at various distances. Factorization based approaches like GF and HOPE, on the other hand, explicitly preserve distances between nodes by modeling them in their objective function. Walklets [170] combine this idea of explicit modeling with random walks. The model modifies the random walk strategy used in DeepWalk by skipping over some nodes in the graph. This is performed for multiple skip lengths, analogous to factorizing $A^k$ in GraRep, and the resulting set of random walks is used for training the model, similarly to DeepWalk.

2.3.2.5 Additional Variants

There have been several variations of the above methods proposed recently. Similar to augmenting the graph structure with node attributes for factorization based methods, GenVector [230], Discriminative Deep Random Walk (DDRW) [132], Tri-party Deep Network Representation (TriDNR) [161] and [229] extend random walks to jointly learn the network structure and node attributes.

2.3.3 Deep Learning based Methods

The growing research on deep learning has led to a deluge of deep neural network based methods applied to graphs [213, 30, 157]. Deep autoencoders have been used for dimensionality reduction [15] due to their ability to model non-linear structure in the data. Recently, SDNE [213] and DNGR [30] utilized this ability of deep autoencoders to generate embedding models that can capture non-linearity in graphs.

2.3.3.1 Structural Deep Network Embedding (SDNE)

Wang et al. [213] proposed to use deep autoencoders to preserve the first and second order network proximities. They achieve this by jointly optimizing the two proximities. The approach uses highly non-linear functions to obtain the embedding. The model consists of two parts: unsupervised and supervised. The former consists of an autoencoder aiming at finding an embedding for a node which can reconstruct its neighborhood. The latter is based on Laplacian Eigenmaps [14], which applies a penalty when similar vertices are mapped far from each other in the embedding space.

2.3.3.2 Deep Neural Networks for Learning Graph Representations (DNGR)

DNGR combines random surfing with a deep autoencoder. The model consists of three components: random surfing, positive pointwise mutual information (PPMI) calculation, and stacked denoising autoencoders. The random surfing model is used on the input graph to generate a probabilistic co-occurrence matrix, analogous to the similarity matrix in HOPE. The matrix is transformed into a PPMI matrix and input into a stacked denoising autoencoder to obtain the embedding. Inputting the PPMI matrix ensures that the autoencoder model can capture higher order proximity. Furthermore, using stacked denoising autoencoders aids the robustness of the model in the presence of noise in the graph, as well as in capturing the underlying structure required for tasks such as link prediction and node classification.
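The following toy PyTorch sketch conveys the autoencoder idea behind SDNE without reproducing the published architecture or hyperparameters (the layer sizes, the edge-weighting factor beta and the mixing weight alpha are assumptions chosen for illustration): each adjacency row is encoded and reconstructed for second-order proximity, and a Laplacian-eigenmaps-style penalty pulls the embeddings of connected nodes together for first-order proximity.

```python
import torch
import torch.nn as nn

class GraphAutoencoder(nn.Module):
    """Encode each adjacency row into d dimensions and reconstruct it."""
    def __init__(self, n_nodes, d):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_nodes, 256), nn.ReLU(), nn.Linear(256, d))
        self.decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, n_nodes))

    def forward(self, adj_rows):
        y = self.encoder(adj_rows)
        return y, self.decoder(y)

def sdne_style_loss(adj, y, recon, beta=5.0, alpha=1e-4):
    # Second-order term: reconstruction error, with observed edges up-weighted by beta.
    weight = torch.ones_like(adj) + (beta - 1.0) * (adj > 0).float()
    second_order = ((recon - adj) ** 2 * weight).sum()
    # First-order term: sum_ij w_ij * ||y_i - y_j||^2 pulls connected nodes together.
    dist_sq = (y.unsqueeze(1) - y.unsqueeze(0)).pow(2).sum(dim=-1)
    first_order = (adj * dist_sq).sum()
    return second_order + alpha * first_order

# Toy usage on a random symmetric 34-node graph, trained in one full batch.
A = torch.rand(34, 34)
A = ((A + A.T) > 1.5).float()
A.fill_diagonal_(0)
model = GraphAutoencoder(n_nodes=34, d=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    y, recon = model(A)
    loss = sdne_style_loss(A, y, recon)
    opt.zero_grad()
    loss.backward()
    opt.step()
embedding = model.encoder(A).detach()    # final d-dimensional node embeddings
```

Unlike the published SDNE architecture, this sketch uses a single small hidden layer and trains on the full adjacency matrix at once, which is only feasible for toy graphs.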
2.3.3.3 Graph Convolutional Networks (GCN)

The deep neural network based methods discussed above, namely SDNE and DNGR, take as input the global neighborhood of each node (a row of the PPMI matrix for DNGR and a row of the adjacency matrix for SDNE). This can be computationally expensive and suboptimal for large sparse graphs. Graph Convolutional Networks (GCNs) [112] tackle this problem by defining a convolution operator on the graph. The model iteratively aggregates the embeddings of the neighbors of a node and uses a function of the obtained embedding and its embedding at the previous iteration to obtain the new embedding. Aggregating the embeddings of only the local neighborhood makes it scalable, and multiple iterations allow the learned embedding of a node to characterize its global neighborhood.

Several recent papers [23, 84, 55, 133, 47, 80] have proposed methods using convolution on graphs to obtain semi-supervised embeddings, which can be used to obtain unsupervised embeddings by defining unique labels for each node. The approaches vary in the construction of the convolutional filters, which can broadly be categorized into spatial and spectral filters. Spatial filters operate directly on the original graph and adjacency matrix, whereas spectral filters operate on the spectrum of the graph Laplacian.

2.3.3.4 Variational Graph Auto-Encoders (VGAE)

Kipf et al. [113] evaluate the performance of variational autoencoders [111] on the task of graph embedding. The model uses a graph convolutional network (GCN) encoder and an inner product decoder. The input is the adjacency matrix, and the model relies on the GCN to learn the higher order dependencies between nodes. They empirically show that using variational autoencoders can improve performance compared to non-probabilistic autoencoders.

2.3.4 Other Methods

2.3.4.1 LINE

LINE [198] explicitly defines two functions, one each for first- and second-order proximities, and minimizes the combination of the two. The function for first-order proximity is similar to that of Graph Factorization (GF) [3] in that they both aim to keep the adjacency matrix and the dot product of the embeddings close. The difference is that GF does this by directly minimizing the difference of the two. Instead, LINE defines two joint probability distributions for each pair of vertices, one using the adjacency matrix and the other using the embedding. Then, LINE minimizes the Kullback-Leibler (KL) divergence of these two distributions. The two distributions and the objective function are as follows:

$p_1(v_i, v_j) = \frac{1}{1 + \exp(-\langle Y_i, Y_j \rangle)},$

$\hat{p}_1(v_i, v_j) = \frac{W_{ij}}{\sum_{(i,j) \in E} W_{ij}},$

$O_1 = KL(\hat{p}_1, p_1) = -\sum_{(i,j) \in E} W_{ij} \log p_1(v_i, v_j).$

The authors similarly define probability distributions and an objective function for the second-order proximity.
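A compact sketch of optimizing LINE's first-order objective $O_1$ by gradient descent follows. It is illustrative only: the published method additionally trains a second-order objective with negative sampling, which is omitted here, and the toy edge list and hyperparameters are assumptions.

```python
import torch

def line_first_order_loss(Y, edges, weights):
    """O1 = -sum_{(i,j) in E} w_ij * log sigmoid(<Y_i, Y_j>)  (first-order LINE)."""
    src, dst = edges[:, 0], edges[:, 1]
    inner = (Y[src] * Y[dst]).sum(dim=1)
    return -(weights * torch.nn.functional.logsigmoid(inner)).sum()

# Toy usage: 5 nodes on a path, unit edge weights, embeddings learned directly.
Y = torch.randn(5, 8, requires_grad=True)
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 4]])
weights = torch.ones(len(edges))
opt = torch.optim.Adam([Y], lr=0.05)
for _ in range(100):
    loss = line_first_order_loss(Y, edges, weights)
    opt.zero_grad()
    loss.backward()
    opt.step()
```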
2.3.5 Discussion

[Figure 2.1 appears here with six panels: (a) Graph G1, (b) CPE for G1, (c) SPE for G1, (d) Graph G2, (e) CPE for G2, (f) SPE for G2.]

Figure 2.1: Examples illustrating the effect of the type of similarity preserved. Here, CPE and SPE stand for Community Preserving Embedding and Structural-equivalence Preserving Embedding, respectively.

We can interpret embeddings as representations which describe graph data. Thus, embeddings can yield insights into the properties of a network. We illustrate this in Figure 2.1. Consider a complete bipartite graph $G_1$. An embedding algorithm which attempts to keep two connected nodes close (i.e., preserve the community structure) would fail to capture the structure of the graph, as shown in 2.1(b). However, an algorithm which embeds structurally-equivalent nodes together learns an interpretable embedding, as shown in 2.1(c). Similarly, in 2.1(d) we consider a graph with two star components connected through a hub. Nodes 1 and 3 are structurally equivalent (they link to the same nodes) and are clustered together in 2.1(f), whereas in 2.1(e) they are far apart.

The classes of algorithms above can be described in terms of their ability to explain the properties of graphs. Factorization-based methods are not capable of learning an arbitrary function, e.g., to explain network connectivity. Thus, unless explicitly included in their objective function, they cannot learn structural equivalence. In random walk based methods, the mixture of equivalences can be controlled to a certain extent by varying the random walk parameters. Deep learning methods can model a wide range of functions following the universal approximation theorem [89]: given enough parameters, they can learn the mix of community and structural equivalence to embed the nodes such that the reconstruction error is minimized. We can interpret the weights of the autoencoder as a representation of the structure of the graph. For example, 2.1(c) plots the embedding learned by SDNE for the complete bipartite graph $G_1$. The autoencoder stored the bipartite structure in its weights and achieved perfect reconstruction. Given the variety of properties of real-world graphs, using general non-linear models that span a large class of functions is a promising direction that warrants further exploration.

2.4 Applications

As graph representations, embeddings can be used in a variety of tasks. These applications can be broadly classified as: network compression (§2.4.1), visualization (§2.4.2), clustering (§2.4.3), link prediction (§2.4.4), and node classification (§2.4.5).

2.4.1 Network Compression

Feder et al. [59] introduced the concept of network compression (a.k.a. graph simplification). For a graph G, they defined a compression of G with a smaller number of edges. The goal was to store the network more efficiently and run graph analysis algorithms faster. They obtained the compression graph by partitioning the original graph into bipartite cliques and replacing them by trees, thus reducing the number of edges. Over the years, many researchers have used aggregation based methods [162, 205, 206] to compress graphs. The main idea in this line of work is to exploit the link structure of the graph to group nodes and edges. Navlakha et al. [151] used the Minimum Description Length (MDL) [177] principle from information theory to summarize a graph into a graph summary and edge corrections.

Similar to these representations, a graph embedding can also be interpreted as a summarization of the graph. Wang et al. [213] and Ou et al. [158] tested this hypothesis explicitly by reconstructing the original graph from the embedding and evaluating the reconstruction error. They show that a low dimensional representation for each node (on the order of 100 dimensions) suffices to reconstruct the graph with high precision.
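As a hedged illustration of the reconstruction experiment mentioned above, the snippet below scores every node pair by the dot product of its embeddings and reports what fraction of the top-k pairs are true edges; it is a simplified precision@k over ordered pairs, and the thesis's exact evaluation protocol may differ.

```python
import numpy as np

def reconstruction_precision_at_k(A, Y, k=100):
    """Fraction of the k highest-scoring node pairs (by Y_i . Y_j) that are edges of A."""
    scores = Y @ Y.T
    np.fill_diagonal(scores, -np.inf)                    # ignore self-pairs
    top = np.argsort(scores, axis=None)[::-1][:k]        # top-k flat indices
    rows, cols = np.unravel_index(top, scores.shape)
    return float(np.mean(A[rows, cols] > 0))
```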
2.4.2 Visualization
The application of graph visualization dates back to 1736, when Euler used it to solve the "Konigsberger Bruckenproblem" [107]. In recent years, graph visualization has found applications in software engineering [67], electrical circuits [49], biology [203] and sociology [64]. Battista et al. [49] and Eades et al. [56] survey a range of methods used to draw graphs and define aesthetic criteria for this purpose. Herman et al. [87] generalize this and view it from an information visualization perspective. They study and compare various traditional layouts used to draw graphs, including tree-, 3D- and hyperbolic-based layouts.
As an embedding represents a graph in a vector space, dimensionality reduction techniques like Principal Component Analysis (PCA) [167] and t-distributed stochastic neighbor embedding (t-SNE) [142] can be applied to it to visualize the graph. The authors of DeepWalk [169] illustrated the goodness of their embedding approach by visualizing Zachary's Karate Club network. The authors of LINE [198] visualized the DBLP co-authorship network, and showed that LINE is able to cluster together authors in the same field. The authors of SDNE [213] applied it to the 20-Newsgroup document similarity network to obtain clusters of documents based on topics.
2.4.3 Clustering
Graph clustering (a.k.a. network partitioning) can be of two types: (a) structure based, and (b) attribute based clustering. The former can be further divided into two categories, namely community based and structurally equivalent clustering. Structure-based methods [50, 189, 156] aim to find dense subgraphs with a high number of intra-cluster edges and a low number of inter-cluster edges. Structural equivalence clustering [224], on the contrary, is designed to identify nodes with similar roles (such as bridges and outliers). Attribute based methods [242] utilize node labels, in addition to observed links, to cluster nodes.
White et al. [220] used k-means on the embedding to cluster the nodes and visualized the clusters obtained on the Wordnet and NCAA data sets, verifying that the clusters obtained have an intuitive interpretation. Recent methods on embedding have not explicitly evaluated their models on this task, and thus it remains a promising field of research in the graph embedding community.
2.4.4 Link Prediction
Networks are constructed from observed interactions between entities, which may be incomplete or inaccurate. The challenge often lies in identifying spurious interactions and predicting missing information. Link prediction refers to the task of predicting either missing interactions or links that may appear in the future in an evolving network. Link prediction is pervasive in biological network analysis, where verifying the existence of links between nodes requires costly experimental tests; limiting the experiments to links ordered by presence likelihood has been shown to be very cost effective. In social networks, link prediction is used to predict probable friendships, which can be used for recommendation and lead to a more satisfactory user experience. Liben-Nowell et al. [135], Lu et al. [138] and Hasan et al. [6] survey the recent progress in this field and categorize the algorithms into (a) similarity based (local and global) [99, 1, 108], (b) maximum likelihood based [41, 219] and (c) probabilistic methods [65, 83, 234].
Embeddings capture the inherent dynamics of the network either explicitly or implicitly, thus enabling their application to link prediction. Wang et al. [213] and Ou et al. [158] predict links from the learned node representations on publicly available collaboration and social networks. In addition, Grover et al. [78] apply it to biology networks. They show that on these data sets links predicted using embeddings are more accurate than those from the traditional similarity based link prediction methods described above.
2.4.5 Node Classification
Often in networks, a fraction of the nodes are labeled.
In social networks, labels may indicate interests, beliefs, or demographics. In language networks, a document may be labeled with topics or keywords, whereas the labels of entities in biology networks may be based on functionality. Due to various factors, labels may be unknown for large fractions of nodes. For example, in social networks many users do not provide their demographic information due to privacy concerns. Missing labels can be inferred using the labeled nodes and the links in the network. The task of predicting these missing labels is also known as node classification.
Bhagat et al. [17] survey the methods used in the literature for this task. They classify the approaches into two categories, i.e., feature extraction based and random walk based. Feature-based models [18, 139, 152] generate features for nodes based on their neighborhood and local network statistics and then apply a classifier like logistic regression [90] or naive Bayes [146] to predict the labels. Random walk based models [8, 10] propagate the labels with random walks. Embeddings can be interpreted as automatically extracted node features based on network structure and thus fall into the first category. Recent work [169, 198, 158, 213, 78] has evaluated the predictive power of embeddings on various information networks including language, social, biology and collaboration graphs. They show that embeddings can predict missing labels with high precision.

Chapter 3
Universality of Graph Embedding

3.1 Introduction
Graph embedding methods have been shown to yield good performance on various network tasks. However, the dependence of the embeddings on the desired application has not been studied. In this chapter, we aim to answer the following two questions about graph embeddings: (i) how universal are the embeddings?, and (ii) how universal are the embedding methods? The former deals with the relationship between the embedding learned by an approach and the output task. For example, in node2vec, a community biased setting of the random walk parameters may yield better performance on link prediction, whereas a structure biased setting may improve node classification performance. We aim to characterize this dependence between model hyperparameters and the downstream task. Second, we delve into the universality of the embedding methods and analyze the behavior of various embedding methods. For this purpose, we evaluate the state-of-the-art models on various real and synthetic networks, vary the hyperparameters, and report the evaluation results.
3.2 Experimental Setup
Our experiments evaluate the feature representations obtained using the methods reviewed before on the previous four application domains. Next, we specify the datasets and evaluation metrics we used. The experiments were performed on an Ubuntu 14.04.4 LTS system with 32 cores, 128 GB RAM and a clock speed of 2.6 GHz. The GPU used for the deep network based models was an Nvidia Tesla K40C.
3.2.1 Datasets

Table 3.1: Dataset Statistics

  Name          Domain          |V|         |E|         Avg. degree   No. of labels
  SYN-SBM       Synthetic       1,024       29,833      58.27         3
  KARATE        Social          34          78          4.59          4
  BLOGCATALOG   Social          10,312      333,983     64.78         39
  YOUTUBE       Social          1,157,827   4,945,382   8.54          47
  HEP-TH        Collaboration   7,980       21,036      5.27          -
  ASTRO-PH      Collaboration   18,772      396,160     31.55         -
  PPI           Biology         3,890       38,739      19.91         50

We evaluate the embedding approaches on one synthetic and six real datasets. The datasets are summarized in Table 3.1.
SYN-SBM: We generate a synthetic graph using the Stochastic Block Model [214] with 1024 nodes and 3 communities.
We set the in-block and cross-block probabilities to 0.1 and 0.01, respectively. As we know the community structure in this graph, we use it to visualize the embeddings learnt by the various approaches.
KARATE [235]: Zachary's karate network is a well-known social network of a university karate club. It has been widely studied in social network analysis. The network has 34 nodes, 78 edges and 2 communities.
BLOGCATALOG [199]: This is a network of social relationships of the bloggers listed on the BlogCatalog website. The labels represent blogger interests inferred through the metadata provided by the bloggers. The network has 10,312 nodes, 333,983 edges and 39 different labels.
YOUTUBE [200]: This is a social network of Youtube users. It is a large network containing 1,157,827 nodes and 4,945,382 edges. The labels represent groups of users who enjoy common video genres.
HEP-TH [68]: The original dataset contains abstracts of papers in High Energy Physics Theory for the period from January 1993 to April 2003. We create a collaboration network for the papers published in this period. The network has 7,980 nodes and 21,036 edges.
ASTRO-PH [129]: This is a collaboration network of authors of papers submitted to the e-print arXiv during the period from January 1993 to April 2003. The network has 18,772 nodes and 396,160 edges.
PROTEIN-PROTEIN INTERACTIONS (PPI) [22]: This is a network of biological interactions between proteins in humans. The network has 3,890 nodes and 38,739 edges.
3.2.2 Evaluation Metrics
To evaluate the performance of embedding methods on graph reconstruction and link prediction, we use Precision at k (Pr@k) and Mean Average Precision (MAP) as our metrics. For node classification, we use micro-F1 and macro-F1. These metrics are defined as follows:
Pr@k is the fraction of correct predictions in the top k predictions. It is defined as

Pr@k = \frac{|E_{pred}(1:k) \cap E_{obs}|}{k},

where E_{pred}(1:k) are the top k predictions and E_{obs} are the observed edges. For the task of graph reconstruction, E_{obs} = E, and for link prediction, E_{obs} is the set of hidden edges.
MAP estimates the precision for every node and computes the average over all nodes, as follows:

MAP = \frac{\sum_i AP(i)}{|V|}, \quad \text{where} \quad AP(i) = \frac{\sum_k Pr@k(i) \cdot \mathbb{I}\{E_{pred_i}(k) \in E_{obs_i}\}}{|\{k : E_{pred_i}(k) \in E_{obs_i}\}|}, \quad Pr@k(i) = \frac{|E_{pred_i}(1:k) \cap E_{obs_i}|}{k},

and E_{pred_i} and E_{obs_i} are the predicted and observed edges for node i, respectively.
macro-F1, in a multi-label classification task, is defined as the average F1 of all the labels, i.e.,

macro\text{-}F1 = \frac{\sum_{l \in \mathcal{L}} F1(l)}{|\mathcal{L}|},

where F1(l) is the F1-score for label l.
micro-F1 calculates F1 globally by counting the total true positives, false negatives and false positives, giving equal weight to each instance. It is defined as

micro\text{-}F1 = \frac{2PR}{P + R}, \quad \text{where} \quad P = \frac{\sum_{l \in \mathcal{L}} TP(l)}{\sum_{l \in \mathcal{L}} (TP(l) + FP(l))}, \quad R = \frac{\sum_{l \in \mathcal{L}} TP(l)}{\sum_{l \in \mathcal{L}} (TP(l) + FN(l))}

are the precision (P) and recall (R), respectively, and TP(l), FP(l) and FN(l) denote the number of true positives, false positives and false negatives, respectively, among the instances associated with label l either in the ground truth or in the predictions.
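A minimal sketch of the two link prediction metrics, assuming hypothetical dictionary inputs, is given below. Pr@k follows the definition above; the MAP sketch averages AP only over the nodes being evaluated, a small simplification of the formula.

```python
import numpy as np

def precision_at_k(ranked_pairs, observed_edges, k):
    """Pr@k: fraction of the top-k predicted pairs that are observed edges.
    Pairs are assumed to be stored with a consistent (u, v) orientation."""
    return sum(1 for pair in ranked_pairs[:k] if pair in observed_edges) / k

def mean_average_precision(predictions, observed):
    """MAP: `predictions[u]` is a list of candidate neighbors of u ranked by
    score, `observed[u]` the set of true (held-out) neighbors of u."""
    ap_values = []
    for u, ranked in predictions.items():
        hits, precisions = 0, []
        for k, v in enumerate(ranked, start=1):
            if v in observed.get(u, set()):
                hits += 1
                precisions.append(hits / k)   # Pr@k(u) at each correct position
        ap_values.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(ap_values)) if ap_values else 0.0
```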
3.3 Experiments and Analysis
In this section, we evaluate and compare the embedding methods on the four tasks presented above. For each task, we show the effect of the number of embedding dimensions on the performance and compare the hyper-parameter sensitivity of the methods. Furthermore, we correlate the performance of embedding techniques on the various tasks while varying hyper-parameters, to test the notion of an "all-good" embedding which can perform well on all tasks.
3.3.1 Graph Reconstruction
Embeddings, as a low-dimensional representation of the graph, are expected to accurately reconstruct the graph. Note that reconstruction differs for different embedding techniques. For each method, we reconstruct the proximity of nodes and rank pairs of nodes according to their proximity. We then calculate the ratio of real links in the top k predictions as the reconstruction precision. As the number of possible node pairs (N(N-1)) can be very large for networks with a large number of nodes, we randomly sample 1024 nodes for evaluation. We obtain 5 such samples for each dataset and calculate the mean and standard deviation of the precision and MAP values for subgraph reconstruction. To obtain the optimal hyper-parameters for each embedding method, we compare the mean MAP values for each hyper-parameter. We then re-run the experiments with the optimal hyper-parameters on 5 different random samples of 1024 nodes and report the results.
[Figure 3.1: Precision@k of graph reconstruction for different data sets (dimension of embedding is 128). Panels: SBM, PPI, AstroPh, BlogCatalog, Hep-th, Youtube; methods: node2vec, GF, SDNE, HOPE, LE.]
[Figure 3.2: MAP of graph reconstruction for different data sets with varying dimensions.]
Figure 3.1 illustrates the reconstruction precision obtained by 128-dimensional embeddings. We observe that although the performance of the methods is dataset dependent, embedding approaches which preserve higher order proximities in general outperform the others. The exceptional performance of Laplacian Eigenmaps on SBM can be attributed to the lack of higher order structure in that data set. We also observe that SDNE consistently performs well on all data sets. This can be attributed to its capability of learning complex structure from the network. Embeddings learned by node2vec have low reconstruction precision. This may be due to the highly non-linear dimensionality reduction yielding a non-linear manifold. However, HOPE, which learns linear embeddings but preserves higher order proximity, reconstructs the graph well without any additional parameters.
Effect of dimension. Figure 3.2 illustrates the effect of dimension on the reconstruction error. With a couple of exceptions, as the number of dimensions increases, the MAP value increases. This is intuitive, as a higher number of dimensions is capable of storing more information. We also observe that SDNE is able to embed the graphs in a 16-dimensional vector space with high precision, although decoder parameters are required to obtain such precision.
3.3.2 Visualization
Since an embedding is a low-dimensional vector representation of the nodes in the graph, it allows us to visualize the nodes to understand the network topology.
As different embedding methods preserve different structures in the network, their ability to visualize nodes, and the interpretation of the visualization, differ.
[Figure 3.3: Visualization of SBM using t-SNE (original dimension of embedding is 128). Each point corresponds to a node in the graph. The color of a node denotes its community. Panels: (a) LLE, (b) GF, (c) node2vec, (d) HOPE, (e) SDNE, (f) LE.]
[Figure 3.4: Visualization of the Karate club graph. Each point corresponds to a node in the graph. Each node is embedded in a 2-dimensional space using the corresponding embedding method. Panels: (a) LLE, (b) GF, (c) node2vec, (d) HOPE, (e) SDNE, (f) LE.]
For instance, embeddings learned by node2vec with parameters set to prefer a BFS random walk would cluster structurally equivalent nodes together. On the other hand, methods which directly preserve k-hop distances between nodes (GF, LE and LLE with k = 1, and HOPE and SDNE with k > 1) cluster neighboring nodes together.
We compare the ability of different methods to visualize nodes on the SBM and Karate graphs. For SBM, following [213], we learn a 128-dimensional embedding for each method and input it to t-SNE [142] to reduce the dimensionality to 2 and visualize the nodes in a 2-dimensional space. The visualization of SBM is shown in Figure 3.3. As we know the underlying community structure, we use the community label to color the nodes. We observe that embeddings generated by HOPE and SDNE, which preserve higher order proximities, separate the communities well, although, as the data is well structured, LE, GF and LLE are able to capture the community structure to some extent.
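A minimal sketch of this visualization pipeline, assuming a |V| x d embedding matrix Y and an array of community labels (both hypothetical inputs), using scikit-learn's t-SNE:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_embedding_2d(Y, communities):
    """Project a d-dimensional node embedding to 2-D with t-SNE and color
    each point by its community label (a sketch of the setup above)."""
    Y2 = TSNE(n_components=2).fit_transform(Y)
    plt.scatter(Y2[:, 0], Y2[:, 1], c=communities, cmap='tab10', s=15)
    plt.show()
```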
We visualize the Karate graph (see Figure 3.4) to illustrate the properties preserved by the embedding methods. LLE and LE ((a) and (f)) attempt to preserve the community structure of the graph and cluster nodes with high intra-cluster edges together. GF ((b)) embeds communities very closely and keeps leaf nodes far away from other nodes. In (d), we observe that HOPE embeds nodes 16 and 21, whose Katz similarity in the original graph is very low (0.0006), farthest apart (considering dot product similarity). node2vec and SDNE ((c) and (e)) preserve a mix of the community structure and the structural properties of the nodes. Nodes 32 and 33, which are both high degree hubs and central in their communities, are embedded together and away from low degree nodes. Also, they are closer to nodes which belong to their communities. SDNE embeds node 0, which acts as a bridge between communities, far away from other nodes. Note that, unlike for other methods, this does not imply that node 0 is disconnected from the rest of the nodes. The implication here is that SDNE identifies node 0 as a separate type of node and encodes its connections to other nodes in the encoder and decoder. The ability of deep autoencoders to identify important nodes in the network has not been studied, but given this observation we believe this direction can be promising.
[Figure 3.5: Precision@k of link prediction for different data sets (dimension of embedding is 128). Panels: PPI, AstroPh, BlogCatalog, Hep-th; methods: node2vec, GF, SDNE, HOPE, LE.]
[Figure 3.6: MAP of link prediction for different data sets with varying dimensions.]
3.3.3 Link Prediction
Another important application of graph embedding is predicting unobserved links in the graph. A good network representation should capture the inherent structure of the graph well enough to predict likely but unobserved links. To test the performance of different embedding methods on this task, for each data set we randomly hide 20% of the network edges. We learn the embedding using the remaining 80% of the edges and, from the learned embedding, predict the most likely edges which are not observed in the training data. As with graph reconstruction, we generate 5 random subgraphs with 1024 nodes and test the predicted links against the held-out links in the subgraphs. We perform this experiment for each hyper-parameter and re-run it for the optimal hyper-parameters on a new random 80-20 link split.
Figures 3.5 and 3.6 show the precision@k results for link prediction with 128-dimensional embeddings and the MAP for each dimension, respectively. Here we can see that the performance of the methods is highly data set dependent. node2vec achieves good performance on BlogCatalog but performs poorly on the other data sets. HOPE achieves good performance on all data sets, which implies that preserving higher order proximities is conducive to predicting unobserved links. Similarly, SDNE outperforms the other methods, with the exception of PPI, for which the performance degrades drastically as the embedding dimension increases above 8.
Effect of dimension. Figure 3.6 illustrates the effect of the embedding dimension on link prediction. We make two observations. First, in PPI and BlogCatalog, unlike graph reconstruction, performance does not improve as the number of dimensions increases. This may be because with more parameters the models overfit on the observed links and are unable to predict unobserved links. Second, even on the same data set, the relative performance of the methods depends on the embedding dimension. In PPI, HOPE outperforms the other methods for higher dimensions, whereas the embedding generated by SDNE achieves higher link prediction MAP for low dimensions.
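A minimal sketch of this evaluation protocol is given below. It assumes a networkx graph and an embedding matrix Y whose i-th row corresponds to the i-th node of the training graph (hypothetical inputs), and it omits the subgraph sampling and hyper-parameter search described above. Precision@k against the hidden edge set then follows the Pr@k definition from Section 3.2.2.

```python
import random
import numpy as np
import networkx as nx

def link_prediction_split(G, hidden_fraction=0.2, seed=42):
    """Hide a fraction of edges; return the training graph and the held-out edges."""
    random.seed(seed)
    edges = list(G.edges())
    hidden = random.sample(edges, int(hidden_fraction * len(edges)))
    G_train = G.copy()
    G_train.remove_edges_from(hidden)
    return G_train, set(hidden)

def top_predicted_links(Y, G_train, k=100):
    """Score unobserved node pairs by embedding dot product and return the top k.
    Y[i] is assumed to be the embedding of the i-th node in G_train.nodes()."""
    nodes = list(G_train.nodes())
    scores = []
    for i, u in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            v = nodes[j]
            if not G_train.has_edge(u, v):
                scores.append((float(np.dot(Y[i], Y[j])), (u, v)))
    scores.sort(reverse=True)
    return [pair for _, pair in scores[:k]]
```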
3.3.4 Node Classification
[Figure 3.7: Micro-F1 and Macro-F1 of node classification for different data sets varying the train-test split ratio (dimension of embedding is 128). Panels: SBM, PPI, BlogCatalog; methods: node2vec, GF, SDNE, HOPE, LE.]
[Figure 3.8: Micro-F1 and Macro-F1 of node classification for different data sets varying the number of dimensions. The train-test split is 50%.]
Predicting node labels using the network topology is widely popular in network analysis and has a variety of applications, including document classification and interest prediction. A good network embedding should capture the network structure and hence be useful for node classification. We compare the effectiveness of the embedding methods on this task by using the generated embedding as node features to classify the nodes. The node features are input to a one-vs-rest logistic regression using the LIBLINEAR library. For each data set, we randomly sample 10% to 90% of the nodes as training data and evaluate the performance on the remaining nodes. We perform this split 5 times and report the mean with a confidence interval. For data sets with multiple labels per node, we assume that we know how many labels to predict.
Figure 3.7 shows the results of our experiments. We can see that node2vec outperforms the other methods on the task of node classification. As mentioned earlier, node2vec preserves homophily as well as structural equivalence between nodes. The results suggest this can be useful in node classification: e.g., in BlogCatalog users may have similar interests, yet connect to others based on social ties rather than interest overlap. Similarly, proteins in PPI may be related in functionality and interact with similar proteins, but may not assist each other. However, in SBM, the other methods outperform node2vec, as the labels reflect communities and there is no structural equivalence between nodes.
Effect of dimension. Figure 3.8 illustrates the effect of the embedding dimension on node classification. As with link prediction, we observe that performance often saturates or deteriorates after a certain number of dimensions. This may suggest overfitting on the training data. As SBM exhibits very structured communities, an 8-dimensional embedding suffices to predict the communities. node2vec achieves its best performance on PPI and BlogCatalog with 128 dimensions.
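The classification protocol described above can be sketched as follows; the function name and the simplified single-split evaluation are assumptions, scikit-learn's liblinear solver stands in for the LIBLINEAR setup, and the multi-label "predict the known number of labels" step is omitted for brevity.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def classify_nodes(Y, labels, train_ratio=0.5, seed=0):
    """Train a one-vs-rest logistic regression on embedding features Y and
    report micro/macro F1 on the held-out nodes (a simplified sketch)."""
    X_train, X_test, y_train, y_test = train_test_split(
        Y, labels, train_size=train_ratio, random_state=seed)
    clf = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return (f1_score(y_test, y_pred, average='micro'),
            f1_score(y_test, y_pred, average='macro'))
```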
3.3.5 Hyper-parameter Sensitivity
[Figure 3.9: Effect of the regularization coefficient in Graph Factorization on various tasks. Panels: (a) Graph Reconstruction, (b) Link Prediction, (c) Node Classification; data sets: Hep-th, PPI, SBM.]
[Figure 3.10: Effect of the attenuation factor (β) in HOPE on various tasks. Panels: (a) Graph Reconstruction, (b) Link Prediction, (c) Node Classification.]
[Figure 3.11: Effect of the observed link reconstruction weight in SDNE on various tasks. Panels: (a) Graph Reconstruction, (b) Link Prediction, (c) Node Classification.]
[Figure 3.12: Effect of the random walk bias weights in node2vec on various tasks. Panels: top row varies q, bottom row varies p, for (a, d) Graph Reconstruction, (b, e) Link Prediction and (c, f) Node Classification.]
In this section we address the following questions: How robust are the embedding methods with respect to hyper-parameters? Do the optimal hyper-parameters depend on the downstream tasks the embeddings are used for? What insights does the performance variance with hyper-parameters provide about a data set?
[Figure 3.13: Effect of the random walk bias weights in node2vec on SBM.]
We answer these questions by analyzing the performance of the embedding methods with various hyper-parameters. We present results on SBM, PPI and Hep-th. Of the methods we evaluated in this survey, Laplacian Eigenmaps has no hyper-parameters and is thus not included in the following analysis.
Graph Factorization (GF). The objective function of GF contains a weight regularization term with a coefficient. Intuitively, this coefficient controls the generalizability of the embedding. A low regularization coefficient facilitates better reconstruction but may overfit to the observed graph, leading to poor prediction performance. On the other side, a high regularization may under-represent the data and hence perform poorly on all tasks. We observe this effect in Figure 3.9. We see that performance on the prediction tasks, namely link prediction (Fig. 3.9b) and node classification (Fig. 3.9c), improves as we increase the regularization coefficient, reaches a peak and then starts deteriorating. However, graph reconstruction performance (Fig. 3.9a) may deteriorate with increasing regularization. We also note that the performance change is considerable, and thus the coefficient should be carefully tuned to the given data set.
HOPE. As HOPE factorizes a similarity matrix between nodes to obtain the embedding, the hyper-parameters depend on the method used to obtain the similarity matrix. Since in our experiments we used the Katz index for this purpose, we evaluate the effect on performance of the attenuation factor (β), which can be interpreted as the higher order proximity coefficient. The graph structure affects the optimal value of this parameter. For well-structured graphs with tightly knit communities, high values of β would erroneously place dissimilar nodes closer in the embedding space.
On the contrary, for graphs with a weak community structure, it is important to capture higher order distances, and thus high values of β may yield better embeddings. We validate this hypothesis in Figure 3.10. As our synthetic SBM data consists of tight communities, increasing β does not improve the performance on any task. However, a gain in performance with increasing β can be observed in PPI and HEP-TH. This is expected, as collaboration and protein networks tend to evolve via higher order connections. We observe that the optimal β for these data sets is lowest for the task of graph reconstruction (Fig. 3.10a) and highest for link prediction (Fig. 3.10b), 2^{-6} and 2^{-4} respectively. This also follows intuition, as higher order information is not necessary for reconstructing observed edges but is useful for prediction tasks. Nodes farther apart are more likely to share labels than to be directly linked. For example, in a collaboration network, authors in a field will form a weak community, but many of them need not be connected to each other (in other words, most researchers of the same community won't be co-authors of one another). Thus, if we were to predict the field of an author, having information about other authors inside the weak community can help improve the accuracy of our model.
SDNE. SDNE uses a coupled deep autoencoder to embed graphs. It leverages a parameter to control the weight of observed and unobserved link reconstruction in the objective function. A high parameter value would focus on reconstructing observed links, disregarding the absence of unobserved links. This parameter can be crucial, as a low value could hinder predicting hidden links. Figure 3.11 illustrates the results of our analysis. We observe that performance on link prediction (Fig. 3.11b) varies greatly depending on the weight, with a more than 3-fold increase in MAP for Hep-th with the optimal parameter value, and about a 2-fold increase for PPI. The performance on node classification (Fig. 3.11c), on the other hand, is less affected by the parameter. Overall, we see that the maximum performance is achieved for an intermediate value of the weight, above which the performance drops.
node2vec. node2vec performs biased random walks on the graph and embeds nodes that commonly appear together in them close in the embedding space. Of the various hyper-parameters of the method (which include walk length, context size and bias weights), we analyze the effect of the bias weights on performance and adopt commonly-used values for the remaining hyper-parameters [78], namely a context size of 10 and a walk length of 80. node2vec has two bias weights: (a) the inout parameter q, which controls the likelihood of the random walk to go further away from the incoming node (higher values favor closer nodes), and (b) the return parameter p, which weighs the return probability (lower values favor returning). Lower values of q help capture longer distances between nodes and aim towards preserving structural equivalence. Figure 3.12 illustrates the effect on PPI and HEP-TH. In node classification (Fig. 3.12c) we observe that low q values help achieve higher accuracy, suggesting that capturing structural equivalence is required to accurately embed the structure of the graph. On the contrary, high q values favor link prediction (Fig. 3.12b), following the intuition that capturing community structure is crucial for that task. We make similar observations for SBM in Figure 3.13 for the task of link prediction.
MAP increases with increasing q until it reaches an optimum. We also note that the optimal values of q in SBM are much higher, as the graph has a strong community structure.
3.3.6 Discussion
We summarize the strengths and weaknesses of the evaluated methods in Table 3.2.

Table 3.2: Summary of strengths and weaknesses of the evaluated methods (✓ = yes, ✗ = no).

                                     SDNE   HOPE   node2vec   GF   LE
  Robust to graph structure           ✓      ✗        ✓       ✗    ✗
  Tunable for downstream tasks        ✓      ✗        ✓       ✗    ✗
  Few hyper-parameters                ✗      ✓        ✗       ✓    ✓
  Low hyper-parameter sensitivity     ✗      ✗        ✗       ✗    ✓
  Scalable                            ✗      ✓        ✓       ✓    ✓

In our experiments, we observed that SDNE performs well on most of the data sets for all downstream tasks. Since SDNE does not make assumptions about the graph structure and uses the information provided in the graph to learn a suitable autoencoder, it can be tuned for almost any given graph and task. Moreover, for different downstream tasks, the hyper-parameters can be set separately to perform well on each task. However, SDNE has many hyper-parameters and the performance depends significantly on their settings. The large number of parameters also makes it unsuitable for large graphs.
We also observed that node2vec performs well on most tasks, with the best performance in node classification. The values of the random walk bias weights can be set for the task at hand. The method is linear in the number of nodes and the dimension of the embedding and is thus scalable. Moreover, it can be parallelized over a cluster by performing the random walks on separate machines.
HOPE and GF have very few hyper-parameters to tune, but the performance of these models depends largely on the graph structure. Although the regularization parameter can act as a trade-off between graph reconstruction and link prediction, the methods do not capture structural equivalence, which can be crucial for node classification and visualization. Laplacian Eigenmaps is a hyper-parameter-free method which can be used to quickly assess performance on the tasks. However, its performance is not comparable to the rest of the methods and it cannot be tuned for a task.
3.4 A Python Library for Graph Embedding
We released an open-source Python library, GEM (Graph Embedding Methods, https://github.com/palash1992/GEM), which provides a unified interface to the implementations of all the methods presented here, and to their evaluation metrics. The library supports both weighted and unweighted graphs. GEM's hierarchical design and modular implementation should help users test the implemented methods on new datasets, as well as serve as a platform to develop new approaches with ease.
GEM provides implementations of Locally Linear Embedding [178], Laplacian Eigenmaps [14], Graph Factorization [3], HOPE [158], SDNE [213] and node2vec [78]. For node2vec, we use the C++ implementation provided by the authors [129] and provide a Python interface. In addition, GEM provides an interface to evaluate the learned embedding on the four tasks presented above. The interface is flexible and supports multiple edge reconstruction metrics, including cosine similarity, Euclidean distance and decoder-based metrics (for autoencoder-based models). For multi-labeled node classification, the library uses one-vs-rest logistic regression classifiers and supports the use of other ad hoc classifiers.

Chapter 4
Benchmarks for Graph Embedding Evaluation

4.1 Introduction
Graphs are a natural way to represent relationships and interactions between entities in real systems.
For example, people on social networks, proteins in biological networks, and authors in publication networks can be represented by nodes in a graph, and their relationships, such as friendships, protein-protein interactions, and co-authorship, are represented by the edges of the graph. These graphical models enable us to understand the behavior of systems and to gain insight into their structure. These insights can further be used to predict future interactions and missing information in the system. These tasks are formally defined as link prediction and node classification. Link prediction estimates the likelihood of a relationship between two entities. This is used, for example, to recommend friends on social networks and to sort probable protein-protein interactions in biological networks. Similarly, node classification estimates the likelihood of a node's label. This is used, for example, to infer missing meta-data on social media profiles, and genes in proteins.
Numerous graph analysis methods have been developed. Broadly, these methods can be categorized as non-parametric and parametric. Non-parametric methods operate directly on the graph, whereas parametric methods represent the properties of nodes and edges in the graph in a low-dimensional space. Non-parametric methods such as Common Neighbors [153], Adamic-Adar [1] and Jaccard's coefficient [181] require access to, and knowledge of, the entire graph for the prediction. On the other hand, parametric models such as that of Thor et al. [204] employ graph summarization and define super nodes and super edges to perform link prediction. Kim et al. [109] use Expectation Maximization to fit the real network as a Kronecker graph and estimate the parameters. Another class of parametric models that has gained much attention recently is graph embeddings [72, 81, 26]. Graph embedding methods define a low-dimensional vector for each node and a distance metric on the vectors. These methods learn the representation by preserving certain properties of the graph: Graph Factorization [3] preserves visible links, HOPE [158] aims to preserve higher order proximity, and node2vec [78] preserves both structural equivalence and higher order proximity.
In this chapter, we focus our attention on graph embedding methods. While this is a very active area of research that continues to gain popularity among researchers, there are several challenges that must be addressed before graph embedding algorithms become mainstream.
4.1.1 Challenges
Most research on graph embedding has focused on the development of mechanisms to preserve various characteristics of the graph in the low-dimensional space. However, very little attention has been dedicated to the development of mechanisms to rigorously compare and evaluate different graph embedding methods. To make matters worse, most of the existing work uses simple synthetic data sets for visualization and a few real networks for quantitative comparison. Goyal et al. [72] use Stochastic Block Models to visualize the results of graph embedding methods. Salehi et al. [180] use the Barabasi-Albert graph to understand the properties of embeddings. Such an evaluation strategy suffers from the following challenges:
1. Properties of real networks vary according to the domain. Therefore it is often difficult to ascertain the reason behind the performance improvement of a given method on a particular real dataset (as shown in [72]).
2. As demonstrated in this chapter, the performance of embedding approaches varies greatly according to the properties of different graphs. Therefore, the utility of any specific method is difficult to establish and characterize. In practice, the performance improvement of a method may even be attributable to stochasticity.
3. Different methods use different metrics for evaluation. This makes it very difficult to compare the performance of different graph embedding methods on a given problem.
4. Typically, each graph embedding method has a reference implementation. Each implementation makes specific assumptions about the data, the representation, etc. This further complicates the comparison between methods.
4.1.2 Contributions
In this work, we aim to: (i) provide a unifying framework for comparing the performance of state-of-the-art and future graph embedding methods; (ii) establish a benchmark comprised of 100 real-world graphs that exhibit different structural properties; and (iii) provide users with a fully automated Python library that selects the best graph embedding method for their graph data. We address the above challenges (Section 4.1.1) with the following contributions:
1. We propose an evaluation benchmark to compare and evaluate embedding methods. This benchmark consists of 100 real-world graphs categorized into four domains: social, biology, technological and economic.
2. Using our evaluation benchmark, we evaluate and compare 8 state-of-the-art methods and provide, for the first time, a characterization of their performance against graphs with different properties. We also compare their scores with those of traditional link prediction methods and ascertain the general utility of embedding methods.
3. A new score, the GFS-score, is introduced to compare various graph embedding methods for link prediction. The GFS-score provides a robust metric to evaluate a graph embedding approach by averaging over 100 graphs. It further has several components based on the type and properties of the graph, yielding insights into the methods.
4. A Python library comprised of 4 state-of-the-art embedding methods and 4 traditional link prediction methods. This library automates the evaluation, the comparison against all the other methods, and the performance plotting of any new graph embedding method.
4.1.3 Organization
The rest of this work is organized as follows. Section 4.2 presents the notation used in the chapter and an overview of graph embeddings. Section 4.3 introduces the benchmark framework, describes the real benchmark graphs, and defines the GFS-score. Section 4.4 presents the results and analysis. Section 4.5 introduces the Python library. Section 4.7 concludes.
4.2 Notations and Background
This section introduces the notation used in this chapter and provides a brief overview of graph embedding methods. For an in-depth analysis of graph embedding theory we refer the reader to [72].
4.2.1 Notations
G(V, E) denotes a weighted graph, where V is the set of vertices and E is the set of edges. We represent W as the adjacency matrix of G, where W_{ij} = 1 represents the presence of an edge between i and j.
A graph embedding is a mapping f : V \to \mathbb{R}^d, where d \ll |V|, and the function f preserves some proximity measure defined on the graph G. It aims to map similar nodes close to each other. The function f, applied to the graph G, yields an embedding Y. In this work, we evaluate four state-of-the-art graph embedding methods on a set of real graphs denoted by R and synthetic graphs denoted by S. To analyze the performance of the methods, we categorize the graphs into a set of domains 𝒟 = {Social, Economic, Biology, Technological}. The set of graphs in a domain D ∈ 𝒟 is represented as R_D. We use multiple evaluation metrics on the graph embedding methods to draw insights into each approach. We denote this set of metrics as M. The notation is summarized in Table 4.1.

Table 4.1: Summary of notation

  G      Graphical representation of the data
  V      Set of vertices in the graph
  E      Set of edges in the graph
  W      Adjacency matrix of the graph, |V| x |V|
  f      Embedding function
  S      Set of synthetic graphs
  R_D    Set of real graphs in domain D
  D      Set of domains
  M      Set of evaluation metrics
  e_m    Evaluation function for metric m
  A      Set of graph and embedding attributes
  d      Number of embedding dimensions
  Y      Embedding of the graph, |V| x d

4.2.2 Graph Embedding Methods
Graph embedding methods embed graph vertices into a low-dimensional space. The goal of graph embedding is to preserve certain properties of the original graph, such as the distance between nodes and the neighborhood structure. Based upon the function f used for embedding the graph, existing methods can be classified into three categories [72]: factorization based, random walk based and deep learning based.
4.2.2.1 Factorization based approaches
Factorization based approaches apply factorization to graph related matrices to obtain the node representation. Graph matrices such as the adjacency matrix, the Laplacian matrix, and the Katz similarity matrix contain information about node connectivity and the graph's structure. One class of matrix factorization approaches uses the eigenvectors from the spectral decomposition of a graph matrix as node embeddings. For example, to preserve locality, LLE [178] uses the d eigenvectors corresponding to the eigenvalues from the second smallest to the (d+1)-th smallest of the sparse matrix (I - W)^\top (I - W). It assumes that the embedding of each node is a linear weighted combination of its neighbors' embeddings. Laplacian Eigenmaps [14] takes the first d eigenvectors with the smallest eigenvalues of the normalized Laplacian D^{-1/2} L D^{-1/2}. Both LLE and Laplacian Eigenmaps were designed to preserve the local geometric relationships of the data. Another type of matrix factorization method learns node embeddings under different optimization functions in order to preserve certain properties. Structure Preserving Embedding [188] builds upon Laplacian Eigenmaps to recover the original graph. Cauchy Graph Embedding [140] uses a quadratic distance formula in the objective function to emphasize similar nodes instead of dissimilar nodes. Graph Factorization [3] uses an approximation function to factorize the adjacency matrix in a more scalable manner. GraRep [28] and HOPE [158] were introduced to keep the high order proximity in the graph. Factorization based approaches have been widely used in practical applications due to their scalability. The methods are also easy to implement and can yield quick insights into a data set.
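As an illustration of the spectral construction just described, the following sketch computes a Laplacian Eigenmaps style embedding with networkx and SciPy. The function name is an assumption, and a production implementation would typically use shift-invert mode to extract the small eigenvalues more reliably.

```python
import networkx as nx
import numpy as np
from scipy.sparse.linalg import eigsh

def laplacian_eigenmaps(G, d=2):
    """Embed the nodes of G with the d eigenvectors of the normalized Laplacian
    having the smallest non-trivial eigenvalues (a sketch of the idea above)."""
    L = nx.normalized_laplacian_matrix(G).astype(float)
    vals, vecs = eigsh(L, k=d + 1, which='SM')   # smallest d+1 eigenpairs
    order = np.argsort(vals)
    return vecs[:, order[1:d + 1]]               # drop the trivial eigenvector

# Example: 2-dimensional embedding of the Karate club graph.
Y = laplacian_eigenmaps(nx.karate_club_graph(), d=2)
print(Y.shape)  # (34, 2)
```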
4.2.2.2 Random walk approaches
Random walk based algorithms are more flexible than factorization methods in exploring the local neighborhood of a node for high-order proximity preservation. DeepWalk [169] and node2vec [78] aim to learn a low-dimensional feature representation for nodes through streams of random walks. These random walks explore the nodes' varied neighborhoods. Thus, random walk based methods are much more scalable for large graphs and they generate informative embeddings. Although very similar in nature, DeepWalk simulates uniform random walks, whereas node2vec employs search-biased random walks, which enables the embedding to capture community structure or structural equivalence via different bias settings. LINE [198] combines two phases for embedding feature learning: one phase uses a breadth-first search (BFS) traversal across first-order neighbors, and the second phase focuses on sampling nodes from second-order neighbors. HARP [34] improves DeepWalk and node2vec by creating a hierarchy of nodes and using the embedding of the coarsened graph as a better initialization in the original graph. Walklets [170] extends DeepWalk by using multiple skip lengths in the random walks. Random walk based approaches tend to be more computationally expensive than factorization based approaches but can capture complex properties and longer dependencies between nodes.
4.2.2.3 Neural network approaches
The third category of graph embedding approaches is based on neural networks. Deep neural network based approaches capture the highly non-linear network structure in graphs, which is neglected by factorization based and random walk based methods. One type of deep learning based method, exemplified by SDNE [213], uses a deep autoencoder to provide non-linear functions to preserve the first and second order proximities jointly. Similarly, DNGR [30] applies random surfing on the input graph before a stacked denoising autoencoder and makes the embedding robust to noise in graphs. Another genre of methods uses Graph Neural Networks (GNNs) and Graph Convolutional Networks (GCNs) [23, 84, 133, 80] to aggregate the neighbors' embeddings and features via convolutional operators, including spatial or spectral filters. GCNs learn embeddings in a semi-supervised manner and have shown great improvement and scalability on large graphs compared to other methods. SEAL [239] learns a wide range of link prediction heuristics from extracted local enclosing subgraphs with a GNN. DIFFPOOL [232] employs a differentiable graph pooling module on GNNs to learn hierarchical embeddings of graphs. Variational Graph Auto-Encoders (VGAE) [113] use a GCN as the encoder and the inner product as the decoder, which provides embeddings of higher quality than plain autoencoders. Deep neural network based algorithms like SDNE and DNGR can be computationally costly, since they require global information, such as the adjacency matrix, as input for each node. GCN based methods are more scalable and flexible in characterizing global and local neighborhoods through varied convolutional and pooling layers.
4.3 GEM-BEN: Graph Embedding Methods Benchmark
Unlike other fields with well established benchmark datasets (e.g., community detection [121]), the graph embedding community has adopted an ad-hoc approach to evaluate new methods.
[Figure 4.1: Real graph properties. Distributions of the number of nodes (N), density, and diameter (dia) of the benchmark graphs, with one row per domain: Biological, Technological, Economic, Social.]
Typically, graph embedding methods are evaluated on only a few real networks, and these are biased towards specific properties. This ad-hoc evaluation approach prevents us from understanding how an algorithm would behave if we varied a certain property of the graph, or how the algorithm performs on other types of graphs.
In order to propose a more rigorous evaluation approach, we must first understand the key attributes that govern the performance of graph embedding methods. First, the size of the graph (A1) is a challenge for any method. Real graphs vary in the number of nodes, from a few hundred to millions of nodes. Different methods make different assumptions on how to capture the higher order proximities and structural dependencies between nodes, and this greatly affects their scalability. Second, the density of the graph (A2) plays an important role in defining its structure. A lower density results in less information about the nodes, which may hamper the performance of some methods. Third, the dimension of the embedding (A3) determines how concisely the method can store the information about a given graph. A higher embedding dimension may lead to overfitting of the graph, whereas a lower embedding dimension may not be enough to capture the information the graph provides, resulting in underfitting. Fourth, the evaluation metric (A4) used to evaluate the method captures different aspects of the prediction. Global metrics are often biased towards high degree nodes, whereas local metrics can be biased towards lower degree nodes.
In this chapter, we take the first step towards establishing a graph embedding benchmark. We propose a benchmark evaluation framework to answer the following questions:
Q1: How does the performance of embedding methods vary with the increasing size of the graph?
Q2: How does increasing the density of the graph affect the model?
Q3: How does the optimal embedding dimension vary with an increasing number of nodes in the graph?
Q4: How does the performance vary with respect to the evaluation metric?
To address the above questions, we introduce a suite of 100 real graphs and vary the above attributes (A1, ..., A4) in the graphs and the embedding methods. Varying the size of the graph (A1) in terms of the number of nodes answers the first question (Q1) and helps us understand which methods are best suited to small, medium, and large graphs. Similarly, varying the density of the graph (A2) in terms of the average degree of nodes helps us understand its effect on embedding performance. This answers the second question (Q2).
Furthermore, varying the dimension of the embedding (A3) helps us draw insights into the information compression power of each embedding approach. This answers the third question (Q3). Finally, by varying the evaluation metric (A4) we can analyze the performance sensitivity of each method, which can help us infer the bias of the embedding method towards specific nodes in the graph. This answers the fourth question (Q4).
4.3.1 Real Graphs
We propose a novel data set containing 100 real world graphs from four domains: social, biology, economic, and technological. To demonstrate the usefulness of this benchmark, we evaluate eight graph embedding methods and measure their performance. This provides valuable insights about every method and its sensitivity to different graph properties. It also paves the way towards a framework that can be used to recommend the best embedding approaches for a given graph with a unique set of properties.
Figure 4.1 summarizes the main properties of the graphs from the different domains in the data set. We observe that economic graphs have a lower average density, varying between 0.00160 and 0.00280, with a higher number of graphs concentrated in the lower part of the density spectrum. Technological and social graphs are denser, with an average density between 0.0030 and 0.0160. It is interesting to note that, despite the wide average density range, densities are concentrated primarily in the lower and higher values, with a gap in between. Biological graphs have an almost uniform distribution of densities, ranging from 0.005 to 0.0155.
Next, we observe the domain-wise pattern of diameters. Economic graphs have the widest range (20-40) and the highest diameters, which is consistent with the lowest average densities observed. Technological graphs, with diameters between 11 and 17.5, are less sparse compared with economic graphs. Biological graphs have a good mix of both dense and sparse graphs, with the majority of graphs lying in the small diameter range; they typically have short and long diameter ranges of 8 to 12 and 16 to 18, respectively. Social graphs in general have a lower diameter, around 10, although some of them have higher diameters.
On further investigation, we observe that biological networks have the highest clustering tendencies, with an average clustering coefficient of 0.10. Economic graphs stand in absolute contrast to them, with 0.00016 as the highest recorded average clustering coefficient. Technological networks are somewhere in between the aforementioned extremes, with 0.03 as the highest recorded average clustering coefficient. These clustering tendencies are highly correlated with the average density and diameter observations.
Note that these 100 graphs constitute a very diverse set in terms of the size of the graph (A1), ranging from 200 to 1500 nodes, and in terms of the density of the graph (A2), with average densities ranging from 0.0015 to 0.020. As will be shown in Section 4.4, this graph diversity is helpful in characterizing the performance of different embedding methods.
4.3.2 Evaluation Metrics
In the graph embedding literature, there are two primary metrics used to evaluate the performance of methods on link prediction: (i) Precision at k (P@k) and (ii) Mean Average Precision (MAP). These metrics are defined as follows:
P@k is the fraction of correct predictions in the top k predictions.
It is defined as

P@k = \frac{|E_{pred}(1:k) \cap E_{obs}|}{k},

where E_{pred}(1:k) are the top k predictions and E_{obs} are the observed (hidden) edges.
MAP estimates the prediction precision for every node and computes the average over all nodes, as follows:

MAP = \frac{\sum_i AP(i)}{|V|}, \quad \text{where} \quad AP(i) = \frac{\sum_k P@k(i) \cdot \mathbb{I}\{E_{pred_i}(k) \in E_{obs_i}\}}{|\{k : E_{pred_i}(k) \in E_{obs_i}\}|}, \quad P@k(i) = \frac{|E_{pred_i}(1:k) \cap E_{obs_i}|}{k},

and E_{pred_i} and E_{obs_i} are the predicted and observed edges for node i, respectively.
Intuitively, P@k is a global metric that measures the accuracy of the most likely links predicted. On the other hand, MAP measures the accuracy of the prediction for each node and computes the average over nodes. These metrics are often uncorrelated and reflect the properties captured by the prediction method at different levels (MAP at the local level and P@k at the global level). In this work, we present results using both of these metrics to analyze each approach.
4.3.3 GFS-score
We now define a set of scores to evaluate a graph embedding model on our data set. The scores are divided into components to draw insights into a method's behavior across domains and metrics. We further plot the metrics while varying various graph properties to understand the sensitivity of the models to these properties. Given a set of graph domains 𝒟, a set of evaluation metrics M and an evaluation function e_m(graph, approach) for m ∈ M, we define the GFS-score for an approach a as follows:

micro\text{-}GFS_m(a) = \frac{\sum_{g \in \mathcal{G}} e_m(g, a) / e_m(g, \text{random})}{|\mathcal{G}|}, \qquad (4.1)

macro\text{-}GFS_m(a) = \frac{\sum_{d \in \mathcal{D}} GFS_m(d, a)}{|\mathcal{D}|}, \qquad (4.2)

GFS_m(d, a) = \frac{\sum_{g \in \mathcal{G}_d} e_m(g, a) / e_m(g, \text{random})}{|\mathcal{G}_d|}, \qquad (4.3)

where 𝒢_d is the set of graphs in domain d. The GFS-score is a robust score which averages over a set of real graphs with varying properties. It is normalized in order to ascertain the gain in performance with respect to a random prediction. The domain scores provide insights into the applicability of each approach to the different graph categories.
4.3.4 Link Prediction Baselines
Our link prediction baselines were selected to showcase the utility of embedding approaches on real graphs and to establish a ground truth for the comparison between the state-of-the-art methods. The link prediction baselines are:
Preferential Attachment [11] is based on the assumption that the connection to a node is proportional to its degree. It defines the similarity between two nodes as the product of their degrees.
Common Neighbors [153] defines the similarity between two nodes as the number of common neighbors between them.
Adamic-Adar [1] is based on the intuition that common neighbors with very large neighbourhoods are less significant than common neighbors with small neighborhoods when predicting a connection between two nodes. Formally, it is defined as the sum of the inverse logarithmic degree centrality of the neighbours shared by the two nodes.
Jaccard's Coefficient [100] measures the probability that two nodes i and j have a connection to a node k, for a randomly selected node k from the neighbors of i and j.
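All four heuristics are available directly in networkx; the sketch below scores a set of candidate node pairs with each of them. The wrapper function and the choice of candidate pairs are illustrative assumptions.

```python
import networkx as nx

def baseline_scores(G, candidate_pairs):
    """Score candidate node pairs with the four traditional heuristics.
    Returns a dict mapping heuristic name -> {(u, v): score}."""
    return {
        'preferential_attachment':
            {(u, v): s for u, v, s in nx.preferential_attachment(G, candidate_pairs)},
        'jaccard':
            {(u, v): s for u, v, s in nx.jaccard_coefficient(G, candidate_pairs)},
        'adamic_adar':
            {(u, v): s for u, v, s in nx.adamic_adar_index(G, candidate_pairs)},
        'common_neighbors':
            {(u, v): len(list(nx.common_neighbors(G, u, v))) for u, v in candidate_pairs},
    }

# Example on the Karate club graph, scoring a few non-edges.
G = nx.karate_club_graph()
pairs = list(nx.non_edges(G))[:5]
print(baseline_scores(G, pairs))
```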
Higher Order Proximity Preserving (HOPE) [158]: factorizes the higher order similarity matrix between nodes using generalized singular value decomposition.

Structural Deep Network Embedding (SDNE) [213]: uses a deep autoencoder together with a Laplacian Eigenmaps objective to preserve first and second order proximities.

Table 4.2: Average and standard deviation of GFS-score

Method | micro-GFS (MAP / P@100) | macro-GFS (MAP / P@100) | GFS-bio (MAP / P@100) | GFS-eco (MAP / P@100) | GFS-soc (MAP / P@100) | GFS-tech (MAP / P@100)
PA   | 3.1 / 37.7  | 3.4 / 31.3  | 3.3 / 35.9 | 2.6 / 2.7    | 6.4 / 83.6   | 1.3 / 3.0
CN   | 2.8 / 77.0  | 4.6 / 67.2  | 3.2 / 36.1 | 0.1 / 0.0    | 14.8 / 232.8 | 0.5 / 0.0
JC   | 2.3 / 28.3  | 3.8 / 26.3  | 2.2 / 0.5  | 0.01 / 0.0   | 12.6 / 102.8 | 0.5 / 1.8
AA   | 2.9 / 74.6  | 4.8 / 66.0  | 3.3 / 28.3 | 0.02 / 0.0   | 15.5 / 234.6 | 0.5 / 1.0
LE   | 6.6 / 18.9  | 6.4 / 17.4  | 3.5 / 0.75 | 6.2 / 0.0    | 12.3 / 65.3  | 3.8 / 3.5
GF   | 5.7 / 41.2  | 5.2 / 40.9  | 3.3 / 19.2 | 5.7 / 80.5   | 10.5 / 62.2  | 1.35 / 3.4
HOPE | 6.1 / 98.0  | 7.3 / 89.1  | 4.7 / 43.8 | 4.2 / 45.2   | 16.7 / 263.3 | 3.4 / 4.25
SDNE | 11.0 / 90.6 | 10.1 / 91.3 | 6.8 / 33.1 | 11.2 / 143.3 | 18.6 / 170.4 | 4.0 / 18.5

4.4 Experiments and Analysis

This section evaluates the performance of the baseline and state-of-the-art methods on link prediction on the benchmark graphs, following the methodology presented in Section 4.3. First, we present the general results, and then use subsections for an in-depth analysis. Figure 4.2 shows the MAP scores achieved by the eight methods when varying the size of the graph (A1) (256, 512, and 1024 nodes), the density of the graph (A2) (degree 3, 4, and 5), and the dimension of the embedding (A3) (dimensions $2^4, \ldots, 2^7$). We present the results for four graph categories: economic, biological, social and technological. The MAP scores are precision scores averaged over all nodes, which makes the score unbiased towards high degree nodes. However, as most real world graphs have a long-tail degree distribution, this may be unfair to methods which predict the top links in the graphs but fail for nodes with less information. Overall, we observe that methods show consistent performance across data sets from the same domain, as shown by the low variance. Further, the best embedding approaches obtain a MAP value of around 0.1, which is about a 5 times improvement over the traditional link prediction methods. Similarly, Figure 4.3 shows the P@100 values for the eight link prediction methods using the same experimental setup. P@k is a global metric which computes the accuracy of the top k predictions.

Figure 4.2: Performance evaluation of different methods varying the attributes of graphs (panels: domain — Economic, Biological, Technological, Social; number of nodes — 256, 512, 1024; density — 0.002, 0.008, 0.02). The x axis denotes the dimension of the embedding ($2^4$–$2^7$), whereas the y axis denotes the MAP scores. Methods: Random, Common Neighbor, Adamic-Adar, Preferential Attachment, Jaccard Coeff, Laplacian Eigenmaps, Graph Factorization, HOPE, SDNE.
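To make the evaluation protocol concrete, the sketch below shows one way to compute P@k, MAP and the GFS normalization from ranked predictions. The helper names and the data layout (per-node ranked candidate lists and held-out edge sets) are illustrative assumptions, not the benchmark's actual code.

```python
def precision_at_k(ranked_edges, held_out, k):
    """P@k: fraction of the top-k ranked candidate edges that are held-out (true) edges."""
    return len(set(ranked_edges[:k]) & set(held_out)) / k

def mean_average_precision(ranked_by_node, held_out_by_node):
    """MAP: per-node average precision, averaged over all nodes with held-out edges."""
    ap_scores = []
    for node, ranked in ranked_by_node.items():
        truth = held_out_by_node.get(node, set())
        if not truth:
            continue
        hits, precisions = 0, []
        for rank, edge in enumerate(ranked, start=1):
            if edge in truth:
                hits += 1
                precisions.append(hits / rank)  # P@k(i) evaluated at each correct prediction
        ap_scores.append(sum(precisions) / len(precisions) if precisions else 0.0)
    return sum(ap_scores) / max(len(ap_scores), 1)

def gfs_score(metric_by_graph, random_by_graph):
    """GFS normalization (Eqs. 4.1 and 4.3): average of metric / random-baseline ratios
    over a set of graphs (both dicts are keyed by the same graph identifiers)."""
    ratios = [metric_by_graph[g] / random_by_graph[g] for g in metric_by_graph]
    return sum(ratios) / len(ratios)
```

The macro-GFS score of Eq. (4.2) is then simply the mean of the per-domain gfs_score values.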
Figure 4.3: Performance evaluation of different methods varying the attributes of graphs, with the same panels as Figure 4.2. The x axis denotes the dimension of the embedding, whereas the y axis denotes the P@100 scores.

As for MAP, the value of P@100 is consistent across data sets. The best embedding approaches obtain a P@100 of 0.15-0.2, which is an improvement of 5-10 times over traditional link prediction methods. We address the evaluation metrics (A4) in Table 4.2, where the GFS-score for each method is presented. In general, SDNE obtains top performance across domains and metrics, with the exception of P@100 in the biology and social domains, for which HOPE performs better. We can also see that, in general, embedding approaches outperform traditional methods. We now study the performance variations between methods in detail.

4.4.1 Domain Performance

For economic graphs, we find that traditional link prediction models perform poorly because these graphs do not have the notion of a community structure and therefore violate the assumptions of the traditional methods. The state-of-the-art methods perform better and capture the structure of these networks, achieving good scores on both MAP and P@100. We observe that Structural Deep Network Embedding (SDNE) gives the best performance for economic graphs, with a GFS-score of 11.2 for MAP and 143.3 for P@100. The score for P@100 is significantly higher than for MAP. The reason is that predicting correct links for low degree nodes is more challenging than for high degree nodes, and thus averaging over all nodes gives a low overall performance. We also find that Graph Factorization (GF), despite capturing only first order proximity, performs better than Higher Order Proximity Embedding (HOPE). This suggests that economic graphs do not benefit from capturing longer dependencies. The overall good performance of SDNE indicates that economic graphs have complex structures that benefit from deep neural networks.

Biological graphs show a more consistent performance across traditional methods. Preferential Attachment, Common Neighbors and Adamic-Adar give a GFS-MAP score of about 3.3 and a GFS-P@100 score of around 30. Among the graph embedding methods, Graph Factorization and Laplacian Eigenmaps do not improve over the traditional methods and give similar scores. However, HOPE and SDNE show significant gains in performance: SDNE and HOPE give the highest MAP and P@100 performance, respectively. As HOPE captures higher order proximity, these results suggest that longer dependencies are critical to predicting top links in the biology domain. For the MAP score, SDNE performs better than HOPE. This may be because low degree nodes require an understanding of structure to make accurate predictions.
Furthermore, we see that the GFS scores for the biology domain are lower than for economic graphs, an indication that these networks are more difficult to predict.

Technological graphs show patterns similar to economic graphs. Non-parametric link prediction methods perform poorly and are unable to outperform the random baseline. Laplacian Eigenmaps gives a score of around 3.5 for both MAP and P@100. HOPE and SDNE perform the best among the state-of-the-art methods, showing that both higher order proximity and structure are helpful in the prediction. We also observe that for technological graphs the gain in performance achieved by SDNE is much higher for P@100. This suggests that several top links are based not on neighborhood but on distant nodes with similar structure.

For social graphs, we observe that traditional methods perform very well, mostly because these methods were designed for such graphs. On average, both SDNE and HOPE outperform the baselines for MAP, and HOPE achieves the highest performance on P@100. Social graphs often have community structure, and capturing the local and global neighborhood can be useful in modeling the graph. The higher performance of HOPE may be due to its ability to capture such properties well.

In general, we observe that the state-of-the-art graph embedding methods have higher micro and macro GFS scores than traditional methods, showcasing the utility of these approaches. It is worth noting that Laplacian Eigenmaps is good at predicting links for nodes with lower degrees, but it does not perform well for top link predictions. On average, SDNE outperforms the other methods by a big margin. In conclusion, exploring deep learning based graph embedding methods is a valuable direction for future work.

4.4.2 Sensitivity to Graph Size

Real graphs vary in the number of nodes and edges. On the one hand, representing larger graphs in low dimensions requires more compression. On the other hand, many data driven approaches need data to learn their parameters and thus may not perform well for smaller graphs, where the data may not be sufficient. To study the effect of graph size on embedding performance, we evaluate the methods on graphs of 256, 512 and 1024 nodes (as shown in Figures 4.2 and 4.3). We observe that for most methods the absolute MAP and P@100 scores decrease as the size of the graph increases. For example, the average MAP score of HOPE across all dimensions is 0.08, 0.06 and 0.04 for graphs of sizes 256, 512 and 1024, respectively. Similar trends are observed for Graph Factorization and Laplacian Eigenmaps. For SDNE, however, the performance decreases for 512 nodes and increases for 1024 nodes. The reason may be that the deep neural network has more parameters to store information about larger graphs and benefits from more data. Another key observation is that although the absolute MAP values decrease, the performance relative to the random baseline improves. As the number of nodes increases, the link prediction task becomes inherently more difficult: with more nodes there are more candidate links to choose from, leading to a lower absolute performance. In general, the increase in performance relative to the random baseline suggests that the state-of-the-art embedding approaches better capture the graph structure when there is more information about the links.

4.4.3 Sensitivity to Graph Density

A wide variety of real world graphs are sparse.
However, the degree of sparsity varies depending on the domain of the graph and its other connectivity patterns. We now study the effect of graph density on the performance of graph embedding methods (Figures 4.2 and 4.3). In Figure 4.2 we observe that, overall, the performance of embedding methods in terms of MAP is not sensitive to graph density. The MAP values of Graph Factorization increase with increasing graph density, but for the other methods the trend is not significant. Further, with higher density the likelihood of a correct prediction increases, so the trend observed for Graph Factorization may not signify a true increase in performance with density. From Figure 4.3 we observe that the models perform poorly on the P@100 score for lower densities. Thus, the methods make top link predictions with more accuracy when the graph is relatively denser.

4.4.4 Sensitivity to Embedding Dimension

An inherent parameter of every embedding approach is the number of dimensions of the embedded space. The embedding dimension controls the amount of overfitting incurred by the method. A low number of dimensions may not be sufficient to capture the information contained in the graph, while a high number of dimensions may overfit the observed links and thus perform poorly on predicting the missing links. The optimal embedding dimension depends on three factors: (i) the input graph, (ii) the embedding method, and (iii) the evaluation metric. Theoretically, graphs with less structure and higher entropy require higher dimensions. Furthermore, embedding methods which are capable of capturing more complex information may require higher dimensions to store that information. Also, predicting top links requires a higher level understanding of the graph and thus requires lower dimensions. Finding the optimal embedding dimension for a given graph requires extensive experimentation.

Figure 4.2 shows the MAP performance of the graph embedding methods when varying the embedding dimension from $2^4$ to $2^7$. These results provide us with interesting insights. The performance of SDNE improves proportionally when increasing the embedding dimension. This is surprising, as deep neural network models are prone to overfitting; but since deep neural networks are capable of capturing more complex information, they benefit from a higher dimensional representation. Laplacian Eigenmaps performs best at lower dimensions and quickly overfits at higher dimensions. This is not surprising, as it only captures first order proximity and the embedding vector stores this first order proximity with higher precision. HOPE and GF also improve their performance with increasing dimensions, but HOPE has comparatively higher gains. Figure 4.3 shows the P@100 performance. The upward trend of SDNE for MAP is consistent across metrics, and we observe a similar increase for P@100. However, the performance of HOPE deteriorates with increasing dimensions, and Graph Factorization's performance is almost constant across all embedding dimensions. This follows the intuition that predicting top links does not require higher dimensions.

4.5 Python Library for GEM-BEN

We have created a pip installable Python library for graph embedding method benchmarking called gemben (https://pypi.org/project/gemben). The documentation of the gemben library can be found at https://gemben.readthedocs.io.

4.6 Results on Synthetic Graphs

4.6.1 Synthetic Graphs

Synthetic graphs are commonly used to simulate the properties of real networks.
This approach is useful when analyzing structural properties and network evolution processes, especially when the real graph is very large or cannot be fully observed. Realistic synthetic graphs can provide reliable and practical statistical confidence for algorithmic analysis and method evaluation. Synthetic graphs are ubiquitous across domains including sociology, biology, medicine, the internet, teamwork dynamics, and human collaborations.

Many graph embedding methods for link prediction, node classification, graph visualization, and graph reconstruction have been evaluated on synthetic graphs. For example, DeepWalk [169] uses Zachary's Karate graph to visualize the quality of the embeddings. GEM [72] uses the Stochastic Block Model (SBM) graph for visualization and for evaluation on link prediction, node classification, and graph reconstruction. Similarly, dyngraph2vec [71] also utilizes the SBM graph to evaluate different dynamic graph embedding methods on link prediction tasks. HOPE [158] uses synthetic graphs generated by the forest fire model to evaluate the embedding's ability to preserve high-order proximity. Although synthetic graphs do not always fully capture the inherent properties of real graphs, they are still useful for analyzing the performance of graph embedding methods with respect to different graph characteristics. In this section, we analyze the performance of traditional and state-of-the-art methods on synthetic graphs. Our goal is to establish a better understanding of how graph embedding methods perform on some of the most popular synthetic graph models available in the literature.

4.6.2 Synthetic Graph Dataset

We evaluate the graph embedding methods implemented in our benchmark on ten synthetic graph generators, which can be categorized into three domains: (i) social, (ii) biology, and (iii) internet. Social networks represent the relationships between users on online social platforms. They characterize friendship networks (e.g., Facebook) and follower networks (e.g., Twitter, where links between users are explicit and directed). Social graphs follow small-world properties and a power-law degree distribution, and usually contain hubs and community structure. It is well known that the Random Geometric graph [168], the Waxman graph [218] and the Stochastic Block Model [214] can appropriately reproduce the community structure typically found in social networks. Biology graphs can represent protein-protein interaction networks, gene co-expression networks, and metabolic networks; the Watts-Strogatz graph [217] and the Duplication Divergence graph [98] are heavily used to simulate biological networks. Graphs in the internet domain, such as the web graph of the World Wide Web, are usually very large and scale-free; the Barabasi-Albert graph [11] and the Power-law Cluster graph [88] have been used to simulate the power-law distribution found in internet graphs. In addition to these generators, which are designed to simulate a certain type of real graph, Leskovec et al. [126] used the Stochastic Kronecker Graph to effectively model real networks from different domains using four parameters through an iterative Kronecker product process.

We use the following synthetic graphs to illustrate the inefficacy of existing data sets and to highlight the performance gaps between different graph embedding methods:

Barabasi-Albert graph [11]: this model generates random graphs using a preferential attachment process.
A graph of $n$ nodes is constructed by adding new nodes with $m$ edges each, which are connected to existing nodes with a likelihood proportional to their degree. The generated random graph has power-law properties similar to those found in real-world networks.

Powerlaw Cluster graph [88]: extends the Barabasi-Albert graph with a triad formation step. When a new node $u$ is linked to an existing node $v$, then with probability $p$, $u$ is also connected to one of $v$'s neighbors. The likelihood of triad formation controls the clustering coefficient of the graph.

Watts-Strogatz graph [217]: the model creates a ring graph of $n$ nodes and links each node to its $k$ nearest neighbors. Each link from a node $u$ is then rewired randomly to a node $w$ with probability $p$. The rewiring creates short paths between nodes.

Duplication Divergence graph [98]: this random graph model is based on the behavior of protein-protein interaction graphs. The model starts with an initial graph and evolves via two steps: duplication and divergence. In the duplication step, a uniformly randomly chosen target node is duplicated and the duplicate is connected to each neighbor of the target node. In the divergence step, each edge of the duplicate is removed with probability $1-p$; the retention probability of an edge is therefore $p$.

Random Geometric Graph [168]: a spatial network that starts with an arbitrary distribution of $n$ nodes in a metric domain and creates an edge between any pair of nodes whose spatial distance is under a certain threshold. This model simulates the community structure found in human social networks.

Waxman Graph [218]: this model extends the Random Geometric graph by adding edges through a probabilistic process. First, it uniformly places $n$ nodes in a rectangular space. If the distance between two nodes is within the neighborhood radius $r$, an edge is added with probability $p$. The Waxman graph also exhibits community structure.

Stochastic Block Model Graph [214]: this random graph generator splits $n$ nodes into $m$ communities of arbitrary size. For each pair of vertices belonging to communities $C_i$ and $C_j$ respectively, the model connects the two nodes with probability $p_{ij}$. In order to preserve the community structure, the within-block connection probability is higher than the cross-block probability.

R-MAT (Recursive Matrix) Graph [31]: this model is able to simulate any unimodal or power-law graph with a few parameters using a recursive matrix. It recursively subdivides the adjacency matrix into four equal-sized partitions and distributes edges within these partitions with unequal probabilities.

Random Hyperbolic Graph [118]: this generation method builds a graph based on hyperbolic geometry. It randomly places nodes in a hyperbolic disk of radius $R$ and then connects each pair of vertices with an edge if their distance is less than $R$.

Stochastic Kronecker Graph [126]: similar to the R-MAT graph's recursive generation process, this model builds the graph's adjacency matrix from a $2 \times 2$ parameter matrix via iterated Kronecker products. Each entry of the parameter matrix is a real number between 0 and 1. The Stochastic Kronecker Graph can simulate realistic graphs while preserving the common properties of real graphs.
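As a point of reference, most of the generators above have standard implementations in networkx; the sketch below shows one plausible way to instantiate them. The specific parameter values are illustrative, not the ones used in the benchmark, and the remaining generators (R-MAT, random hyperbolic and stochastic Kronecker) are not covered here.

```python
import networkx as nx

n = 1024  # illustrative graph size; the benchmark varies this

synthetic_graphs = {
    # social-like generators with community structure
    "stochastic_block_model": nx.stochastic_block_model(
        sizes=[256, 256, 256, 256],
        p=[[0.02, 0.002, 0.002, 0.002],
           [0.002, 0.02, 0.002, 0.002],
           [0.002, 0.002, 0.02, 0.002],
           [0.002, 0.002, 0.002, 0.02]]),
    "random_geometric": nx.random_geometric_graph(n, radius=0.06),
    "waxman": nx.waxman_graph(n, beta=0.4, alpha=0.1),
    # biology-like generators
    "watts_strogatz": nx.watts_strogatz_graph(n, k=6, p=0.1),
    "duplication_divergence": nx.duplication_divergence_graph(n, p=0.5),
    # internet-like generators with power-law degree distributions
    "barabasi_albert": nx.barabasi_albert_graph(n, m=3),
    "powerlaw_cluster": nx.powerlaw_cluster_graph(n, m=3, p=0.1),
}

for name, g in synthetic_graphs.items():
    avg_deg = 2 * g.number_of_edges() / g.number_of_nodes()
    print(f"{name}: {g.number_of_nodes()} nodes, average degree {avg_deg:.1f}")
```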
The generators and the properties they are designed to reproduce are summarized below.

Domain | Graph Generator | Generator Properties
Social | Stochastic Block Model | dense areas with sparse connections
Social | Random Geometric Graph | community structure
Social | Waxman Graph | community structure
Biology | Watts-Strogatz Graph | ring-shaped graph
Biology | Duplication Divergence Graph | simulates protein-protein interactions via duplication and divergence
Biology | Hyperbolic Graph | large networks, power-law degree distribution and high clustering
Internet | Barabasi-Albert Graph | power-law degree distribution, small-world property
Internet | Powerlaw Cluster Graph | power-law degree distribution, small-world property
Internet | R-MAT Graph | power-law degree distribution, small-world property, self-similarity
All | Stochastic Kronecker Graph | builds the graph via iterated Kronecker products

4.6.3 Results

This section analyzes the performance of different graph embedding methods and link prediction heuristics on the link prediction task on the synthetic graph data sets described in Section 4.6.2. We evaluate the approaches from various perspectives, including graph domain, graph size, average node degree and embedding dimension. Figure 4.4 reports the link prediction MAP scores of the eight methods on all synthetic graphs from the three domains (social, biology and internet), varying the graph size from 256 to 8192 nodes, the embedding dimension from 16 to 256, and the average node degree from 4 to 12. While varying one parameter, we keep the other parameters fixed across graphs. The MAP score shown in each panel is averaged across all the synthetic graphs belonging to that domain. Figure 4.5 shows the P@100 scores of all methods.

In contrast to the clear advantage of state-of-the-art graph embedding methods such as SDNE and HOPE on real graphs (Figure 4.2), we observe that classic link prediction heuristics such as Jaccard's Coefficient, Common Neighbors and Adamic-Adar can surpass some graph embedding methods on certain simple synthetic graph data sets. Some synthetic graphs have a naive structure which the link prediction heuristics are capable of capturing. The graph embedding methods also perform differently depending on domain, graph size, node degree and embedding dimension.

Figure 4.6 shows concrete MAP scores of these methods on three chosen synthetic graphs, one from each domain, again varying graph properties and embedding dimensions. The chosen graphs are the Random Geometric Graph from the social domain, the Watts-Strogatz Graph from biology and the Powerlaw Cluster Graph from the internet domain. Figure 4.7 shows their P@100 performance. As seen in the figures, even with the same graph size, average node degree and embedding dimension, the methods show diverse link prediction performance. Some synthetic graph generators contain network structure characteristics which heuristic based and graph embedding based methods are able to learn. However, some synthetic graphs, such as the synthetic internet graphs, are generated in a more stochastic way; they are less structured, and all methods fail to predict links correctly.

4.6.3.1 Domain Performance

We divide all synthetic graph generators into three domains: social, biology and internet. Each domain contains three kinds of synthetic graphs, and we add the Stochastic Kronecker graph to each domain with parameters specific to that domain. For social network synthetic graphs, the heuristic methods Jaccard's Coefficient and Common Neighbors perform best on the link prediction task.
Synthetic social graphs, including the Stochastic Block Model, Random Geometric Graph and Waxman Graph, are usually community based: nodes within the same community are more densely connected than nodes outside it. Jaccard's Coefficient and Common Neighbors are capable of capturing this inherent property, namely that high degree nodes are more likely to connect to other high degree nodes within the same cluster. Among the graph embedding methods, Laplacian Eigenmaps has MAP performance similar to the best heuristics, and HOPE is also good at predicting the top 100 missing links. SDNE does not perform well on these synthetic graphs, probably because the graphs are simply structured and the model easily overfits the training links.

For biology graphs, the non-parametric methods Common Neighbors, Adamic-Adar and Jaccard's Coefficient show consistently good performance. Real biological graphs are diverse, as the objects and connections differ across contexts; synthetic biological graphs, however, are constructed in a heuristic manner, which explains the good performance of the heuristic link prediction methods. Graph embedding methods easily overfit such graphs, although some of them, such as HOPE, can still surpass all other methods at predicting the top 100 links.

For internet graphs, the Barabasi-Albert Graph, Powerlaw Cluster Graph and R-MAT Graph all have a power-law degree distribution and the small-world property. Among the heuristics, Adamic-Adar has the best MAP score and Jaccard's Coefficient achieves the best Precision@100 score. Like the social and biological graphs, the internet synthetic graphs are also built using such heuristics, so heuristic link prediction methods are well suited to explain the edge generation rules. Among the graph embedding methods, Graph Factorization performs comparatively better on synthetic graphs in the internet domain.

4.6.3.2 Sensitivity to Graph Size

In our experiments, we test all methods on synthetic graphs ranging from small (1024 nodes) and medium (2048 and 4096 nodes) to large (8192 nodes). From the three subplots in the first column of Figures 4.4 and 4.5, we notice that the absolute MAP values decrease as the graph size increases, across all methods in all three domains. However, some methods, including Laplacian Eigenmaps and HOPE, obtain higher P@100 scores as the graph size grows. On the one hand, the lower MAP score on large graphs may be explained by the fact that, as the size of the graph increases, it becomes harder to predict the hidden edges among a larger number of candidate edges; for graph embedding methods, embedding larger graphs with the same dimension requires more information compression, which results in relatively poor performance. On the other hand, as the number of nodes increases while the average node degree stays the same, the graph contains more edges overall; this additional information helps increase the P@100 of link prediction.

4.6.3.3 Sensitivity to Average Node Degree

Instead of using graph density, which is less controllable in the graph generators, we look into the sensitivity of link prediction and graph embedding methods to the average node degree of the graphs. In Figure 4.4, we observe that most methods achieve better MAP performance as the average node degree increases on the social and biological synthetic graphs. For a fixed graph size, a higher average node degree means each node has more edges, which makes it easier to predict the possible hidden links. However, the methods show the opposite trend on internet graphs.
Internet graphs might become more stochastic and noisy as their size grows. In Figure 4.5, HOPE and Laplacian Eigenmaps show good P@100 performance; graph embedding methods are good at predicting top links on denser graphs.

4.6.3.4 Sensitivity to Embedding Dimension

Traditional link prediction methods make predictions based on node similarity metrics and are invariant to dimension. Graph embedding methods, in contrast, represent nodes in an embedding space, and different embedding dimensions can produce embeddings of different quality. As seen in Figures 4.4 and 4.5, the higher the dimension used by the graph embedding methods, the better the MAP and P@100 scores they achieve on the same graph. As all the graph sizes are much larger than the embedding dimensions, a higher embedding dimension requires less compression of the graph information; a higher dimension preserves more features of the graph's nodes and results in better link prediction accuracy. For internet synthetic graphs, increasing the embedding dimension gives a smaller boost; the reason might be that these graphs are too noisy.

Figure 4.4: Benchmark Synthetic plot. Rows: social, biology and internet domains; columns: number of nodes ($2^{10}$–$2^{13}$), average degree (6–12) and embedding dimension ($2^5$–$2^8$); the y axis denotes the MAP scores. Methods: Random, Common Neighbor, Adamic-Adar, Preferential Attachment, Jaccard Coeff, Laplacian Eigenmaps, Graph Factorization, HOPE, SDNE.

Figure 4.5: Benchmark Synthetic plot, with the same layout as Figure 4.4 and P@100 on the y axis.

Figure 4.6: Benchmark plot for individual synthetic graphs (RG-graph, WS-graph, PC-graph), with the same column layout and MAP on the y axis.

Figure 4.7: Benchmark plot for individual synthetic graphs (RG-graph, WS-graph, PC-graph), with P@100 on the y axis.

4.7 Conclusion and Future Work

This work introduced a benchmark for graph embedding techniques. We presented a suite of 100 real graphs spanning multiple domains and evaluated state-of-the-art embedding approaches
against traditional approaches. We established that graph embedding approaches outperform traditional methods on a variety of different graphs. Further, we showed that the performance of a method depends on multiple attributes of the graph and of the evaluation: (i) the size of the graph, (ii) the graph density, (iii) the embedding dimension, and (iv) the evaluation metric. The performance also varies tremendously with the domain of the graph. Finally, we presented an open-source Python library, named GEM-BEN, which provides a benchmarking tool to evaluate any new embedding approach against the existing methods.

There are three promising research directions: (i) automating hyperparameter selection, (ii) graph ensemble techniques, and (iii) generating realistic synthetic data. During the experimentation, we observed that considerable effort is spent identifying optimal hyperparameters for graph embedding approaches; automating this process can help in their evaluation. Secondly, we showed in this work that the optimal approach for a graph depends on its domain and other properties. Extending this idea, different subgraphs of a graph may benefit from different embedding approaches; combining multiple approaches on a graph is non-trivial but can help improve link prediction performance. Finally, we showed that graph embedding approaches do not perform well on synthetic graphs, as the state-of-the-art synthetic graph generators are too simple, with few parameters. We believe that generating more realistic synthetic graphs is an important problem which can benefit the testing of new approaches.

Chapter 5

Embedding Networks with Edge Attributes

5.1 Introduction

Networks exist in various forms in the real world, including author collaboration [68, 154], social [64, 216], router communication [7, 147], biological interaction [203, 165] and word co-occurrence networks [97, 27]. Many tasks can be defined on such networks, including visualization [142, 64], link prediction [135, 138], virtualization [236, 93], node clustering [50, 233] and classification [17, 223]. For example, predicting future friendships can be formulated as a link prediction problem on social networks; similarly, predicting page likes by a user and finding communities of friends can be regarded as node classification and clustering, respectively. Solving such tasks often involves finding representative features of nodes which are predictive of their characteristics. Automatic representation of networks in a low-dimensional vector space has recently gained much attention, and many network embedding approaches have been proposed to solve the aforementioned tasks [3, 169, 29, 198, 78, 158]. Existing network embedding algorithms focus either on vanilla networks, i.e. networks without attributes, or on networks with node attributes. Methods for vanilla networks [73] obtain the embedding by optimizing an objective function which preserves various properties of the graph. Of these, many methods [3, 169, 29, 198, 158] preserve the proximity of nodes, whereas others [78, 61, 150]
learn representations which are capable of identifying structural equivalence as well.

Figure 5.1: Users i and j are both engaged in work, family and sport topics. Aggregating their topics over different interactions causes loss of valuable information.

More recent attempts [95, 94, 33] preserve the proximity of nodes while incorporating both the topological structure and node attributes. They learn each embedding separately and align them into a unified space. However, these approaches fail to incorporate the edge attributes present in many real world networks, which can provide more insight into the interactions between nodes. For example, in collaboration networks in which authors are the nodes and edges represent co-authored papers, the content of a co-authored paper can be used to characterize the interaction between the author nodes. Learning a representation which captures node proximity and edge attributes, or labels (in this work we use the terms edge attribute and edge label interchangeably), can yield useful insights into the network topology. Although the edge labels can be combined to form node labels, such aggregation incurs a loss of information, as illustrated by our experiments. Figure 5.1 shows the effect of this aggregation. Nodes i and j are both involved in work, family and sport interactions with different people, and combining the labels obfuscates the presence of a relationship between i and j. Given that the type of interaction between i and k is "work" and between k and j is "family", the likelihood of interaction between i and j is lower in (a) compared to (b), even though i and j have the same aggregated node labels in both cases.

Furthermore, the task of capturing edge attributes is challenging. Firstly, the edge attributes can be sparse and noisy; for example, the abstracts of papers in a collaboration network consist of only a few words out of a large vocabulary. Secondly, capturing the heterogeneity of the information provided by the attributes and the network structure is non-trivial.

To overcome the above challenges, in this section we introduce Edge Label Aware Network Embedding (ELAINE), a model which is capable of utilizing edge labels to learn a unified representation. As opposed to linear models, ELAINE uses multiple non-linear layers to learn intricate patterns of interaction in the graph. Moreover, the proposed model preserves higher order proximity by simulating multiple random walks from each node, and it preserves social roles using statistical features and edge labels. It jointly optimizes the reconstruction loss of node similarity and edge label reconstruction to learn a unified embedding. We focus our experiments on two tasks: (a) link prediction, which predicts the most likely unobserved edges, and (b) node classification, which predicts labels for each node in the graph. We compare ELAINE with the state-of-the-art algorithms for graph embedding, and we show how each component of ELAINE affects the performance on these tasks. We show results on several real world networks, including collaboration networks and social networks. Our experiments demonstrate that using a deep model which preserves higher order proximity, social roles and edge labels significantly outperforms the state-of-the-art. Overall, this section makes the following contributions:

1. We propose ELAINE, a model for jointly learning the edge labels and the network structure.

2. We demonstrate that edge labels can improve performance on link prediction and node classification.

3. We extend the deep architecture for network representation to efficiently preserve higher order proximity and social roles.

The rest of the section is organized as follows.
Section 5.2 provides a summary of the methods proposed in this domain and their differences with our model. In Section 5.3, we provide the definitions required to understand the problem and the models discussed next. We then introduce our model in Section 5.4 and describe our experimental setup and results (Sections 5.5 and 5.6). Finally, in Section 5.6.5 we draw our conclusions and discuss potential applications and future research directions.

5.2 Related Work

Network embedding techniques generally come in two flavors: the first group uses the pure network structure to map nodes into the embedding space, which we call vanilla network embedding; the second group combines two sources of information, the topological structure of the graph along with node or link attributes, and is called attributed network embedding.

5.2.1 Vanilla Network Embedding

A variety of embedding techniques exist for vanilla networks, where no meta-data is available besides the network structure. In general they fall into three broad categories: graph factorization, random-walk based, and deep learning based models. Methods such as Locally Linear Embedding [178], Laplacian Eigenmaps [14], Graph Factorization [3], GraRep [29] and HOPE [158] factorize a representative matrix of the graph, e.g. the node adjacency matrix, to obtain the embedding; different factorization techniques have been proposed depending on the properties of the representative matrix. Random-walk based techniques are mainly recognized for their power to preserve higher order proximity between nodes by maximizing the probability of occurrence of subsequent nodes in fixed-length random walks [169]. node2vec [78] defines a biased random walk to capture structural equivalence, although the structural similarity it captures is limited by the Skip-Gram window size. struc2vec [61] also uses a weighted random walk to generate sequences of structurally similar nodes, independent of their position in the network. Recently, deep learning models have been proposed with the ability to capture the non-linear structure in data: SDNE [213], DNGR [30] and VGAE [113] use deep autoencoders to embed the nodes while capturing the nonlinearity in the graph. Note finally that an alternative approach to network embedding is provided by generative graphical models such as the Mixed Membership Stochastic Blockmodel [5] and its extensions that take node attributes into account [149, 32, 39].

5.2.2 Attributed Network Embedding

Recently, a few works have started to use node attributes in addition to the network structure in the embedding process. HNE [33] embeds multi-modal data of heterogeneous networks into a common space; it first applies nonlinear feature transformations to the different object types and then projects the heterogeneous components into a unified space with a linear transformation. LANE [95] incorporates node labels into the network structure and attributes to learn the embedding representation in a supervised manner; it embeds the attributed network and the labels into latent representations separately and then jointly embeds them into a unified representation. A distributed joint learning process is proposed in [94] as a scalable solution, applicable to graphs with large numbers of nodes and edges. Incorporating node and edge attributes to study networks also appears in applications such as Virtual Network Embedding (VNE). Works in this domain ([37, 70]) utilize node capacities and channel bandwidth of the substrate network to find a subset of nodes satisfying each virtual network request.
However, such methods do not learn a representation for each node and are application specific. Existing network embedding approaches do not directly capture edge attributes. Edge attributes can be aggregated, assigned to nodes and provided as input to these methods; however, this aggregation loses information about the type of interaction between a node and its neighbors. To overcome this challenge, in this work we introduce a model that utilizes the edge labels in the process of mapping nodes into a unified space. The model jointly learns the network structure and edge attributes. Furthermore, our model also captures social roles and showcases the use of a variational autoencoder in capturing such information.

5.3 Problem Statement

Table 5.1: Summary of notation
G — graphical representation of the data
V — set of vertices in the graph (size n)
E — set of edges in the graph (size m)
A — weighted adjacency matrix, $n \times n$
$E^a$ — edge attribute matrix, $m \times p$
d — number of dimensions
Y — embedding of the graph, $n \times d$
S — node proximity matrix, $n \times n$
R — social role matrix, $n \times k$
F — feature matrix of the graph, $n \times (n+k)$
$\alpha_v, \alpha_l, \alpha_r$ — variational, lasso and ridge loss coefficients

We denote a weighted graph as $G(V,E)$, where $V$ is the vertex set and $E$ is the edge set. The weighted adjacency matrix of $G$ is denoted by $A$. If $(i,j) \in E$, we have $A_{ij} > 0$ denoting the weight of edge $(i,j)$; otherwise $A_{ij} = 0$. We use $a_i = [A_{i,1}, \ldots, A_{i,n}]$ to denote the $i$-th row of the adjacency matrix. We use $E^a \in \mathbb{R}^{m \times p}$ to denote the edge attribute matrix and $e^a_{ij} = [e^a_{ij1}, \ldots, e^a_{ijp}]$ to denote the attributes of edge $(i,j)$, where $p$ is the number of edge attributes. The notation used in this section is summarized in Table 5.1.

We define our problem as follows: given a graph $G = (V,E)$ and associated edge attributes $E^a$, we aim to represent each node $u$ in a low-dimensional vector space $y_u$ by learning a mapping $f: \{V, E^a\} \rightarrow \mathbb{R}^d$, namely $y_v = f(v, E^a)\ \forall v \in V$, such that $d \ll n$ and the mapping $f$ preserves the network structure and edge attributes. Intuitively, if two nodes $u$ and $v$ are "similar" in graph $G$, their embeddings $y_u$ and $y_v$ should be close to each other in the embedding space. We use the notation $f(G) \in \mathbb{R}^{n \times d}$ for the embedding matrix of all nodes in the graph $G$. Note that the embedding of an edge $(u,v)$ is defined as $g(u,v) = [y_u, y_v]$, i.e. the concatenation of the embeddings of nodes $u$ and $v$; it can be written as $g: E \rightarrow \mathbb{R}^{2d}$. We use $g(u,v)$ to reconstruct the edge label $e^a_{uv}$, which enables us to infer missing edge labels using the adjacency of the incident nodes.

5.4 ELAINE

We propose an edge label aware information network embedding method, ELAINE, which models $l$th-order proximity, social role features and edge labels using a deep variational autoencoder. The core component of the model is a deep autoencoder which can learn a network embedding by minimizing the following loss function:

$L = \sum_{i=1}^{n} \|(\hat{a}_i - a_i) \odot b_i\|_2^2 = \|(\hat{A} - A) \odot B\|_F^2$  (5.1)

Figure 5.2 illustrates this autoencoder.

Figure 5.2: Traditional deep autoencoder model. The encoder maps the adjacency row $a_i$ through layers $y_i^{(1)}, \ldots, y_i^{(K)}$ with weights $W^{(k)}$, and the decoder reconstructs $\hat{a}_i$ through layers $\hat{y}_i^{(K-1)}, \ldots, \hat{y}_i^{(1)}$ with weights $\hat{W}^{(k)}$.
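For concreteness, the objective in Eq. (5.1) can be sketched as follows. This is a minimal PyTorch illustration: the layer sizes, the sigmoid output and the penalty value beta for observed links are assumptions, not the settings used by ELAINE or SDNE.

```python
import torch
import torch.nn as nn

class GraphAutoencoder(nn.Module):
    """Minimal deep autoencoder over adjacency rows a_i (cf. Figure 5.2)."""
    def __init__(self, n_nodes, d=128, hidden=500):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_nodes, hidden), nn.ReLU(),
                                     nn.Linear(hidden, d))
        self.decoder = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_nodes), nn.Sigmoid())

    def forward(self, a):
        y = self.encoder(a)           # embedding y_i
        return self.decoder(y), y     # reconstruction a_hat_i and embedding


def weighted_reconstruction_loss(a_hat, a, beta=10.0):
    """Eq. (5.1): ||(A_hat - A) ⊙ B||_F^2, where b_ij = beta if a_ij > 0 and 1 otherwise,
    so that observed edges are penalized more heavily than unobserved ones."""
    b = torch.where(a > 0, torch.full_like(a, beta), torch.ones_like(a))
    return (((a_hat - a) * b) ** 2).sum()
```

A training step would encode a minibatch of adjacency rows, decode them, and backpropagate this loss; ELAINE replaces the plain adjacency input with the richer feature matrix described below and adds further loss terms.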
The objective function penalizes inaccurate reconstruction of the node neighborhood. As many legitimate links are not observed in the network, a weight $b_i$ is traditionally used to impose a larger penalty on the reconstruction of observed edges [213]. Although the above model can learn network representations which reconstruct the graph well, it suffers from four challenges. Firstly, as the model reconstructs the observed neighborhood of each vertex, it only preserves second order proximity of nodes. Wang et al. [213] extend the model to preserve first order proximity, but their model fails to capture higher order proximities: if two nodes have disjoint neighborhoods, the model will keep them apart regardless of how similar their neighborhoods are. Secondly, the model is prone to overfitting, leading to satisfactory reconstruction performance but sub-par performance on tasks like link prediction and node classification. Wang et al. [213] use $l_1$ and $l_2$ regularizers to address this issue, but we show that using variational autoencoders can achieve better performance; for instance, the link prediction MAP for the Hep-th collaboration network increases by a factor of 1.3% when using a variational autoencoder. Thirdly, the model does not explicitly capture social role information. Real world networks often have a role based structure, an understanding of which can help with various prediction tasks. Lastly, the model does not consider edge labels. We show that incorporating edge label reconstruction leads to improved performance on various tasks; as an example, for the Hep-th network, the link prediction MAP increases by a factor of 3.6% with the addition of edge attributes. To address the above challenges, we propose a random walk based deep variational autoencoder model whose objective jointly optimizes the higher order neighborhood, role based features and edge label reconstruction.

5.4.1 Variational Autoencoder

As we aim to find the low-dimensional manifold the original graph lies in, we want to learn a representation which is maximally informative of the observed edges and edge labels. At the same time, as the autoencoder penalizes reconstruction error, it encourages perfect reconstruction at the cost of overfitting to the training data. This is particularly problematic when learning representations for graphs, as networks are constructed from interactions which may be incomplete or noisy. We want to find embeddings which are robust to such noise and can help us in tasks such as link prediction and node classification.

Figure 5.3: Edge label aware embedding model. ELAINE extracts higher order relations between nodes using random walks and social role based features. The coupled autoencoder jointly optimizes these features and edge attributes to obtain a unified representation.

Figure 5.4: Importance of capturing higher order proximity and social roles. (a) Nodes i and j are more similar in (ii) than in (i), but first order proximity fails to capture this. (b) Nodes i and k have similar roles but are far apart in the network; social role indicative statistical features can capture the similarity of these nodes.

Many methods have been proposed to improve the generalization of autoencoders for tasks like image and speech recognition [15]. Of these, sparse autoencoders [19], which use $L_1$ and $L_2$ penalties on weights, and stacked denoising autoencoders [212], which corrupt the autoencoder inputs by adding Gaussian noise, have been shown to improve performance in graph related tasks [213, 30]. Nonetheless, these models suffer from various challenges.
The former does not ensure a smooth manifold, and the latter is sensitive to the number of corrupted inputs generated [176]. We propose to use a variational autoencoder for graphs and illustrate in Section 5.6 that it can improve performance on different tasks.

Variational autoencoders (VAEs) look at autoencoders from a generative network perspective. The model aims to maximize $P(X) = \int P(X|z;\theta)\, P(z)\, dz$, where $X$ is the training data and $z$ is the latent variable. VAEs assume $P(X|z;\theta)$ to be normally distributed, i.e. $P(X|z;\theta) = \mathcal{N}(X \mid f(z;\theta), \sigma^2 I)$, where $f(z;\theta)$ is the decoding of $z$ with the learned decoder parameters $\theta$. Computing the integral is intractable and is approximated by summation. Moreover, for most $z$, $P(X|z)$ will be nearly zero, so we need to find the values of $z$ which are more likely given the data. This can be formulated as finding the distribution $Q(z|X)$, which is approximated using the encoder, enabling us to compute $E_{z\sim Q}\, P(X|z)$ tractably. The model assumes a normal form for $Q(z|X)$: $Q(z|X) = \mathcal{N}(z \mid \mu(X;\phi), \Sigma(X;\phi))$, where $\phi$ are the parameters of the encoder. Typically, $\Sigma$ is constrained to be a diagonal matrix to decouple the latent variables. The optimization then reduces to maximizing $E_{z\sim Q}[\log P(X|z)] - D_{KL}[Q(z|X)\,\|\,P(z)]$. The second term is the KL-divergence between two multivariate Gaussian distributions, and the first term is the likelihood of reconstruction given the latent variables, constrained by the distribution learned by the autoencoder.

In practice, VAEs can be trained with backpropagation by minimizing the sum of two terms: (1) the reconstruction loss and (2) the KL-divergence between the latent variable distribution and a unit Gaussian. The variance of the reconstruction controls the generalization of the model and can be treated as the coefficient of the KL-divergence loss. This coefficient ensures a smoother manifold of autoencoder weights, leading to better prediction performance.

5.4.2 Higher Order Proximity and Role Preservation

Nodes in a network are related to each other through many degrees of connection: some nodes have direct connections while others are connected through paths of varying lengths. Moreover, nodes may take several different roles. For example, in web graphs, nodes can be broadly classified into hubs and authorities: hubs are nodes which refer to other nodes, i.e. have high out-degree, whereas authorities are nodes which are linked to by other nodes. A good embedding should preserve such higher order and role based relations between nodes. An autoencoder that naively uses node adjacency as input cannot achieve this, as shown in Figure 5.4 (a): nodes i and j have different neighborhoods and the model cannot differentiate between (i) and (ii). In both cases the model will keep them far apart, although in (ii) the nodes are more similar. Similarly, in Figure 5.4 (b), nodes i and k are structurally similar, but proximity based methods cannot utilize this.

5.4.2.1 Random Walks

To preserve higher order proximities, we obtain global distance based similarities of each node with the rest of the nodes. One way to obtain such a set of vectors is to use metrics such as the Katz Index [108], Adamic-Adar [1] and Common Neighbors [153]. Although such metrics capture global proximities accurately, their computation is inefficient, with time complexity up to $O(n^3)$. We overcome this inefficiency by approximating them using random walks [9].
Each random walk,fv i;1 ;v i;2 :::v i;l g, from node i generates a node j with probability: P (v i;j jv i;j1 ) = 8 > > < > > : 1 dj1 if (v i;j1 ;v i;j )2E 0 otherwise where d k is the degree of node k. Note that since a random walk of length l from node i is equivalent to a random walk of length l 1 for node v i;1 , generating k random walks of length l only requires O(k) time each node. 5.4.2.2 Role preserving features Social roles in a network are characterized by various local and global statistics. For example, high degree can be re ective of social importance. Broadly, we classify role discriminating features into two categories: (a) statistical features, and (b) edge attributes. We consider the following statistical features which have been shown to correlate with social roles[85]: (i) node's degree, (ii) weighted degree, (iii) clustering coecient, (iv) eccentricity, (v) structural hole and (vi) local gatekeeper. We append these features with node's neighborhood as input to our model. Having such statistical features helps obtain an embedding which preserves social roles. On the other hand, a node can take dierent roles with dierent neighbors (henceforth referred as interactive roles) which cannot be captured by such statistical features. For example, in a collaboration network, author i may take the role of Professor with his student j and colleague with another professor k. Identifying such distribution of roles can help model the network more accurately. For this we use the edge attributes which can be re ective of such interactions. Concretely, we consider the topics of conversation between nodes and jointly optimize their reconstruction of node neighborhood reconstruction. 84 5.4.3 Incorporating edge labels Autoencoder dened above takes node neighborhood and statistical role preserving features as input and aims to reconstruct them. One possible approach to incorporate edge attributes is to aggregate them for each node and append them with other node features. The drawback of this approach is that information loss can incur following aggregation. Such aggregation cannot preserve interactive roles between nodes. We propose to overcome this problem by coupling copies of autoencoders for nodes i and j. The model is composed of a coupled autoencoder and an edge attribute decoder, Figure 5.3. The intuition is to force the embeddings of nodes i and j to capture information pertaining to the attributes of the edge between them. This is ensured by adding the edge attribute reconstruction loss to the objective function. Thus, we learn model parameters by minimizing a loss function with the following terms: 5.4.3.1 Neighborhood and social role reconstruction The lth-order neighborhood of each node along with the social role preserving statistical features: L n =k([ ^ S; ^ R] [S;R])Bk 2 F ; where each row of S2R nn and R2R nr compose of neighborhood similarity and role statistics respectively. Henceforth, we will refer to [ ^ S; ^ R] by ^ F2R nn+r and [S;R] by F2R nn+r . 
5.4.3 Incorporating edge labels

The autoencoder defined above takes the node neighborhood and the statistical role preserving features as input and aims to reconstruct them. One possible approach to incorporate edge attributes is to aggregate them for each node and append them to the other node features. The drawback of this approach is the information loss that such aggregation incurs: it cannot preserve the interactive roles between nodes. We propose to overcome this problem by coupling copies of the autoencoder for nodes $i$ and $j$. The model is composed of a coupled autoencoder and an edge attribute decoder (Figure 5.3). The intuition is to force the embeddings of nodes $i$ and $j$ to capture information pertaining to the attributes of the edge between them, which is ensured by adding the edge attribute reconstruction loss to the objective function. Thus, we learn the model parameters by minimizing a loss function with the following terms.

5.4.3.1 Neighborhood and social role reconstruction

The $l$th-order neighborhood of each node along with the social role preserving statistical features:

$L_n = \|([\hat{S}, \hat{R}] - [S, R]) \odot B\|_F^2$,

where each row of $S \in \mathbb{R}^{n \times n}$ and $R \in \mathbb{R}^{n \times r}$ contains the neighborhood similarities and role statistics, respectively. Henceforth, we will refer to $[\hat{S}, \hat{R}]$ by $\hat{F} \in \mathbb{R}^{n \times (n+r)}$ and to $[S, R]$ by $F \in \mathbb{R}^{n \times (n+r)}$.

5.4.3.2 Edge label/attribute reconstruction

For each pair of nodes, we reconstruct the attributes of the edge between them:

$L_e = \|\hat{E}^a - E^a\|_F^2$,

where each row $i$ of $E^a \in \mathbb{R}^{m \times p}$ is the vector of attributes of the $i$th edge.

Algorithm 1: ELAINE
Function ELAINE(Graph G = (V, E), edge attributes $E^a \in \mathbb{R}^{m \times p}$, dimension d, random walk parameters rw_param):
    S <- RandomWalk(G, rw_param)
    R <- GetSocialRoles(G)
    F <- [S, R]
    theta <- RandomInit()
    Construct the training set D = {(f_i, f_j, e^a_ij)} for each edge e = (v_i, v_j) in E, where f_i, f_j are rows of F
    for iter = 1 ... I do
        randomly sample a minibatch M from D
        L <- L_n + alpha_1 * L_e + L_reg
        grad <- dL/dtheta
        theta <- UpdateGradAdam(theta, grad)
    Y <- EncoderForwardPass(G, theta)
    return Y

5.4.3.3 Regularization

To avoid overfitting, we use three types of regularization: (a) lasso ($L_l$), (b) ridge ($L_r$), and (c) a variational loss ($L_v$), defined as

$L_v = D_{KL}(Q(z|X)\,\|\,P(z))$,

$L_l = \sum_{k=1}^{K} \|W^{(k)}\|_{sum} + \|\hat{W}^{(k)}\|_{sum} + \|\hat{W}_e^{(k)}\|_{sum}$,

$L_r = \sum_{k=1}^{K} \|W^{(k)}\|_F^2 + \|\hat{W}^{(k)}\|_F^2 + \|\hat{W}_e^{(k)}\|_F^2$,

$L_{reg} = \alpha_v L_v + \alpha_l L_l + \alpha_r L_r$,

where $Q(z|X)$ corresponds to the encoder and $P(z)$ is the prior, assumed to be a unit Gaussian. The lasso regularizer is used to ensure that the prediction for a node is independent of most of the nodes in the graph, making it more robust. We also use the variational loss for this purpose, as it ensures a smoother manifold of autoencoder weights [51]. The overall objective function thus becomes:

$L = L_n + \alpha_1 L_e + L_{reg}$.  (5.2)
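Before deriving the gradients, the sketch below shows how the joint objective of Eq. (5.2) could be assembled for one minibatch in PyTorch. The function and tensor names are hypothetical, the coefficient values are placeholders, and the variational term uses the standard closed-form KL divergence for a diagonal Gaussian posterior — an assumption about the implementation rather than a description of the released code.

```python
import torch

def kl_to_unit_gaussian(mu, logvar):
    """Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the batch."""
    return -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

def joint_loss(f_hat, f, b, ea_hat, ea, mu, logvar, weights,
               alpha_1=1e-4, alpha_v=1e-4, alpha_l=1e-5, alpha_r=1e-4):
    """Mirrors Eq. (5.2): L = L_n + alpha_1 * L_e + L_reg for one minibatch.

    f_hat, f   : reconstructed and true [S, R] feature rows for the sampled nodes
    b          : penalty matrix B (larger entries on observed links)
    ea_hat, ea : reconstructed and true edge attributes for the sampled edges
    mu, logvar : parameters of the variational posterior Q(z|X)
    weights    : iterable of weight tensors used for the lasso / ridge terms
    """
    l_n = (((f_hat - f) * b) ** 2).sum()        # neighborhood + role reconstruction
    l_e = ((ea_hat - ea) ** 2).sum()            # edge label reconstruction
    l_v = kl_to_unit_gaussian(mu, logvar)       # variational regularizer
    l_l = sum(w.abs().sum() for w in weights)   # lasso
    l_r = sum((w ** 2).sum() for w in weights)  # ridge
    return l_n + alpha_1 * l_e + alpha_v * l_v + alpha_l * l_l + alpha_r * l_r
```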
As we only process the data per batch, the space complexity if O((n +r)B) where B is the batch size. 5.5 Experiments In this section, we rst describe the data sets used and then discuss the baselines we use to compare our model. This is followed by the evaluation metrics for our experiments and parameter settings. All the experiments were performed on a Ubuntu 14.04.4 LTS system with 32 cores, 128 GB RAM and clock speed of 2.6 GHz. The GPU used was Nvidia Tesla K40C. 88 Twitter Hep-th Enron Figure 5.5: Precision@k and MAP of link prediction for dierent data sets. 5.5.1 Datasets We conduct experiments on four real-world datasets to evaluate our proposed algorithm. The datasets are summarized in Table 5.2. Hep-th [68]: The original data set contains abstracts of papers in High Energy Physics Theory conference in the period from January 1993 to April 2003. We create a collaboration network for the rst ve years. We get the node labels using the Google Scholar API 2 to obtain university labels for each author. We apply NMF [191] on the set of abstracts to get topic distribution for each abstract. We aggregate the topic distribution of all the coauthored papers between two authors to get the edge attributes. Twitter [60]: The data set consists of tweets on the French election day, 7th May, 2017. The tweets were obtained using keywords related to election including France2017, LePen, Macron and FrenchElections. We construct the mention network by connecting users who mention each other in a tweet. The topic distribution of the tweet between user i and j, obtained from NMF, is regarded as the edge attribute e a ij . 2 https://pypi.python.org/pypi/scholarly/0.2 89 20-Newsgroup 3 : This dataset contains about 20,000 newsgroup documents each corresponding to one of 20 topics. For our experiments, we selected all documents in three news group \computer graphics", \sport-baseball" and \politics and guns". From this we construct a document similarity graph using the cosine similarity of their tf-idf vectors. Similar to Hep-th, we use NMF to get topic distribution of each document and use common topics between documents as the edge attributes. Enron [114]: This dataset contains emails communicated among about 150 users, mostly senior management of Enron. We connect two users if they have exchanged an email. Edge attribute between node i and j is the extracted topics from each set of emails between them using NMF. 5.5.2 Baselines We compare our model with the following state-of-the-art methods: Graph Factorization (GF) [3]: It factorizes the adjacency matrix with regularization. Structural Deep Network Embedding (SDNE) [213]: It uses deep autoencoder along with Laplacian Eigenmaps objective to preserve rst and second order proximities. Higher Order Proximity Preserving [158] (HOPE): It factorizes the higher order similarity matrix between nodes using generalized singular value decomposition [160]. node2vec [78]: It preserves higher order proximity by maximizing the probability of occurrence of subsequent nodes in xed length biased random walks. They use shallow neural networks to obtain the embeddings. DeepWalk is a special case of node2vec with the random walk bias set to 0. 3 http://qwone.com/ ~ jason/20Newsgroups/ 90 5.5.3 Evaluation Metrics In our experiments, we evaluate our model on tasks of link prediction, node classication and visualization. For link prediction, we use precision@k and Mean Average Precision (MAP) as our metric. For node classication, we use microF 1 and macroF 1. 
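As a rough illustration of how the edge attributes described above can be derived from text, the sketch below applies NMF to tf-idf vectors with scikit-learn; the preprocessing choices (English stop-word removal, row normalization, number of topics) are our assumptions rather than the exact thesis pipeline.

# Sketch: topic-distribution edge attributes E_a from the text exchanged on each edge.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

def edge_topic_attributes(edge_texts, n_topics=100):
    """edge_texts: dict mapping (i, j) -> concatenated text exchanged between i and j."""
    edges = list(edge_texts.keys())
    X = TfidfVectorizer(stop_words="english").fit_transform(edge_texts[e] for e in edges)
    topics = NMF(n_components=n_topics, init="nndsvd").fit_transform(X)  # (m, p) matrix
    return edges, topics / (topics.sum(axis=1, keepdims=True) + 1e-12)   # row-normalize

edges, E_a = edge_topic_attributes({(0, 1): "gauge theory strings",
                                    (1, 2): "black hole entropy"}, n_topics=2)
print(np.round(E_a, 2))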
Figure 5.6: Node classification results for different data sets (20-Newsgroup and Hep-th).

The formulae used for these metrics are as follows:

precision@k: the fraction of correct predictions among the top $k$ predictions. It is defined as $\frac{|E_{pred}(k) \cap E_{gt}|}{k}$, where $E_{pred}$ and $E_{gt}$ are the predicted and ground truth edges respectively.

MAP: averages the precision over all nodes. It can be written as $MAP = \frac{\sum_i AP(i)}{|V|}$, where $AP(i) = \frac{\sum_k precision@k(i) \cdot \mathbb{I}\{E_{pred_i}(k) \in E_{gt_i}\}}{|\{k : E_{pred_i}(k) \in E_{gt_i}\}|}$ and $precision@k(i) = \frac{|E_{pred_i}(1:k) \cap E_{gt_i}|}{k}$.

macro-F1: in a multi-label classification task, the average F1 over all labels, i.e. $macro\text{-}F1 = \frac{\sum_{l \in \mathcal{L}} F1(l)}{|\mathcal{L}|}$, where $F1(l)$ is the F1-score for label $l$.

micro-F1: calculates F1 globally by counting the total true positives, false negatives and false positives, giving equal weight to each instance. It is thus $\frac{2PR}{P+R}$, where $P$ and $R$ are the overall precision and recall respectively.

5.5.4 Parameter settings

In our experiments, we use two hidden layers of sizes [500, 300] for the feature encoder and decoder. For the edge attribute decoder, we experiment with a single hidden layer of 1000 neurons and with no hidden layer. Optimal values of the other hyperparameters, such as $\lambda_1$, $\nu_r$, $\nu_l$ and $\nu_v$, are obtained using grid search over $[10^{-5}, 10^{3}]$ in factors of 10. The maximum number of iterations $I$ is set to 100 in all our experiments.

5.6 Results and Analysis

In this section, we present the results of our model on link prediction and node classification and compare them against the baselines. Moreover, we discuss the contribution of each component of our model to the overall precision gain. We then discuss the sensitivity of our model to different hyperparameters.

5.6.1 Link Prediction

Information networks are meant to capture interactions in the real world. This translation of interactions can be noisy and inaccurate. Predicting missing links in the constructed networks, as well as links likely to occur in the future, is an important and difficult task. We test our model on this link prediction task to assess its generalizability. For each network, we randomly hide 20% of the network edges. We use the rest of the network to learn the embeddings of nodes and rank the likelihood of each unobserved edge to predict the missing links. As the number of node pairs for a network of size $N$ is $N(N-1)/2$, we randomly sample 1024 nodes for evaluation (similar to [73]). We draw 5 samples for each data set and report the mean and standard deviation of the precision and MAP values.

Figure 5.5 illustrates the link prediction precision@k and MAP values of the methods on our data sets. We observe that our model significantly outperforms the baselines on Hep-th. This implies that using the topic distribution of abstracts can help us understand the relations between authors. On Twitter and Enron, we observe that the gain in precision@k is not as pronounced as the gain in MAP. Thus, our model improves predictions considerably for nodes with fewer incident edges, although its top predicted edges are only slightly better than the baselines. This follows intuition, since edge labels for such nodes provide more information about their relations with other nodes than for nodes with ample edge information. We also observe that our model achieves a larger improvement over the baselines on Hep-th and Enron compared to Twitter. This can be attributed to the characteristics of tweets, which tend to be more unstructured and noisy and hence more challenging to model.
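The link prediction metrics defined above can be computed directly from a matrix of predicted edge scores; the helpers below are our own small numpy sketch (variable names are illustrative), not the evaluation code used in the experiments.

# Sketch: precision@k and MAP for link prediction.
import numpy as np

def precision_at_k(scores, true_edges, k, observed=frozenset()):
    n = scores.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if (i, j) not in observed]
    pairs.sort(key=lambda e: scores[e], reverse=True)
    return sum(e in true_edges for e in pairs[:k]) / k

def mean_average_precision(scores, true_edges):
    n = scores.shape[0]
    aps = []
    for i in range(n):
        gt = {j for (u, j) in true_edges if u == i} | {u for (u, j) in true_edges if j == i}
        if not gt:
            continue
        ranked = np.argsort(-scores[i])
        hits, precisions = 0, []
        for rank, j in enumerate(ranked, start=1):
            if j in gt:                      # precision@k is accumulated only at hit positions
                hits += 1
                precisions.append(hits / rank)
        aps.append(np.mean(precisions))
    return float(np.mean(aps))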
From the low MAP values for all models, we may also conclude that the network itself is less structured compared to other data sets. Overall, gain in performance consistently for dierent kinds of data sets shows that our model can utilize edge attributes in dierent domains and improve link prediction performance. 5.6.2 Node Classication Table 5.3: Common interests of authors in Hep-th string theory theoretical physics physics quantum eld theory mathematical physics 21.64% 19.65% 16.17% 15.17% 13.68% cosmology quantum gravity particle physics high energy physics machine learning 9.20% 7.21% 7.21% 6.22% 4.48% supersymmetry black holes bioinformatics gravity noncomm. geometry 4.23% 4.23% 4.23% 3.98% 3.98% mathematics condensed matter neuroscience astrophysics quantum information 3.73% 3.48% 3.48% 3.23% 3.23% Node classication refers to the task of predicting missing labels for nodes. In these experiments, we evaluate the performance of our model on a multi-label node classication task, in which each node can be assigned one or more labels. To test our model against the baselines, we use the node embeddings learned by the models as input to a one-vs-rest logistic regression using the LIBLINEAR library. We vary the train to test ratio from 20% to 90% and report results on microF 1 and macroF 1. As for link prediction, we perform each split 5 times and plot mean and standard deviation. We obtain the node labels for Hep-th collaboration network by searching for author interests using the Google Scholar API. In the interest of keeping label dimensionality and class imbalance 93 low, we consider the top 20 most common interests. Table 5.3 enumerates the top interests and percentage of authors with those interests. Figure 5.6 illustrates the results of our experiments. For Hep-th, we observe that our model achieves higher macroF 1 than the baselines although it doesn't show much improvement over SDNE in themicroF 1 scores. This can possibly be explained by the generality of most frequent labels. As Hep-th is a theoretical physics conference, having interests such as \theoretical physics" are not surprising. BettermacroF 1 performance shows that our model can predict low occurring class labels better by utilizing the topics discussed in the abstract. For 20-Newsgroup, our model outperforms baselines in both microF 1 and macroF 1. 5.6.3 Eect of Each Component As our model composes of several modules, we test the eect each component has on the prediction. For this experiment, to test each module, we set the coecients of the other components as zero. Table 5.4: Eect of each component on link prediction for Hep-th. Algorithm MAP Autoencoder (AE) 15% Variational Autoencoder (VAE) 15.2% VAE+ Higher Order (HO) 21.6% VAE+ HO + Roles (HO-R) 22.2% VAE+ HO-R + Node aggr. edge attributes (NA-ELAINE) 22.7% VAE + HO-R + Edge attributes (ELAINE) 23% Table 5.4 illustrates the results. We see that addition of Variational Autoencoder improves the MAP value by 0.2% showing that eective regularization can positvely impact generalizability. Adding higher order information benets the most, increasing the MAP by 6.4 %, followed by edge attributes and role based features which further improve MAP by 0.8% and 0.6% respectively. We also compare our model against NA-ELAINE (Node Aggregated ELAINE) to show the loss the of information in aggregating edge labels for each node. We observe NA-ELAINE improves 94 1 Figure 5.7: Eect of 1 , coecient of edge label reconstruction, on link prediction MAP. 
Figure 5.8: Eect of embedding dimensions on link prediction MAP. It shows that link prediction performance peaks at 128. MAP by 0.5% whereas ELAINE achieves a higher value of 23% achieving an increase of 0.8% over the edge attribute unaware features illustrating the above eect. 5.6.4 Hyperparameter Sensitivity In this set of experiments, we evaluate the eect of hyperparameters on the performance to understand their roles. Specically, we evaluate the performance gain as we vary the number of embedding dimensions, d, and the coecient of edge label reconstruction, 1 . We report MAP of link prediction on Hep-th for these experiments. 95 Eect of dimensions: We vary the number of dimensions from 2 to 256 in powers of two. Figure 5.8 illustrates its eect on MAP. As the number of dimensions increases the link prediction performance improves until 128 as higher dimensions are capable of storing more information. The performance degrades as we increase the dimensions further as the model overts on the observed edges and performs poorly on predicting new edges. Eect of 1 : The value of 1 determines the balance between neighborhood prediction and edge attribute prediction. We test the values from 10 2 , which prioritizes neighborhood and edge attribute, to 10 2 , which penalizes edge label reconstruction loss more heavily. Figure 5.7 shows the relation of MAP with 1 . We observe that initially as we increase 1 , MAP increases which suggests that the model benets from having link attributes. Increasing it further by 10 times drastically reduces the performance as now the embedding almost solely represents the edge labels which on their own cannot predict missing labels well. This demonstrates that having edge labels can signicantly improve link prediction performance. 5.6.5 Discussion We studied the eect of edge attributes on network tasks including link prediction and node classication. Our evaluation on real world networks reveals that edge attributes are useful in predicting unseen edges in the network. The improvement is highest when the edge attributes are well structured. For example, we consider abstracts and emails for the Hep-th and Enron datasets, which are both structured text thus yielding better performance. However, if the attributes are less structured, we see fewer improvements, e.g., the case of tweets. We also conclude that edge attributes can help predict classes of low degree nodes but may not improve predictions for the nodes with high degree. 96 Chapter 6 Updating Embedding in Dynamic Networks 6.1 Introduction Many important tasks in network analysis involve making predictions over nodes and/or edges in a graph, which demands eective algorithms for extracting meaningful patterns and constructing predictive features. Among the many attempts towards this goal, graph embedding, i.e., learning low-dimensional representation for each node in the graph that accurately captures its relationship to other nodes, has recently attracted much attention. It has been demonstrated that graph embedding is superior to alternatives in many supervised learning tasks, such as node classication, link prediction and graph reconstruction [3, 169, 29, 198, 78, 158]. Various approaches have been proposed for static graph embedding [73]. Examples include SVD based models [14, 178, 29, 158], which decompose the Laplacian or high-order adjacency matrix to produce node embeddings. 
Others include Random-walk based models [78, 169] which create embeddings from localized random walks and many others [198, 3, 30, 157]. Recently, Wang et al. designed an innovative model, SDNE, which utilizes a deep autoencoder to handle non-linearity to generate more accurate embeddings [213]. Many other methods which handle attributed graphs and generate a unied embedding have also been proposed in the recent past [33, 94, 95]. 97 However, in practical applications, many graphs, such as social networks, are dynamic and evolve over time. For example, new links are formed (when people make new friends) and old links can disappear. Moreover, new nodes can be introduced into the graph (e.g., users can join the social network) and create new links to existing nodes. Usually, we represent the dynamic graphs as a collection of snapshots of the graph at dierent time steps. Existing works which focus on dynamic embeddings often apply static embedding algorithms to each snapshot of the dynamic graph and then rotationally align the resulting static embeddings across time steps [79, 120]. Naively applying existing static embedding algorithms independently to each snapshot leads to unsatisfactory performance due to the following challenges: Stability: The embedding generated by static methods is not stable, i.e., the embedding of graphs at consecutive time steps can dier substantially even though the graphs do not change much. Growing Graphs: New nodes can be introduced into the graph and create new links to existing nodes as the dynamic graph grows in time. All existing approaches assume a xed number of nodes in learning graph embeddings and thus cannot handle growing graphs. Scalability: Learning embeddings independently for each snapshot leads to running time linear in the number of snapshots. As learning a single embedding is already computationally expensive, the naive approach does not scale to dynamic networks with many snapshots. Other approaches have attempted to learn embedding of dynamic graphs by explicitly imposing a temporal regularizer to ensure temporal smoothness over embeddings of consecutive snapshots [243]. This approach fails for dynamic graphs where consecutive time steps can dier signicantly, and hence cannot be used for applications like anomaly detection. Moreover, their approach is a Graph Factorization (abbreviated as GF hereafter) [3] based model, and DynGEM outperforms these models as shown by our experiments in section 6.5. [46] learn embedding of dynamic graphs, although they focus on a bipartite graphs specically for user-item interactions. 98 In this chapter, we develop an ecient graph embedding algorithm, referred to as DynGEM, to generate stable embeddings of dynamic graphs. DynGEM employs a deep autoencoder at its core and leverages the recent advances in deep learning to generate highly non-linear embeddings. Instead of learning the embedding of each snapshot from scratch, DynGEM incrementally builds the embedding of snapshot at time t from the embedding of snapshot at time t 1. Specically, we initialize the embedding from previous time step, and then carry out gradient training. This approach not only ensures stability of embeddings across time, but also leads to ecient training as all embeddings after the rst time step require very few iterations to converge. 
To handle dynamic graphs with a growing number of nodes, we incrementally grow the size of our neural network with our heuristic, PropSize, to dynamically determine the number of hidden units required for each snapshot. In addition to the proposed model, we also introduce rigorous stability metrics for dynamic graph embeddings. On both synthetic and real-world datasets, our experimental results demonstrate that our approach achieves similar or better accuracy in graph reconstruction and link prediction, and does so more efficiently than existing static approaches. DynGEM is also applicable to dynamic graph visualization and anomaly detection, which are not feasible for many previous static embedding approaches.

6.2 Definitions and Preliminaries

We denote a weighted graph as $G(V, E)$, where $V$ is the vertex set and $E$ is the edge set. The weighted adjacency matrix of $G$ is denoted by $S$. If $(v_i, v_j) \in E$, we have $s_{ij} > 0$ denoting the weight of that edge; otherwise $s_{ij} = 0$. We use $s_i = [s_{i,1}, \ldots, s_{i,|V|}]$ to denote the $i$-th row of the adjacency matrix.

Given a graph $G = (V, E)$, a graph embedding is a mapping $f: V \rightarrow \mathbb{R}^d$, namely $y_v = f(v)\ \forall v \in V$. We require that $d \ll |V|$ and that the function $f$ preserves some proximity measure defined on the graph $G$. Intuitively, if two nodes $u$ and $v$ are "similar" in graph $G$, their embeddings $y_u$ and $y_v$ should be close to each other in the embedding space. We use the notation $f(G) \in \mathbb{R}^{|V| \times d}$ for the embedding matrix of all nodes in the graph $G$.

In this chapter, we consider the problem of dynamic graph embedding. We represent a dynamic graph $\mathcal{G}$ as a series of snapshots, i.e. $\mathcal{G} = \{G_1, \ldots, G_T\}$, where $G_t = (V_t, E_t)$ and $T$ is the number of snapshots. We consider the setting with growing graphs, i.e. $V_t \subseteq V_{t+1}$; namely, new nodes can join the dynamic graph and create links to existing nodes. We treat deleted nodes as part of the graph with zero-weight edges to the rest of the nodes. We assume no relationship between $E_t$ and $E_{t+1}$: new edges can form between snapshots and existing edges can disappear.

A dynamic graph embedding extends the concept of embedding to dynamic graphs. Given a dynamic graph $\mathcal{G} = \{G_1, \ldots, G_T\}$, a dynamic graph embedding is a time-series of mappings $\mathcal{F} = \{f_1, \ldots, f_T\}$ such that mapping $f_t$ is a graph embedding for $G_t$ and all mappings preserve the proximity measure for their respective graphs.

A successful dynamic graph embedding algorithm should create stable embeddings over time. Intuitively, a stable dynamic embedding is one in which consecutive embeddings differ only by small amounts when the underlying graphs change only a little, i.e. if $G_{t+1}$ does not differ much from $G_t$, the embedding outputs $Y_{t+1} = f_{t+1}(G_{t+1})$ and $Y_t = f_t(G_t)$ should also change only by a small amount. To be more specific, let $S_t(\tilde{V})$ be the weighted adjacency matrix of the subgraph induced by the node set $\tilde{V} \subseteq V_t$, and let $F_t(\tilde{V}) \in \mathbb{R}^{|\tilde{V}| \times d}$ be the embedding of all nodes of $\tilde{V} \subseteq V_t$ in snapshot $t$. We define the absolute stability as

$$S_{abs}(\mathcal{F}; t) = \frac{\lVert F_{t+1}(V_t) - F_t(V_t) \rVert_F}{\lVert S_{t+1}(V_t) - S_t(V_t) \rVert_F}.$$

In other words, the absolute stability of an embedding $\mathcal{F}$ is the ratio of the difference between consecutive embeddings to the difference between the corresponding adjacency matrices.
Since this definition of stability depends on the sizes of the matrices involved, we define another measure, the relative stability, which is invariant to the sizes of the adjacency and embedding matrices:

$$S_{rel}(\mathcal{F}; t) = \frac{\lVert F_{t+1}(V_t) - F_t(V_t) \rVert_F}{\lVert F_t(V_t) \rVert_F} \Bigg/ \frac{\lVert S_{t+1}(V_t) - S_t(V_t) \rVert_F}{\lVert S_t(V_t) \rVert_F}.$$

We further define the stability constant:

$$K_S(\mathcal{F}) = \max_{\tau, \tau'} \big| S_{rel}(\mathcal{F}; \tau) - S_{rel}(\mathcal{F}; \tau') \big|.$$

We say that a dynamic embedding $\mathcal{F}$ is stable as long as it has a small stability constant. Clearly, the smaller $K_S(\mathcal{F})$ is, the more stable the embedding $\mathcal{F}$ is. In the experiments, we use the stability constant as the metric to compare the stability of our DynGEM algorithm to that of other baselines.

6.3 DynGEM: Dynamic Graph Embedding Model

Recent advances in deep unsupervised learning have shown that autoencoders can successfully learn very complex low-dimensional representations of data for various tasks [16]. DynGEM uses a deep autoencoder to map the input data to a highly nonlinear latent space and thereby capture the connectivity trends in a graph snapshot at any time step. The model is semi-supervised and minimizes a combination of two objective functions corresponding to the first-order and second-order proximities respectively. The autoencoder model is shown in Figure 6.1, and the notation used is summarized in Table 6.1 (as in [213]); symbols with a hat denote decoder quantities.

Table 6.1: Notations for the deep autoencoder

  n                                         number of vertices
  K                                         number of layers
  S = {s_1, ..., s_n}                       adjacency matrix of G
  X = {x_i}, i in [n]                       input data
  X̂ = {x̂_i}, i in [n]                      reconstructed data
  Y^(k) = {y_i^(k)}, i in [n]               hidden layer representations
  Y = Y^(K)                                 embedding
  θ = {W^(k), Ŵ^(k), b^(k), b̂^(k)}         weights, biases

The neighborhood of a node $v_i$ is given by $s_i \in \mathbb{R}^n$. For any pair of nodes $v_i$ and $v_j$ from graph $G_t$, the model takes their neighborhoods as input, $x_i = s_i$ and $x_j = s_j$, and passes them through the autoencoder to generate $d$-dimensional embeddings $y_i = y_i^{(K)}$ and $y_j = y_j^{(K)}$ at the output of the encoder. The decoder then reconstructs the neighborhoods $\hat{x}_i$ and $\hat{x}_j$ from the embeddings $y_i$ and $y_j$.

6.3.1 Handling growing graphs

Handling dynamic graphs of growing size requires a good mechanism for expanding the autoencoder model while preserving the weights learned at previous time steps. A key decision is how the number of hidden layers and the number of hidden units should grow as more nodes are added to the graph. We propose a heuristic, PropSize, which computes new sizes for all layers such that the sizes of consecutive layers are within a certain factor of each other.

PropSize: This heuristic computes the new sizes of the neural network layers at each time step and inserts new layers if needed. For the encoder, layer widths are computed for each pair of consecutive layers ($l_k$ and $l_{k+1}$), starting from the input layer ($l_1 = x$) and the first hidden layer ($l_2 = y^{(1)}$), until the following condition is satisfied for each consecutive layer pair:

$$size(l_{k+1}) \geq \rho \cdot size(l_k),$$

where $0 < \rho < 1$ is a suitably chosen hyperparameter. If the condition is not satisfied for a pair $(l_k, l_{k+1})$, the layer width of $l_{k+1}$ is increased to satisfy the heuristic. Note that the size of the embedding layer $y = y^{(K)}$ is always kept fixed at $d$ and is never expanded. If the PropSize rule is not satisfied at the penultimate and embedding layers of the encoder, we add more layers in between (with sizes satisfying the rule) until the rule is satisfied.
This procedure is also applied to the decoder layers, starting from the output layer ($\hat{x}$) and continuing inwards towards the embedding layer (i.e. $\hat{y} = \hat{y}^{(K)}$), to compute the new layer sizes. After deciding the number of layers and the number of hidden units in each layer, we adopt the Net2WiderNet and Net2DeeperNet approaches from [35] to expand the deep autoencoder. Net2WiderNet allows us to widen layers, i.e. add more hidden units to an existing neural network layer, while approximately preserving the function being computed by that layer. Net2DeeperNet inserts a new layer between two existing layers by making the new intermediate layer closely replicate the identity mapping; this can be done for ReLU activations but not for sigmoid activations. The combination of widening and deepening the autoencoder with PropSize, Net2WiderNet and Net2DeeperNet at each time step allows us to work with dynamic graphs whose number of nodes grows over time, and gives remarkable performance, as shown by our experiments.

6.3.2 Loss function and training

To learn the model parameters, a weighted combination of objectives is minimized at each time step:

$$L_{net} = L_{glob} + \alpha L_{loc} + \nu_1 L_1 + \nu_2 L_2,$$

where $\alpha$, $\nu_1$ and $\nu_2$ are hyperparameters appropriately chosen as the relative weights of the objective functions. $L_{loc} = \sum_{i,j=1}^{n} s_{ij} \lVert y_i - y_j \rVert_2^2$ is the first-order proximity term, which corresponds to the local structure of the graph. $L_{glob} = \sum_{i=1}^{n} \lVert (\hat{x}_i - x_i) \odot b_i \rVert_2^2 = \lVert (\hat{X} - X) \odot B \rVert_F^2$ is the second-order proximity term, which corresponds to the global neighborhood of each node in the graph and is preserved by an unsupervised reconstruction of each node's neighborhood. Here $b_i$ is a vector with $b_{ij} = 1$ if $s_{ij} = 0$ and $b_{ij} = \beta > 1$ otherwise, which penalizes the inaccurate reconstruction of an observed edge $e_{ij}$ more than that of unobserved edges. The regularizers $L_1 = \sum_{k=1}^{K} \big(\lVert W^{(k)} \rVert_1 + \lVert \hat{W}^{(k)} \rVert_1\big)$ and $L_2 = \sum_{k=1}^{K} \big(\lVert W^{(k)} \rVert_F^2 + \lVert \hat{W}^{(k)} \rVert_F^2\big)$, where $\lVert W \rVert_1$ denotes the sum of the absolute values of the entries of $W$, are added to encourage sparsity in the network weights and to prevent the model from overfitting the graph structure, respectively. DynGEM learns the parameters $\theta_t$ of this deep autoencoder at each time step $t$ and uses $Y_t^{(K)}$ as the embedding output for graph $G_t$.

Figure 6.1: DynGEM: Dynamic Graph Embedding Model. The figure shows two snapshots of a dynamic graph and the corresponding deep autoencoder at each step.

6.3.3 Stability by reusing previous step embedding

For a dynamic graph $\mathcal{G} = \{G_1, \ldots, G_T\}$, we train the deep autoencoder model fully on $G_1$ using a random initialization of the parameters $\theta_1$. For all subsequent time steps, we initialize the model parameters $\theta_t$ with the parameters $\theta_{t-1}$ of the previous time step, before widening/deepening the model. This results in a direct transfer of structural knowledge from $f_{t-1}$ to $f_t$, so the model only needs to learn the changes between $G_{t-1}$ and $G_t$. Hence the training converges in very few iterations for time steps $\{2, \ldots, T\}$. More importantly, it ensures stability by keeping the embedding $Y_t$ close to $Y_{t-1}$. Note that, unlike [243], we do not impose explicit regularizers to keep the embeddings at time steps $t-1$ and $t$ close: if the graph snapshots at times $t-1$ and $t$ differ significantly, then so should the corresponding embeddings $f_{t-1}$ and $f_t$. Our results in section 6.5 show the superior stability and faster runtime of our method over the other baselines.
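A minimal sketch of the PropSize heuristic from Section 6.3.1, under our reading of the rule: widen the intermediate encoder layers until size(l_{k+1}) >= rho * size(l_k), keep the embedding layer fixed at d, and insert new layers before it if needed. The exact expansion policy of the released code may differ.

# Sketch: PropSize layer-size computation for a growing input dimension.
def propsize(layer_sizes, new_input_size, rho=0.3):
    sizes = [new_input_size] + list(layer_sizes[1:])   # sizes[0] = input, sizes[-1] = d (fixed)
    for k in range(1, len(sizes) - 1):                 # widen intermediate layers if needed
        required = int(rho * sizes[k - 1])
        sizes[k] = max(sizes[k], required)
    # If the embedding layer is too small relative to the penultimate layer,
    # insert intermediate layers whose sizes satisfy the rule.
    while sizes[-1] < rho * sizes[-2]:
        sizes.insert(-1, int(rho * sizes[-2]))
    return sizes

print(propsize([500, 300, 100], new_input_size=2000, rho=0.3))  # -> [2000, 600, 300, 100]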
6.3.4 Techniques for scalability Previous deep autoencoder models [213] for static graph embeddings use sigmoid activation function and are trained with stochastic gradient descent (SGD). We use ReLU in all autoencoder layers to support weighted graphs since ReLU can construct arbitrary positive entries of s i . It also accelerates training, since the derivative of ReLU is straightforward to compute, whereas the derivative of sigmoid requires computing exponentials [69]. Lastly, ReLU allows gradients from both objectives L loc andL glob to propagate eectively through the encoder and averts the vanishing gradient eect [69] which we observe for sigmoid. We also use nesterov momentum [196] with properly tuned hyperparameters, which converges much faster as opposed to using pure SGD. Lastly, we observed better performance on all tasks with a combination of L1-norm and L2-norm regularizers. The pseducode of learning DynGEM model for a single snapshot of the dynamic graph is shown in algorithm 2. The pseduocode can be called repeatedly on each snapshot in the dynamic graph to generate the dynamic embedding. 6.4 Experiments 6.4.1 Datasets We evaluate the performance of our DynGEM on both synthetic and real-world dynamic graphs. The datasets are summarized in Table 6.2. 105 Algorithm 2: Algorithm: DynGEM Function DynGEM (Graph G t = (V t ;E t );G t1 = (V t1 ;E t1 ), Embedding Y t1 , autoencoder weights t1 From G t , generate adjacency matrix S, penalty matrix B; Create the autoencoder model with initial architecture; F [S, R]; Initialize t randomly if t = 1, else t = t1 ; SetF =f(f i ;f j ;e a ij )g for each e = (v i ;v j )2E, f i ;f j 2F ; ifjV t j>jV t1 j then Compute new layer sizes with PropSize heuristic; Expand autoencoder layers and/or insert new layers; for i = 1; 2;::: do Sample a minibatch M fromS; Compute gradientsr t L net of objective L net on M; Do gradient update on t with nesterov momentum; jVj jEj T SYN 1,000 79,800-79,910 40 HEP-TH 1,424-7,980 2,556-21,036 60 AS 7716 10,695-26,467 100 ENRON 184 63-591 128 Table 6.2: Statistics of datasets.jVj,jEj and T denote the number of nodes, edges and length of time series respectively. Synthetic Data (SYN): We generate synthetic dynamic graphs using Stochastic Block Model [214]. The rst snapshot of the dynamic graph is generated to have three equal-sized communities with in-block probability 0.2 and cross-block probability 0.01. To generate subsequent graphs, we randomly pick nodes at each time step and move them to another community. We use SYN to visualize the changes in embeddings as nodes change communities. HEP-TH [68]: The original dataset contains abstracts of paper in High Energy Physics Theory conference in the period from January 1993 to April 2003. For each month, we create a collaboration network using all papers published upto that month. We take the rst ve years data and generate a time series containing 60 graphs with number of nodes increasing from 1; 424 to 7; 980. Autonomous Systems(AS) [129]: This is a communication network of who-talks-to-whom from the BGP (Border Gateway Protocol) logs. The dataset contains 733 instances spanning from 106 November 8, 1997 to January 2, 2000. For our evaluation, we consider a subset of this dataset which contains the rst 100 snapshots. ENRON [114]: The dataset contains emails between employees in Enron Inc. from Jan 1999 to July 2002. We process the graph as done in [164] by considering email communication only between top executives for each week starting from Jan 1999. 
6.4.2 Algorithms and Evaluation Metrics

We compare the performance of the following dynamic embedding algorithms on several tasks:

SDNE: We apply SDNE independently to each snapshot of the dynamic network. (In all our SDNE baselines we replaced the sigmoid activation with ReLU for scalability and faster training times.)
[SDNE/GF]_align: We first apply the SDNE or GF algorithm independently to each snapshot and rotate the embedding as in [79] for alignment.
GF_init: We apply the GF algorithm with the embedding at time t initialized from the embedding at time t-1.
DynGEM: Our algorithm for dynamic graph embedding.

We set the embedding dimension d = 20 for ENRON and 100 for all other datasets. We use two hidden layers in the deep autoencoder, with initial sizes (which can later expand) for each dataset as follows: ENRON = [100, 80]; HEP-TH, AS and SYN = [500, 300]. The neural network structures were chosen by an informal search over a set of architectures. We set $\rho = 0.3$, the step-size decay for SGD to $10^{-5}$, and the momentum coefficient to 0.99. The other parameters are set via grid search with appropriate cross-validation as follows: $\alpha \in [10^{-6}, 10^{-5}]$, $\beta \in [2, 5]$, $\nu_1 \in [10^{-6}, 10^{-4}]$ and $\nu_2 \in [10^{-6}, 10^{-3}]$.

In our experiments, we evaluate the performance of the above models on the tasks of graph reconstruction, link prediction, embedding stability and anomaly detection. For the first two tasks, graph reconstruction and link prediction, we use Mean Average Precision (MAP) as our metric (see [213] for the definition). To evaluate the stability of the dynamic embeddings, we use the stability constant $K_S(\mathcal{F})$ defined in section 6.2. All experiments are performed on a Ubuntu 14.04.4 LTS system with 32 cores, 128 GB RAM and a clock speed of 2.6 GHz. The GPU used is an Nvidia Tesla K40C.

6.5 Results and Analysis

6.5.1 Graph Reconstruction

Embeddings, as good low-dimensional representations of a graph, are expected to reconstruct the graph accurately. We reconstruct the graph edges between pairs of vertices from the embeddings, using the decoder of our autoencoder model. We rank the pairs of vertices according to their reconstructed proximity and calculate the ratio of real links among the top k pairs of vertices as the reconstruction precision.

Table 6.3: Average MAP of graph reconstruction.

              SYN     HEP-TH   AS      ENRON
  GF_align    0.119   0.49     0.164   0.223
  GF_init     0.126   0.52     0.164   0.31
  SDNE_align  0.124   0.04     0.07    0.141
  SDNE        0.987   0.51     0.214   0.38
  DynGEM      0.987   0.491    0.216   0.424

The graph reconstruction MAP averaged over snapshots of our datasets is shown in Table 6.3. The results show that DynGEM outperforms all Graph Factorization based baselines, except on HEP-TH where its performance is comparable with the baselines.

6.5.2 Link Prediction

Another important application of graph embedding is link prediction, which tests how well a model can predict unobserved edges. A good representation of the network should not only be able to reconstruct the edges visible during training but also predict edges which are likely but missing from the training data. To test this, we randomly hide 15% of the network edges at time t (denoted $G_t^{hidden}$). We train a dynamic embedding using the snapshots $\{G_1, \ldots, G_{t-1}, G_t \setminus G_t^{hidden}\}$ and predict the hidden edges at snapshot t.

Table 6.4: Average MAP of link prediction.

              SYN     HEP-TH   AS     ENRON
  GF_align    0.027   0.04     0.09   0.021
  GF_init     0.024   0.042    0.08   0.017
  SDNE_align  0.031   0.17     0.1    0.06
  SDNE        0.034   0.1      0.09   0.081
  DynGEM      0.194   0.26     0.21   0.084
We predict the most likely (highest weighted) edges which are not in the observed set of edges as the hidden edges. The predictions are then compared against G t hidden to obtain the precision scores. The prediction accuracy averaged over t from 1 to T is shown in Table 6.4. We observe that DynGEM is able to predict missing edges better than the baselines on all datasets. Since Graph Factorization based approaches perform consistently worse than SDNE based approach and our DynGEM algorithm, we only present results comparing our DynGEM to SDNE based algorithms for the remaining tasks due to space constraints. 6.5.3 Stability of Embedding Methods Stability of the embeddings is crucial for tasks like anomaly detection. We evaluate the stability of our model and compare to other methods on four datasets in terms of stability constants in Table 6.5. Our model substantially outperforms other models and provides stable embeddings along with better graph reconstruction performance (see table 6.3 in section 6.5.1 for reconstruction errors at this stability). In the next section, we show that we can utilize this stability for visualization and detecting anomalies in real dynamic networks. SYN HEP-TH AS ENRON SDNE 0.18 14.715 6.25 19.722 SDNE align 0.11 8.516 2.269 18.941 DynGEM 0.008 1.469 0.125 1.279 Table 6.5: Stability constants K S (F ) of embeddings on dynamic graphs. 109 6.5.4 Visualization (a) DynGEM time step with 5 nodes jumping out of 1000(b) DynGEM time step with 300 nodes jumping out of 1000 Figure 6.2: 2D visualization of 100-dimensional embeddings for SYN dataset, when nodes change their communities over a time step. A point in any plot represents the embedding of a node in the graph, with the color indicating the node community. Small (big) points are nodes which didn't (did) change communities. Each big point is colored according to its nal community color. One important application of graph embedding is graph visualization. We carry out experiments on SYN dataset with known community structure. We apply t-SNE [143] to the embedding generated by DynGEM at each time step to plot the resulting 2D embeddings. To avoid instability of visualization over time steps, we initialize t-SNE with identical random state for all time steps. Figure 6.2 illustrates the results for 2D visualization of 100-dimensional embeddings for SYN dataset, when nodes change their communities over a single time step. The left (right) plot in each subgure shows the embedding before (after) the nodes change their communities. A point in any plot represents the embedding of a node in the graph with the color indicating the node community. Small (big) points are nodes which didn't (did) change communities. Each big point is colored according to its nal community color. We observe that the DynGEM embeddings of the nodes which changed communities, follow the changes in community structure accurately without disturbing the embeddings of other nodes, 110 Week 93 Week 101 Week 94 Figure 6.3: Results of anomaly detection on Enron and visualization of embeddings for weeks 93, 94 and 101 on Enron dataset even when the fraction of such nodes is very high. This strongly demonstrates the stability of our technique. 6.5.5 Application to Anomaly Detection Anomaly detection is an important application for detecting malicious activity in networks. We apply DynGEM on the Enron dataset to detect anomalies and compare our results with the publicly known events occurring to the company observed by [195]. 
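The visualization procedure of Section 6.5.4, t-SNE applied to the learned embeddings with an identical random state across time steps so that consecutive plots stay comparable, can be sketched as follows with scikit-learn and matplotlib; the figure styling is purely illustrative.

# Sketch: 2D t-SNE plots of per-snapshot embeddings with a fixed random state.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_snapshots(embeddings, communities, random_state=42):
    """embeddings: list of (n, d) arrays; communities: (n,) array of labels."""
    fig, axes = plt.subplots(1, len(embeddings), figsize=(4 * len(embeddings), 4))
    for ax, Y in zip(np.atleast_1d(axes), embeddings):
        Y2d = TSNE(n_components=2, random_state=random_state).fit_transform(Y)
        ax.scatter(Y2d[:, 0], Y2d[:, 1], c=communities, s=5)
    plt.show()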
111 We dene t as the change in embedding between timet andt 1: t =kF t+1 (V t )F t (V t )k F , and this quantity can be thresholded to detect anomalies. The plot of t with time on Enron dataset is shown in Figure 6.3. In the gure, we see three major spikes around week 45, 55 and 94 which correspond to Feb 2001, June 2001 and Jan 2002. These months were associated with the following events: Jerey Skilling took over as CEO in Feb 2001; Rove divested his stocks in energy in June 2001 and CEO resignation and crime investigation by FBI began in Jan 2002. We also observe some peaks leading to each of these time frames which indicate the onset of these events. Figure 6.3 shows embedding visualizations around week 94. A spread out embedding can be observed for weeks 93 and 101, corresponding to low communication among employees. On the contrary, the volume of communication grew signicantly in week 94 (shown by the highly compact embedding). 6.5.6 Eect of Layer Expansion We evaluate the eect of layer expansion on HEP-TH data set. For this purpose, we run our model DynGEM, with and without layer expansion. We observe that without layer expansion, the model achieves an average MAP of 0.46 and 0.19 for graph reconstruction and link prediction respectively. Note that this is signicantly lower than the performance of DynGEM with layer expansion which obtains 0.491 and 0.26 for the respective tasks. Also note that for SDNE and SDNE align , we select the best model at each time step. Using PropSize heuristic obviates this need and automatically selects a good neural network size for subsequent time steps. 6.5.7 Scalability We now compare the time taken to learn dierent embedding models. From Table 6.6, we observe that DynGEM is signicantly faster than SDNE align . We do not compare it with Graph Factorization based methods because although fast, they are vastly outperformed by deep autoencoder based models. Assuming n s iterations to learn a single snapshot embedding from 112 SYN HEP-TH AS ENRON SDNE align 56.6 min 71.4 min 210 min 7.69 min DynGEM 13.8 min 25.4 min 80.2 min 3.48 min Speedup 4.1 2.81 2.62 2.21 Speedup exp 4.8 3 3 3 Table 6.6: Computation time of embedding methods for the rst forty time steps on each dataset. Speedup exp is the expected speedup based on model parameters. T=5 T=10 T=20 T=40 SDNE align 6.99 min 14.24 min 26.6 min 56.6 min DynGEM 2.63 min 4.3 min 7.21 min 13.8 min Speedup 2.66 3.31 3.69 4.1 Table 6.7: Computation time of embedding methods on SYN while varying the length of time series T . scratch andn i iterations to learn embeddings when initialized with previous time step embeddings, the expected speedup for a dynamic graph of length T is dened as Tns ns+(T1)ni (ignoring other overheads). We compare the observed speedup with the expected speedup. In Table 6.7, we show that our model achieves speedup closer to the expected speedup as the number of graph snapshots increase due to diminished eect of overhead computations (e.g. saving, loading, expansion and initialization of the model, weights and the embedding). Our experiment results show that DynGEM achieves consistent 2-3X speed up across a variety of dierent networks. 6.6 Conclusion In this chapter, we propose DynGEM, a fast and ecient algorithm to construct stable embed- dings for dynamic graphs. It uses a dynamically expanding deep autoencoder to capture highly nonlinear rst-order and second-order proximities of the graph nodes. 
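The anomaly score Δt above and the stability constant K_S(F) from section 6.2 reduce to a few Frobenius norms. The numpy sketch below is our own helper and assumes the embedding (and adjacency) matrices of consecutive snapshots are row-aligned on the common node set V_t; the anomaly threshold is left to the user.

# Sketch: embedding-change anomaly score and stability constant.
import numpy as np

def embedding_change(F_t, F_next):
    """Delta_t = ||F_{t+1}(V_t) - F_t(V_t)||_F, restricted to the nodes present at time t."""
    n = F_t.shape[0]
    return float(np.linalg.norm(F_next[:n] - F_t))

def stability_constant(F_list, S_list):
    """K_S = max over pairs of time steps of the difference in relative stability."""
    rel = []
    for t in range(len(F_list) - 1):
        n = F_list[t].shape[0]
        emb = np.linalg.norm(F_list[t + 1][:n] - F_list[t]) / np.linalg.norm(F_list[t])
        adj = np.linalg.norm(S_list[t + 1][:n, :n] - S_list[t]) / np.linalg.norm(S_list[t])
        rel.append(emb / adj)
    return max(rel) - min(rel)

def detect_anomalies(embeddings, threshold):
    deltas = [embedding_change(embeddings[t], embeddings[t + 1])
              for t in range(len(embeddings) - 1)]
    return deltas, [t for t, d in enumerate(deltas) if d > threshold]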
Moreover, our model utilizes information from previous time steps to speed up the training process by incrementally learning embeddings at each time step. Our experiments demonstrate the stability of our technique across time and prove that our method maintains its competitiveness on all evaluation tasks e.g., graph reconstruction, link prediction and visualization. We showed that DynGEM preserves community structures accurately, even when a large fraction of nodes ( 30%) change communities across time 113 steps. We also applied our technique to successfully detect anomalies, which is a novel application of dynamic graph embedding. DynGEM shows great potential for many other graph inference applications such as node classication, clustering etc., which we leave as future work. There are several directions of future work. Our algorithm ensures stability by initializing from the weights learned from previous time step. We plan to extend it to incorporate the stability metric explicitly with modications ensuring satisfactory performance on anomaly detection. We also hope to provide theoretical insight into the model and obtain bounds on performance. 114 Chapter 7 Capturing Dynamics of Networks using Graph Embedding 7.1 Introduction Understanding and analyzing graphs is an essential topic that has been widely studied over the past decades. Many real-world problems can be formulated as link predictions in graphs [68, 64, 203, 77]. For example, link prediction in an author collaboration network [68] can be used to predict potential future author collaboration. Similarly, new connections between proteins can be discovered using protein interaction networks [165], and new friendships can be predicted using social networks [216]. Recent work on obtaining such predictions use graph representation learning. These methods represent each node in the network with a xed dimensional embedding and map link prediction in the network space to the nearest neighbor search in the embedding space [73]. It has been shown that such techniques can outperform traditional link prediction methods on graphs [78, 158]. Existing works on graph representation learning primarily focus on static graphs of two types: (i) aggregated, consisting of all edges until time T ; and (ii) snapshot, which comprise of edges at the current time step t. These models learn latent representations of the static graph and use them to predict missing links [3, 169, 29, 198, 78, 158, 74]. However, real networks often have complex dynamics which govern their evolution. As an illustration, consider the social network shown in Figure 7.1. In this example, user A moves from one friend to another in such a way 115 A C D t A B C D t+1 A B C D t+2 B Figure 7.1: User A breaks ties with his friend at each time step and befriends a friend of a friend. Such temporal patterns require knowledge across multiple time steps for accurate prediction. that only a friend of a friend is followed and making sure not to befriend an old friend. Methods based on static networks can only observe the network at time t + 1 and cannot ascertain if A will befriend B or D in the next time step. Instead, observing multiple snapshots can capture the network dynamics and predict A's connection to D with high certainty. In this work, we aim to capture the underlying network dynamics of evolution. Given temporal snapshots of graphs, our goal is to learn a representation of nodes at each time step while capturing the dynamics such that we can predict their future connections. 
Learning such representations is a challenging task. Firstly, the temporal patterns may exist over varying period lengths. For example, in Figure 7.1, user A may hold to each friend for a varying k length. Secondly, dierent vertices may have dierent patterns. In Figure 7.1, user A may break ties with friends whereas other users continue with their ties. Capturing such variations is extremely challenging. Existing research builds upon simplied assumptions to overcome these challenges. Methods including DynamicTriad [241], DynGEM [76] and TIMERS [240] assume that the patterns are of short duration (length 2) and only consider the previous time step graph to predict new links. Furthermore, DynGEM and TIMERS make the assumption that the changes are smooth and use a regularization to disallow rapid changes. 116 In this work, we present a model which overcomes the above challenges. dyngraph2vec uses multiple non-linear layers to learn structural patterns in each network. Furthermore, it uses recurrent layers to learn the temporal transitions in the network. The look back parameter in the recurrent layers controls the length of temporal patterns learned. We focus our experiments on the task of link prediction. We compare dyngraph2vec with the state-of-the-art algorithms for dynamic graph embedding and show its performance on several real-world networks including collaboration networks and social networks. Our experiments show that using a deep model with recurrent layers can capture temporal dynamics of the networks and signicantly outperform the state-of-the-art methods on link prediction. We emphasize that our work is targeted towards link prediction and not node classication. Overall, this chapter makes the following contributions: 1. We propose dyngraph2vec, a dynamic graph embedding model which captures temporal dynamics. 2. We demonstrate that capturing network dynamics can signicantly improve the performance on link prediction. 3. We present variations of our model to show the key advantages and dierences. 4. We publish a library, DynamicGEM 1 , implementing the variations of our model and state-of-the-art dynamic embedding approaches. 7.2 Related Work Graph representation learning techniques can be broadly divided into two categories: (i) static graph embedding, which represents each node in the graph with a single vector; and (ii) dynamic graph embedding, which considers multiple snapshots of a graph and obtains a time series of 1 https://github.com/palash1992/DynamicGEM 117 vectors for each node. Most analysis has been done on static graph embedding. Recently, however, some works have been devoted to studying dynamic graph embedding. 7.2.1 Static Graph Embedding Methods to represent nodes of a graph typically aim to preserve certain properties of the original graph in the embedding space. Based on this observation, methods can be divided into (i) distance preserving, and (ii) structure preserving. Distance preserving methods devise objective functions such that the distance between nodes in the original graph and the embedding space have similar rankings. For example, Laplacian Eigenmaps [14] minimizes the sum of the distance between the embeddings of neighboring nodes under the constraints of translational invariance, thus keeping the nodes close in the embedding space. Similarly, Graph Factorization [3] approximates the edge weight with the dot product of the nodes' embeddings, thus preserving distance in the inner product space. 
Recent methods have gone further to preserve higher-order distances. Higher Order Proximity Embedding (HOPE) [158] uses multiple higher-order functions to compute a similarity matrix from a graph's adjacency matrix and uses Singular Value Decomposition (SVD) to learn the representation. GraRep [29] considers the node transition matrix and its higher powers to construct a similarity matrix. On the other hand, structure-preserving methods aim to preserve the roles of individual nodes in the graph. node2vec [78] uses a combination of breadth-first search and depth-first search to find nodes similar to a given node in terms of distance and role. Recently, deep learning methods to learn network representations have been proposed. These methods inherently preserve higher-order graph properties including distance and structure. SDNE [213], DNGR [30] and VGAE [113] use deep autoencoders for this purpose. Some other recent approaches use graph convolutional networks to learn the inherent graph structure [112, 23, 84].

7.2.2 Dynamic Graph Embedding

Embedding dynamic graphs is an emerging topic still under investigation. Some methods have been proposed to extend static graph embedding approaches by adding regularization [244, 240]. DynGEM [75] uses the embedding learned from the previous time step graph to initialize the current time step embedding. Although it does not explicitly use regularization, such initialization implicitly keeps the new embedding close to the previous one. DynamicTriad [241] relaxes the temporal smoothness assumption but only considers patterns spanning two time steps. TIMERS [240] incrementally updates the embedding using incremental Singular Value Decomposition (SVD) and reruns SVD when the error increases above a threshold. DYLINK2VEC [174] learns embeddings of links (node pairs instead of nodes) and uses temporal functions to learn patterns over time. Link embedding, however, renders the method non-scalable for graphs with high density. Our model uses recurrent layers to learn temporal patterns over long sequences of graphs and multiple fully connected layers to capture intricate patterns at each time step.

Figure 7.2: Motivating example of network evolution - community shift. Panels: (a) DynGEM, (b) optimalSVD, (c) DynamicTriad, (d) dyngraph2vecAE, (e) dyngraph2vecRNN, (f) dyngraph2vecAERNN.

7.2.3 Dynamic Link Prediction

Several methods have been proposed for dynamic link prediction without an emphasis on graph embedding. Many of these methods use probabilistic non-parametric approaches [185, 228]. NonParam [185] uses kernel functions to model the dynamics of individual node features influenced by the features of neighbors.
Another class of algorithms uses matrix and tensor factorizations to model link dynamics [54, 141]. Further, many dynamic link prediction models have been proposed for specific applications, including recommendation systems [197] and attributed graphs [131]. These methods often make assumptions about the inherent structure of the network and require node attributes. Our model, in contrast, extends the traditional graph embedding framework to capture network dynamics.

7.3 Motivating Example

We consider a toy example to motivate the idea of capturing network dynamics. Consider an evolution of graph $G$, $\mathcal{G} = \{G_1, \ldots, G_T\}$, where $G_t$ represents the state of the graph at time $t$. The initial graph $G_1$ is generated using the Stochastic Block Model [215] with 2 communities (represented by the colors indigo and yellow in Figure 7.3), each with 500 nodes. To clearly demonstrate the changes in community structure, Figure 7.3 shows only 50 of the nodes (25 from each community) and only two of the migrating nodes. The in-block and cross-block probabilities are set to 0.1 and 0.01 respectively. The evolution pattern can be defined as a three-step process. In the first step (shown in Figure 7.3(a)), we randomly and uniformly select 10 nodes (colored red in Figure 7.3) from the yellow community. In step two (shown in Figure 7.3(b)), we randomly add 30 edges between each of the nodes selected in step one and random nodes in the indigo community. This corresponds to a connection probability higher than the cross-block probability but lower than the in-block probability. In step three (shown in Figure 7.3(c)), the community membership of the selected nodes is changed from yellow to indigo, and their edges (colored red in Figure 7.3) are either removed or added to reflect the cross-block and in-block connection probabilities. Then, for the next time step (shown in Figure 7.3(d)), the same three steps are repeated to evolve the graph. Informally, this can be interpreted as a two-step movement of users from one community to another: they first make more friends in the other community and subsequently move to it.

Figure 7.3: Motivating example of network evolution - community shift (for clarity, only 50 of the 500 nodes and 2 of the 10 migrating nodes are shown). Panels (a)-(d) show consecutive stages of the evolution.

Our task is to learn embeddings that are predictive of the change in community of the 10 selected nodes. Figure 7.2 shows the results of the state-of-the-art dynamic graph embedding techniques (DynGEM, optimalSVD, and DynamicTriad) and of the three variations of our model: dyngraph2vecAE, dyngraph2vecRNN and dyngraph2vecAERNN (see the Methodology section for descriptions of the methods). Figure 7.2 shows the embeddings of the nodes after the first step of evolution. The nodes selected for the community shift are colored in red. We show the results of 4 runs of each model to ensure robustness. Figure 7.2(a) shows that DynGEM brings the red nodes closer to the edge of the yellow community but does not move any of them to the other community. Similarly, the DynamicTriad results in Figure 7.2(c) show that it only shifts 1 to 4 nodes to their actual community in the next step. The optimalSVD method in Figure 7.2(b) is not able to shift any nodes.
However, our dyngraph2vecAE, dyngraph2vecRNN and dyngraph2vecAERNN (shown in Figure 7.2(d-f)) successfully capture the dynamics and move the embeddings of most of the 10 selected nodes to the indigo community, while keeping the rest of the nodes intact. This shows that capturing dynamics is critical in understanding the evolution of networks.

Figure 7.4: dyngraph2vec architecture variations for dynamic graph embedding: (a) Dynamic Graph to Vector Autoencoder (dyngraph2vecAE), (b) Dynamic Graph to Vector Recurrent Neural Network (dyngraph2vecRNN), and (c) Dynamic Graph to Vector Autoencoder Recurrent Neural Network (dyngraph2vecAERNN).

7.4 Methodology

In this section, we define the problem statement. We then explain multiple variations of deep learning models capable of capturing temporal patterns in dynamic graphs. Finally, we design the loss functions and the optimization approach.

7.4.1 Problem Statement

Consider a weighted graph $G(V, E)$, with $V$ and $E$ as the set of vertices and edges respectively. We denote the adjacency matrix of $G$ by $A$, i.e. for an edge $(i, j) \in E$, $A_{ij}$ denotes its weight, else $A_{ij} = 0$. An evolution of graph $G$ is denoted as $\mathcal{G} = \{G_1, \ldots, G_T\}$, where $G_t$ represents the state of the graph at time $t$.

We define our problem as follows: given an evolution of graph $G$, $\mathcal{G}$, we aim to represent each node $v$ by a series of low-dimensional vectors $y_{v_1}, \ldots, y_{v_t}$, where $y_{v_t}$ is the embedding of node $v$ at time $t$, by learning mappings $f_t: \{V_1, \ldots, V_t, E_1, \ldots, E_t\} \rightarrow \mathbb{R}^d$ with $y_{v_i} = f_i(v_1, \ldots, v_i, E_1, \ldots, E_i)$, such that $y_{v_i}$ can capture the temporal patterns required to predict $y_{v_{i+1}}$. In other words, the embedding function at each time step uses information from the graph evolution to capture network dynamics and can thus predict links with higher precision.

7.4.2 dyngraph2vec

dyngraph2vec is a deep learning model that takes as input a set of previous graphs and generates as output the graph at the next time step, thus capturing highly non-linear interactions between vertices at each time step and across multiple time steps. Since the embedding values capture the temporal evolution of the links, they allow us to predict the links of the next time step graph.

Algorithm 3: dyngraph2vec
  Function dyngraph2vec(Graphs G = {G_1, ..., G_T}, Dimension d, Look back lb):
    Generate adjacency matrices A from G
    θ ← RandomInit()
    Set 𝓕 = {A_{tu}} for each u ∈ V, for each t ∈ {1..T}
    for iter = 1 ... I do
      M ← getArchitectureInput(𝓕, lb)
      Choose L based on the architecture used
      grad ← ∂L/∂θ
      θ ← UpdateGradient(θ, grad)
    Y ← EncoderForwardPass(G, θ)
    return Y

The model learns the network embedding at time step $t$ by optimizing the following loss function:

$$\mathcal{L}_{t+l} = \lVert (\hat{A}_{t+l+1} - A_{t+l+1}) \odot \mathcal{B} \rVert_F^2 = \lVert (f(A_t, \ldots, A_{t+l}) - A_{t+l+1}) \odot \mathcal{B} \rVert_F^2. \quad (7.1)$$

Here we penalize the incorrect reconstruction of edges at time $t+l+1$ by using the embedding at time step $t+l$. Minimizing this loss function forces the parameters to be tuned such that they capture the evolving relations between nodes and can predict the edges at a future time step.
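In the spirit of the getArchitectureInput step in Algorithm 3 (whose exact behaviour is not spelled out above, so this is an assumption on our part), the sketch below assembles training pairs for Eq. 7.1: for each node, the input is its look-back window of neighbourhood vectors and the target is its neighbourhood at the next time step.

# Sketch: building (look-back window, next-step neighbourhood) training pairs.
import numpy as np

def lookback_pairs(adj_list, lookback):
    """adj_list: list of (n x n) dense adjacency matrices A_1, ..., A_T."""
    X, Y = [], []
    for t in range(len(adj_list) - lookback - 1):
        window = np.stack(adj_list[t:t + lookback + 1], axis=1)  # (n, lb+1, n) per node rows
        X.append(window)
        Y.append(adj_list[t + lookback + 1])                     # target neighbourhoods (n, n)
    return np.concatenate(X, axis=0), np.concatenate(Y, axis=0)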
The embedding at time step $t+l$ is a function of the graphs at time steps $t, t+1, \ldots, t+l$, where $l$ is the temporal lookback. We use a weighting matrix $\mathcal{B}$ to weight the reconstruction of observed edges higher than that of unobserved links, as traditionally done in the literature [213]. Here, $\mathcal{B}_{ij} = \beta$ for $(i, j) \in E_{t+l+1}$, and $\mathcal{B}_{ij} = 1$ otherwise, where $\beta$ is a hyperparameter controlling the penalty on observed edges. The operator $\odot$ denotes the element-wise product.

We propose three variations of our model based on the architecture of the underlying deep learning models, as shown in Figure 7.4: (i) dyngraph2vecAE, (ii) dyngraph2vecRNN, and (iii) dyngraph2vecAERNN. The three methods differ in the formulation of the function $f(\cdot)$.

A simple way to extend the autoencoders traditionally used to embed static graphs [213] to temporal graphs is to add the information about the previous $l$ graphs as input to the autoencoder. This model (dyngraph2vecAE) thus uses multiple fully connected layers to model the interconnection of nodes within and across time steps. Concretely, for a node $u$ with neighborhood vector set $u_{1..t} = [a_{u_t}, \ldots, a_{u_{t+l}}]$, the hidden representation of the first layer is learned as:

$y_{u_t}^{(1)} = f_a(W_{AE}^{(1)} u_{1..t} + b^{(1)}), \quad (7.2)$

where $f_a$ is the activation function, $W_{AE}^{(1)} \in \mathbb{R}^{d^{(1)} \times nl}$, and $d^{(1)}$, $n$ and $l$ are the dimension of the representation learned by the first layer, the number of nodes in the graph, and the lookback, respectively. The representation of the $k$-th layer is defined as:

$y_{u_t}^{(k)} = f_a(W_{AE}^{(k)} y_{u_t}^{(k-1)} + b^{(k)}). \quad (7.3)$

Note that dyngraph2vecAE has $O(nld^{(1)})$ parameters. As most real-world graphs are sparse, learning these parameters can be challenging.

To reduce the number of model parameters and achieve more efficient temporal learning, we propose dyngraph2vecRNN and dyngraph2vecAERNN. In dyngraph2vecRNN we use sparsely connected Long Short Term Memory (LSTM) networks to learn the embedding. LSTM is a type of Recurrent Neural Network (RNN) capable of handling long-term dependencies. In dynamic graphs, there can be long-term dependencies which may not be captured by fully connected autoencoders. The hidden state representation of a single LSTM network is defined as:

$y_{u_t}^{(1)} = o_{u_t}^{(1)} \odot \tanh(C_{u_t}^{(1)}) \quad (7.4a)$
$o_{u_t}^{(1)} = \sigma(W_{RNN}^{(1)} [y_{u_{t-1}}^{(1)}, u_{1..t}] + b_o^{(1)}) \quad (7.4b)$
$C_{u_t}^{(1)} = f_{u_t}^{(1)} \odot C_{u_{t-1}}^{(1)} + i_{u_t}^{(1)} \odot \tilde{C}_{u_t}^{(1)} \quad (7.4c)$
$\tilde{C}_{u_t}^{(1)} = \tanh(W_C^{(1)} [y_{u_{t-1}}^{(1)}, u_{1..t}] + b_c^{(1)}) \quad (7.4d)$
$i_{u_t}^{(1)} = \sigma(W_i^{(1)} [y_{u_{t-1}}^{(1)}, u_{1..t}] + b_i^{(1)}) \quad (7.4e)$
$f_{u_t}^{(1)} = \sigma(W_f^{(1)} [y_{u_{t-1}}^{(1)}, u_{1..t}] + b_f^{(1)}) \quad (7.4f)$

where $C_{u_t}$ represents the cell state of the LSTM, $f_{u_t}$ is the value of the forget gate, $o_{u_t}$ is the value of the output gate, $i_{u_t}$ is the value of the update gate, $\tilde{C}_{u_t}$ is the newly estimated candidate state, and the $b$ terms are the biases. There can be $l$ LSTM networks connected in the first layer, where the cell states and hidden representations are passed in a chain from the $(t-l)$-th to the $t$-th LSTM network. The representation of the $k$-th layer is then given as follows:

$y_{u_t}^{(k)} = o_{u_t}^{(k)} \odot \tanh(C_{u_t}^{(k)}) \quad (7.5a)$
$o_{u_t}^{(k)} = \sigma(W_{RNN}^{(k)} [y_{u_{t-1}}^{(k)}, y_{u_t}^{(k-1)}] + b_o^{(k)}) \quad (7.5b)$

The problem with passing the sparse neighbourhood vector $u_{1..t} = [a_{u_t}, \ldots, a_{u_{t+l}}]$ of node $u$ to the LSTM network is that the number of LSTM model parameters (such as the number of memory cells, input units, output units, etc.) needed to learn a low-dimensional representation becomes large.
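To make the autoencoder variant concrete, below is a minimal Keras sketch of dyngraph2vecAE together with the weighted loss of Eq. (7.1). The layer sizes, the choice of Keras, and the way $\beta$ is folded into the loss are illustrative assumptions, not the released implementation.

```python
import tensorflow as tf
from tensorflow import keras

n, lookback, d = 1000, 3, 128     # nodes, temporal lookback l, embedding size
beta = 5.0                        # penalty on observed edges (hyperparameter)
window = lookback + 1             # number of stacked adjacency rows per node (assumption)

# Encoder: the concatenated neighborhood vectors of a node over the lookback
# window are mapped to a d-dimensional embedding.
inputs = keras.Input(shape=(n * window,))
h = keras.layers.Dense(512, activation="relu")(inputs)
h = keras.layers.Dense(256, activation="relu")(h)
embedding = keras.layers.Dense(d, activation="relu", name="embedding")(h)

# Decoder: reconstruct the node's row of A_{t+l+1}.
h = keras.layers.Dense(256, activation="relu")(embedding)
h = keras.layers.Dense(512, activation="relu")(h)
outputs = keras.layers.Dense(n, activation="sigmoid")(h)

model = keras.Model(inputs, outputs)

def weighted_reconstruction_loss(a_true, a_pred):
    """Eq. (7.1): weight observed edges by beta, unobserved entries by 1."""
    b = tf.where(a_true > 0, beta, 1.0)
    return tf.reduce_sum(tf.square((a_pred - a_true) * b), axis=-1)

model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss=weighted_reconstruction_loss)
```

In this sketch, the input for node $u$ is the flattened stack of its adjacency rows over the lookback window, and the training target is its row of $A_{t+l+1}$.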
Rather, the LSTM network may be able to better learn the temporal representation if the sparse neighbourhood vector is first reduced to a low-dimensional representation. To achieve this, we propose a variation of the dyngraph2vec model called dyngraph2vecAERNN. In dyngraph2vecAERNN, instead of passing the sparse neighbourhood vector, we use a fully connected encoder to first obtain a low-dimensional hidden representation:

$y_{u_t}^{(p)} = f_a(W_{AERNN}^{(p)} y_{u_t}^{(p-1)} + b^{(p)}), \quad (7.6)$

where $p$ denotes the output layer of the fully connected encoder. This representation is then passed to the LSTM networks:

$y_{u_t}^{(p+1)} = o_{u_t}^{(p+1)} \odot \tanh(C_{u_t}^{(p+1)}) \quad (7.7a)$
$o_{u_t}^{(p+1)} = \sigma(W_{AERNN}^{(p+1)} [y_{u_{t-1}}^{(p+1)}, y_{u_t}^{(p)}] + b_o^{(p+1)}) \quad (7.7b)$

The hidden representation generated by the LSTM networks is then passed to a fully connected decoder.

7.4.3 Optimization

We optimize the loss function defined above to obtain the optimal model parameters. Taking the gradient of Equation 7.1 with respect to the decoder weights yields:

$\frac{\partial L_t}{\partial W^{(K)}} = \left[2(\hat{A}_{t+1} - A_{t+1}) \odot \mathcal{B}\right] \left[\frac{\partial f_a(Y^{(K-1)} W^{(K)} + b^{(K)})}{\partial W^{(K)}}\right],$

where $W^{(K)}$ is the weight matrix of the penultimate layer for all three models. For each individual model, we back-propagate the gradients through the neural units to obtain the derivatives for all previous layers. For the LSTM-based dyngraph2vec models, back-propagation through time is performed to update the weights of the LSTM networks. After obtaining the derivatives, we optimize the model using stochastic gradient descent (SGD) [179] with Adaptive Moment Estimation (Adam) [110]. The algorithm is specified in Algorithm 3.

7.5 Experiments

In this section, we describe the datasets used and establish the baselines for comparison. Furthermore, we define the evaluation metrics for our experiments and the parameter settings. All experiments were performed on a 64-bit Ubuntu 16.04.1 LTS system with an Intel(R) Core(TM) i9-7900X CPU (19 processors, 10 CPU cores, 3.30 GHz clock frequency), 64 GB RAM, and two Nvidia Titan X GPUs with 12 GB memory each.

Table 7.1: Dataset Statistics

Name            SBM      Hep-th       AS
Nodes n         1000     150-14446    7716
Edges m         56016    268-48274    487-26467
Time steps T    10       136          733

7.5.1 Datasets

We conduct experiments on two real-world datasets and a synthetic dataset to evaluate our proposed algorithm. We assume that the proposed models are aware of all the nodes and that no new nodes are introduced in subsequent time steps; rather, the links between the existing nodes change with a certain temporal pattern. The datasets are summarized in Table 7.1.

Stochastic Block Model (SBM) - community diminishing: In order to test the performance of the various static and dynamic graph embedding algorithms, we generated synthetic SBM data with two communities and a total of 1000 nodes. The cross-block connectivity probability is 0.01 and the in-block connectivity probability is set to 0.1. One of the communities is continuously diminished by migrating 10-20 nodes to the other community at each step. A total of 10 dynamic graphs are generated for the evaluation. Since SBM is a synthetic dataset, its time steps carry no real-world temporal resolution.

Hep-th [68]: The first real-world dataset used to test the dynamic graph embedding algorithms is the collaboration graph of authors in the High Energy Physics Theory conference. The original dataset contains abstracts of papers in the High Energy Physics Theory conference from January 1993 to April 2003. Hence, the resolution of a time step is one month.
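For illustration, a corresponding dyngraph2vecAERNN sketch chains a per-time-step dense encoder (Eq. 7.6) into LSTM layers (Eqs. 7.7a-b) and a dense decoder, reusing the weighted loss from the previous sketch. Layer sizes and the use of Keras are again assumptions rather than the thesis implementation.

```python
from tensorflow import keras

n, lookback, d = 1000, 3, 128

# Input: the node's adjacency rows at times t, ..., t+l (one row per step).
inputs = keras.Input(shape=(lookback + 1, n))

# Fully connected encoder applied at every time step, reducing each sparse
# neighborhood vector before the recurrent layers (Eq. 7.6).
h = keras.layers.TimeDistributed(keras.layers.Dense(512, activation="relu"))(inputs)
h = keras.layers.TimeDistributed(keras.layers.Dense(256, activation="relu"))(h)

# LSTM layers consume the low-dimensional sequence (Eqs. 7.7a-b); the last
# hidden state serves as the node embedding at time t+l.
h = keras.layers.LSTM(256, return_sequences=True)(h)
embedding = keras.layers.LSTM(d, name="embedding")(h)

# Fully connected decoder predicts the node's row of A_{t+l+1}.
h = keras.layers.Dense(256, activation="relu")(embedding)
outputs = keras.layers.Dense(n, activation="sigmoid")(h)

aernn = keras.Model(inputs, outputs)
aernn.compile(optimizer="adam", loss=weighted_reconstruction_loss)
```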
For our evaluation, we consider the last 50 snapshots of this dataset, from which 2000 nodes are sampled for training and testing the proposed models.

Autonomous Systems (AS) [127]: The second real-world dataset is a communication network of who-talks-to-whom built from BGP (Border Gateway Protocol) logs. The dataset contains 733 instances spanning from November 8, 1997 to January 2, 2000. Hence, the resolution of a time step for the AS dataset is one month as well. For our evaluation, we consider a subset of this dataset consisting of the last 50 snapshots, from which 2000 nodes are sampled for training and testing the proposed models.

7.5.2 Baselines

We compare our model with the following state-of-the-art static and dynamic graph embedding methods:

Optimal Singular Value Decomposition (OptimalSVD) [159]: It uses the singular value decomposition of the adjacency matrix or a variation of it (i.e., the transition matrix) to represent the individual nodes in the graph. The low-rank SVD decomposition with the largest $d$ singular values is then used for graph structure matching, clustering, etc.

Incremental Singular Value Decomposition (IncSVD) [21]: It utilizes a perturbation matrix which captures the changing dynamics of the graphs and performs an additive modification of the SVD.

Rerun Singular Value Decomposition (RerunSVD or TIMERS) [240]: It utilizes the incremental SVD to obtain the dynamic graph embedding, but also uses a tolerance threshold to restart the optimal SVD calculation when the incremental graph embedding starts to deviate.

Dynamic Embedding using Dynamic Triad Closure Process (dynamicTriad) [241]: It utilizes the triadic closure process to generate a graph embedding that preserves structural and evolution patterns of the graph.

Deep Embedding Method for Dynamic Graphs (dynGEM) [76]: It utilizes deep autoencoders to incrementally generate the embedding of a dynamic graph at snapshot $t$ using only the snapshot at time $t-1$.

7.5.3 Evaluation Metrics

In our experiments, we evaluate our model on link prediction at time step $t+1$ using all graphs up to time step $t$. We use Mean Average Precision (MAP) as our metric. precision@$k$ is the fraction of correct predictions in the top $k$ predictions, defined as $P@k = \frac{|E_{pred}(1{:}k) \cap E_{gt}|}{k}$, where $E_{pred}$ and $E_{gt}$ are the predicted and ground truth edges, respectively. MAP averages the precision over all nodes. It can be written as

$MAP = \frac{\sum_i AP(i)}{|V|}$, where $AP(i) = \frac{\sum_k precision@k(i) \cdot \mathbb{I}\{E_{pred_i}(k) \in E_{gt_i}\}}{|\{k : E_{pred_i}(k) \in E_{gt_i}\}|}$ and $precision@k(i) = \frac{|E_{pred_i}(1{:}k) \cap E_{gt_i}|}{k}$.

$P@k$ values test the top predictions made by the model. MAP values are more robust, as they average the predictions over all nodes; high MAP values imply that the model makes good predictions for most nodes.

7.6 Results and Analysis

In this section, we present the link prediction performance of the various models on the different datasets. We train a model on the graphs from time step $t$ to $t+l$, where $l$ is the lookback of the model, and predict the links of the graph at time step $t+l+1$. The lookback $l$ is a model hyperparameter. For an evolving graph with $T$ steps, we perform the above prediction from $T/2$ to $T$ and report the average MAP of link prediction. Furthermore, we also present the performance of the models when graph sequences of increasing length are provided as training data. Unless explicitly mentioned, for the models containing recurrent neural networks a lookback value of 3 is used for training and testing.
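The metrics above can be computed directly from ranked edge predictions; the sketch below is a plain numpy illustration following the formulas just given, with hypothetical variable names rather than the evaluation harness used in this thesis.

```python
import numpy as np

def precision_at_k(pred_edges, true_edges, k):
    """Fraction of the top-k predicted edges that are ground-truth edges."""
    return len(set(pred_edges[:k]) & set(true_edges)) / k

def mean_average_precision(predictions, ground_truth):
    """predictions: node -> list of edges sorted by predicted score (descending).
    ground_truth: node -> set of true edges at the next time step."""
    ap_values = []
    for node, pred_edges in predictions.items():
        true_edges = ground_truth.get(node, set())
        hits, precisions = 0, []
        for k, edge in enumerate(pred_edges, start=1):
            if edge in true_edges:
                hits += 1
                precisions.append(hits / k)   # precision@k at each correct hit
        ap_values.append(np.mean(precisions) if precisions else 0.0)
    return float(np.mean(ap_values))
```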
Figure 7.5: MAP values for the SBM dataset.

7.6.1 SBM Dataset

The MAP values for the various algorithms on the SBM dataset with a diminishing community are shown in Figure 7.5. The MAP values shown are for link prediction with embedding sizes 64, 128 and 256. The figure shows that our methods dyngraph2vecAE, dyngraph2vecRNN and dyngraph2vecAERNN all have higher MAP values than the rest of the baselines, except for dynGEM. The dynGEM algorithm achieves higher MAP values than all the other algorithms. This is due to the fact that dynGEM also generates the embedding of the graph at snapshot $t+1$ using the graph at snapshot $t$; since in our SBM dataset the node-migration pattern is introduced only one time step earlier, the dynGEM node embedding technique is able to capture these dynamics. However, the proposed dyngraph2vec methods also achieve average MAP values within approximately 1.5% of the MAP values achieved by dynGEM. Notice that the MAP values of the SVD based methods increase as the embedding size increases; this is not the case for dynTriad.

Figure 7.6: MAP values for the Hep-th dataset.

7.6.2 Hep-th Dataset

The link prediction results for the Hep-th dataset are shown in Figure 7.6. The proposed dyngraph2vec algorithms outperform all the other state-of-the-art static and dynamic algorithms. Among the proposed algorithms, dyngraph2vecAERNN has the highest MAP values, followed by dyngraph2vecRNN and dyngraph2vecAE, respectively. dynamicTriad performs better than the SVD based algorithms. Notice that dynGEM does not achieve higher MAP values than the dyngraph2vec algorithms on the Hep-th dataset. Since dyngraph2vec utilizes not only time step $t-1$ but the previous $l$ time steps $t-l, \ldots, t-1$ to predict the links at time step $t$, it achieves higher performance than the other state-of-the-art algorithms.

7.6.3 AS Dataset

The MAP values for link prediction with the various algorithms on the AS dataset are shown in Figure 7.7. dyngraph2vecAERNN outperforms all the state-of-the-art algorithms.

Figure 7.7: MAP values for the AS dataset.

The algorithm with the second highest MAP score is dyngraph2vecRNN. However, dyngraph2vecAE has a higher MAP only at the lower embedding size of 64. The SVD methods are able to improve their MAP values by increasing the embedding size, but they are not able to outperform the dyngraph2vec algorithms.

7.6.4 MAP exploration

The summary of MAP values for the different embedding sizes (64, 128 and 256) on the different datasets is presented in Table 7.2. The top three highest MAP values are highlighted in bold. For the synthetic SBM dataset, the top three algorithms with the highest MAP values are dynGEM, dyngraph2vecAERNN, and dyngraph2vecRNN, respectively. Since the change pattern for the SBM dataset is introduced only at time step $t-1$, dynGEM is able to better predict the links. The model architectures of dynGEM and dyngraph2vecAE differ only in what data are fed to train the model: in dyngraph2vecAE, we essentially feed more data depending on the size of the lookback. The lookback size increases the model complexity.
Since the SBM dataset does not have temporal patterns evolving over more than one time step, the dyngraph2vec models only achieve results comparable to, but not better than, dynGEM.

Table 7.2: Average MAP values over different embedding sizes.

Method                       SBM      Hep-th          AS
IncrementalSVD               0.4421   0.2518          0.1452
rerunSVD                     0.5474   0.2541          0.1607
optimalSVD                   0.5831   0.2419          0.1152
dynamicTriad                 0.1509   0.3606          0.0677
dynGEM                       0.9648   0.2587          0.0975
dyngraph2vecAE (lb=3)        0.9500   0.3951          0.1825
dyngraph2vecAE (lb=5)        -        0.512           0.2800
dyngraph2vecRNN (lb=3)       0.9567   0.5451          0.2350
dyngraph2vecRNN              -        0.7290 (lb=8)   0.313 (lb=10)
dyngraph2vecAERNN (lb=3)     0.9581   0.5952          0.3274
dyngraph2vecAERNN            -        0.739 (lb=8)    0.3801 (lb=10)
lb = lookback value

For the Hep-th dataset, the top three algorithms with the highest MAP values are dyngraph2vecAERNN, dyngraph2vecRNN, and dyngraph2vecAE, respectively. In fact, compared to the state-of-the-art algorithm dynamicTriad, the proposed models dyngraph2vecAERNN (with lookback=8), dyngraph2vecRNN (with lookback=8), and dyngraph2vecAE (with lookback=5) obtain approximately 105%, 102%, and 42% higher average MAP values, respectively. For the AS dataset, the top three algorithms with the highest MAP values are again dyngraph2vecAERNN, dyngraph2vecRNN, and dyngraph2vecAE, respectively. Compared to the state-of-the-art rerunSVD algorithm, the proposed models dyngraph2vecAERNN (with lookback=10), dyngraph2vecRNN (with lookback=10), and dyngraph2vecAE (with lookback=5) obtain approximately 137%, 95%, and 74% higher average MAP values, respectively. These results show that the dyngraph2vec variants are able to capture graph dynamics much better than most of the state-of-the-art algorithms.

7.6.5 Hyper-parameter Sensitivity: Lookback

One of the important parameters for time-series analysis is how far into the past the method looks to predict the future. To analyze the effect of the lookback on the MAP score, we trained the dyngraph2vec algorithms with various lookback values. The embedding dimension is fixed to 128 and the lookback size is varied from 1 to 10. We then tested the change in MAP values on the real-world datasets AS and Hep-th.

Figure 7.8: Mean MAP values for various lookback values on the Hep-th dataset.

The performance of the dyngraph2vec algorithms with various lookback values on the Hep-th dataset is presented in Figure 7.8. Increasing lookback values consistently increase the average MAP values. Interestingly, although dyngraph2vecAE improves in performance up to a lookback size of 8, its performance decreases for a lookback value of 10. Since it does not have memory units to store the temporal patterns like the recurrent variations, it relies solely on fully connected dense layers to encode the patterns, which appears rather ineffective compared to dyngraph2vecRNN and dyngraph2vecAERNN on the Hep-th dataset. The highest MAP value, 0.739, is achieved by dyngraph2vecAERNN with a lookback size of 8.

Figure 7.9: Mean MAP values for various lookback values on the AS dataset.

Similarly, the performance of the proposed models with varying lookback size on the AS dataset is presented in Figure 7.9. The average MAP values also increase with increasing lookback size on the AS dataset.
The highest MAP value of 0.3801 is again achieved by dyngraph2vecAERNN with a lookback size of 10. The dyngraph2vecAE model initially has MAP values comparable to, and sometimes even higher than, dyngraph2vecRNN. However, for a lookback size of 10, dyngraph2vecRNN outperforms the dyngraph2vecAE model, which consists of only fully connected neural networks. In fact, the MAP value of dyngraph2vecAE does not increase after a lookback size of 5.

7.6.6 Length of training sequence versus MAP value

In this section, we present the impact of the length of the graph sequence supplied to the models during training on their performance. To conduct this experiment, the graph sequence provided as training data is increased one step at a time. Hence, we use graph sequences of length 1 to $t \in [T, T+1, T+2, T+3, \ldots, T+n]$ to predict the links of the graph at time step $t \in [T+1, T+2, \ldots, T+n+1]$, where $T \geq$ lookback. The experiment is performed on the Hep-th and AS datasets with a fixed lookback size of 8. The total data sequence has 50 snapshots and is split into 25 for training and 25 for testing; hence, in the experiment the training data grows from a total of 25 graph sequences to 49 graph sequences. The results in Figures 7.10 and 7.11 show the average MAP values for predicting the links of the graphs from the 26th to the 50th time step, where each time step represents a month.

Figure 7.10: MAP value with an increasing number of temporal graphs added to the training data for the Hep-th dataset (lookback = 8).

The result of increasing the number of graph sequences in the training data for the Hep-th dataset is shown in Figure 7.10. For both the RNN and AERNN variants, the increasing number of graph sequences in the data does not drastically increase the MAP value. For dyngraph2vecAE, there is a slight increase in the MAP value toward the end.

Figure 7.11: MAP value with an increasing number of temporal graphs added to the training data for the AS dataset (lookback = 8).

On the other hand, increasing the number of graph sequences for the AS dataset during training gives a gradual improvement in link prediction performance in the testing phase. However, the models eventually start converging after seeing 80% (a total of 40 graph sequences) of the sequence data.

7.7 Discussion

Model Variation: Among the different proposed models, the recurrent variations are capable of achieving higher average MAP values. These architectures are efficient in learning short- and long-term temporal patterns and provide an edge in learning the temporal evolution of graphs compared to fully connected neural networks without recurrent units.

Dataset: We observe that, depending on the dataset, the same model architecture provides different performance. Due to the nature of the data, a dataset may have different temporal patterns: periodic, semi-periodic, stationary, etc. Hence, to capture all these patterns, we found that the models have to be tuned specifically to the dataset.

Sampling: One of the weaknesses of the proposed algorithms is that the model size (in terms of the number of weights to be trained) grows with the number of nodes considered during the training phase. To overcome this, the nodes have been sampled.
Currently, we utilize uniform sampling of the nodes to mitigate this issue. However, we believe that a better sampling scheme that is aware of the graph properties may further improve performance.

Large Lookbacks: While it is desirable to test large lookback values for learning the temporal evolution, with the current hardware resources we constantly ran into resource-exhausted errors for lookbacks greater than 10.

7.8 Future Work

Other Datasets: We have validated our algorithms on a synthetic dynamic SBM and two real-world datasets, Hep-th and AS. We leave the evaluation on further datasets as future work.

Hyper-parameters: Currently, we evaluated the proposed algorithms with embedding sizes of 64, 128 and 256. We leave the exhaustive evaluation of the proposed algorithms over broader ranges of embedding sizes and lookback sizes for future work.

Evaluation: We have demonstrated the effectiveness of the proposed algorithms for predicting the links of the next time step. However, on dynamic graph networks there are other evaluation tasks, such as node classification, that can be performed. We leave them as future work.

7.9 Conclusion

This chapter introduced dyngraph2vec, a model for capturing temporal patterns in dynamic networks. It learns the evolution patterns of individual nodes and provides an embedding capable of predicting future links with higher precision. We propose three variations of our model based on architectures with varying capabilities. The experiments show that our model can capture temporal patterns on synthetic and real datasets and outperform state-of-the-art methods in link prediction. There are several directions for future work: (1) interpretability, by extending the model to provide more insight into network dynamics and better understand temporal dynamics; (2) automatic hyperparameter optimization for higher accuracy; and (3) graph convolutions to learn from node attributes and reduce the number of parameters.

Chapter 8

Graph Embedding for Optimal Team Composition

8.1 Introduction

Cooperation is a common mechanism present in real-world systems at various scales and in different environments, from the biological organization of organisms to human society. A great amount of research has been devoted to studying the effects of cooperation on human behavior and performance [13, 48, 104, 201, 130]. These works span domains from cognitive learning to psychology, and cover different experimental settings (e.g., classrooms, competitive sport environments, and games) in which people were encouraged to organize and fulfill certain tasks [12, 42, 105, 38]. They provide numerous insights into the positive effect that cooperation has on individual and group performance.

Many online games are examples of modern-day systems that revolve around cooperative behavior [137, 96]. Games allow players to connect from all over the world, establish social relationships with teammates [53, 208], and coordinate to reach a common goal, while at the same time competing with the aim of improving their performance as individuals [148]. Due to their recent growth in popularity, online games have become a great instrument for experimental research: they provide rich environments yielding plenty of contextual and temporal features related to players' behaviors, as well as social connections derived from the organization of the game into teams.
In this work, we focus on the analysis of a particular type of online game whose setting pushes players to collaborate to enhance their performance both as individuals and as teams: Multiplayer Online Battle Arena (MOBA) games. MOBA games, such as League of Legends (LoL), Defense of the Ancients 2 (Dota 2), Heroes of the Storm, and Paragon, are examples of match-based games in which two teams of players have to cooperate to defeat the opposing team by destroying its base/headquarters. MOBA players impersonate a specific character in the battle (a.k.a. a hero), which has special abilities and powers based on its role, e.g., supporting roles, action roles, etc. The cooperation of teammates in MOBA games is essential to achieve the shared goal, as shown by prior studies [227, 52]. Thus, teammates might strongly influence individual players' behaviors over time.

Previous research has investigated the factors influencing human performance in MOBA games. On the one hand, studies focus on identifying players' choices of role, strategies, and spatio-temporal behaviors [52, 227, 57, 184] that drive players to success [183, 63]. On the other hand, performance may be affected by players' social interactions: the presence of friends [163, 172, 184], the frequency of playing with or against certain players [137], etc. Despite these efforts to quantify performance in the presence of social connections, little attention has been devoted to the effect that teammates have on increasing or decreasing the actual player's skill level. Our study aims to fill this gap. We hypothesize that some teammates might indeed be beneficial not only to improving the strategies and actions performed, but also to improving the overall skill of a player. On the contrary, some teammates might have a negative effect on a player's skill level; e.g., they might not be collaborative and might tend to obstruct the overall group actions, eventually hindering the player's skill acquisition and development.

Our aim is to study the interplay between a player's performance improvement (resp., decline) throughout matches in the presence of beneficial (resp., disadvantageous) teammates. To this aim, we build a directed co-play network, whose links exist if two players played in the same team and are weighted on the basis of the player's skill level increase/decline. This type of network only takes into account the short-term influence of teammates, i.e., the influence in the matches they play together. Moreover, we devise another formulation of this weighted network to take into account possible long-term effects on a player's performance. This network incorporates the concept of "memory", i.e., the teammate's influence on a player persists over time, capturing the temporal dynamics of skill transfer. We use these co-play networks in two ways. First, we quantify the structural properties of players' connections related to skill performance. Second, we build a teammate recommendation system, based on a modified deep neural network autoencoder, that is able to predict a player's most influential teammates. We show through our experiments that our teammate autoencoder model is effective in capturing the structure of the co-play networks. Our evaluation demonstrates that the model significantly outperforms baselines on the tasks of (i) predicting the player's skill gain, and (ii) recommending teammates to players. Our predictions for the former result in a 9.00% and 9.15% improvement over reporting the average skill increase/decline, for short- and long-term teammate influence respectively.
For individual teammate recommendation, the model achieves an even more significant gain of 19.50% and 19.29%, for short- and long-term teammate influence respectively. Furthermore, we show that a factorization based model only marginally improves over the average baseline, showcasing the necessity of deep neural network based models for this task.

8.2 Data Collection and Preprocessing

Dota 2. Defense of the Ancients 2 (Dota 2) is a well-known MOBA game developed and published by Valve Corporation. First released in July 2013, Dota 2 rapidly became one of the most played games on the Steam platform, accounting for millions of active players. We have access to a dataset of one full year of Dota 2 matches played in 2015. The dataset, acquired via OpenDota [45], consists of 3,300,146 matches for a total of 1,805,225 players. For each match, we also have access to the match metadata, including winning status, start time, and duration, as well as to the players' performance statistics, e.g., number of kills, number of assists, number of deaths, etc.

As in most MOBA games, Dota 2 matches are divided into different categories (lobby types) depending on the game mode selected by the players. For example, players can train in the "Tutorial" lobby or start a match with AI-controlled players in the "Co-op with AI" lobby. However, most players prefer to play with other human players rather than with AIs. Players can decide whether the teams they form and play against shall be balanced by the players' skill levels or not, respectively in the "Ranked matchmaking" lobby and the "Public matchmaking" lobby. For Ranked matches, Dota 2 implements a matchmaking system to form balanced opposing teams. The matchmaking system tracks each player's performance throughout her/his entire career, attributing a skill level that increases after each victory and decreases after each defeat. For the purpose of our work, we take into account only the Ranked and Public lobby types, in order to consider exclusively matches in which 10 human players are involved.

Preprocessing. We preprocess the dataset in two steps. First, we select matches whose information is complete. To this aim, we filter out matches that ended early due to connection errors or players that quit at the beginning. These matches can be easily identified through the winner status (equal to a null value if a connection error occurred) and the leaver status (players that quit the game before the end have leaver status equal to 0). As we can observe in Fig. 8.1, the number of matches per player has a broad distribution, with minimum and maximum values of 1 and 1,390 matches respectively. We note that many players are characterized by a low number of matches, either because they were new to the game at the time of data collection, or because they quit the game entirely after a limited number of matches.

Figure 8.1: Distribution of the number of matches per player in the Dota 2 dataset.

In this work we are interested in assessing a teammate's influence on the skill of a player. As described in the following section, we define the skill score of a player by computing his/her TrueSkill [86]. However, the average number of matches per player needed to identify the TrueSkill score in a game setting such as that of Dota 2 is 46 (https://www.microsoft.com/en-us/research/project/trueskill-ranking-system/).
For the scope of this analysis, we then apply a second preprocessing step: we select all the players having at least 46 played matches. These two filtering steps yielded a final dataset including 87,155 experienced players.

8.3 Skill Inference

Dota 2 has an internal matchmaking ranking (MMR), which is used to track each player's level and, for those game modes requiring it, match together balanced teams. This is done with the main purpose of giving a similar chance of winning to both teams. The MMR score depends both on the actual outcome of the matches (win/lose) and on the skill level of the players involved in the match (both teammates and opponents). Moreover, its standard deviation provides a level of uncertainty for each player's skill, with the uncertainty decreasing as the number of the player's matches increases. A player's skill is a fundamental feature that describes the player's overall performance and can thus provide a way to evaluate how each player learns and evolves over time. Although each player has access to his/her MMR, and rankings of master players are available online, the official Dota 2 API does not disclose the MMR level of players at any time of any performed match.

Given that players' MMR levels are not available in any Dota 2 dataset (including ours), we need to reconstruct a proxy of the MMR. We overcome this issue by computing a similar skill score over the available matches: the TrueSkill [86]. The TrueSkill ranking system was designed by Microsoft Research for Xbox Live and can be considered a Bayesian extension of the well-known Elo rating system used in chess [58]. TrueSkill was specifically developed to compute the level of players in online games that involve more than two players in a single match, such as MOBA games. Another advantage of using this ranking system is its similarity to the Dota 2 MMR. Like the MMR, the TrueSkill of a player is represented by two main features: the average skill of the player and the level of uncertainty in the player's skill. Here, we keep track of the TrueSkill levels of the players in our dataset after every match they play. To this aim, we compute the TrueSkill using its open-source implementation in Python (https://pypi.python.org/pypi/trueskill). We first assign each player a starting TrueSkill set to the default values of the Python library: $\mu = 25$ and $\sigma = 25/3$. Then, we update the TrueSkill of players on the basis of their matches' outcomes and teammates' levels. The resulting timelines of scores will be used in the following to compute the link weights of the co-play network.
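As a concrete illustration of this update step, the following is a minimal sketch using the open-source trueskill package mentioned above; the team assignments, the helper function, and the ranks argument are illustrative assumptions about how a single Dota 2 match would be scored.

```python
import trueskill

# Default environment: mu = 25, sigma = 25/3; no draws in Dota 2 (assumption).
env = trueskill.TrueSkill(draw_probability=0.0)

# One rating object per player, kept up to date across their match history.
ratings = {player_id: env.create_rating() for player_id in range(10)}

def update_after_match(radiant_ids, dire_ids, radiant_won):
    """Update the TrueSkill of all 10 players after a single match."""
    radiant = {pid: ratings[pid] for pid in radiant_ids}
    dire = {pid: ratings[pid] for pid in dire_ids}
    ranks = [0, 1] if radiant_won else [1, 0]   # lower rank = winning team
    new_radiant, new_dire = env.rate([radiant, dire], ranks=ranks)
    ratings.update(new_radiant)
    ratings.update(new_dire)

# Example: players 0-4 (Radiant) beat players 5-9 (Dire).
update_after_match(range(0, 5), range(5, 10), radiant_won=True)
```

Recording each player's rating after every match yields the per-player TrueSkill timelines used below.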
For illustrative purposes, Fig. 8.2 reports three aggregate TrueSkill timelines for three groups of players: (i) the 10th percentile (bottom decile), (ii) the 90th percentile (top decile), and (iii) the median decile (45th-55th percentile).

Figure 8.2: TrueSkill timelines of players in the top, bottom, and median decile. Lines show the mean of TrueSkill values at each match index, while shades indicate the related standard deviations.

The red line shows the evolution of the average TrueSkill scores of the 10% top-ranked players in Dota 2 (at the time of our data collection); the blue line tracks the evolution of the 10% of players reaching the lowest TrueSkill scores; and the green line shows the TrueSkill progress of the "average players". The confidence bands (standard deviations) shrink with the increasing number of matches, showing how the TrueSkill converges with increasing observations of the players' performance. (Note that the timelines have different lengths due to the varying number of matches played by players in each of the three deciles; in particular, in the bottom decile just one player has more than 600 matches.) The variance is larger for high TrueSkill scores: maintaining a high rank in Dota 2 becomes increasingly more difficult, since the game is designed to constantly pair players with opponents at their same skill levels, and the competition in "Ranked matches" thus becomes increasingly harsher. The resulting score timelines will be used next to compute the link weights of the co-play network. Note that, although we selected only players with at least 46 matches, we observed timelines spanning terminal TrueSkill scores between 12 and 55. This suggests that experience alone (in terms of number of played matches) does not guarantee high TrueSkill scores, in line with prior literature [86].

8.4 Network generation

In the following, we explain the process used to compute the co-play performance networks. In particular, we define a short-term performance network of teammates, whose links reflect TrueSkill score variations over time, and a long-term performance network, which takes memory mechanisms into account, based on the assumption that the influence of a teammate on a player can persist over time.

8.4.1 Short-term Performance Network

Let us consider the set of 87,155 players in our post-processed Dota 2 dataset and the related matches they played. For each player $p$, we define $TS_p = [ts_{-1}, ts_0, ts_1, \ldots, ts_N]$ as the TrueSkill scores after each match played by $p$, where $ts_{-1}$ is the default TrueSkill value assigned to each player at the beginning of their history. We also define the player history as the temporally ordered set $M_p = [m_0, m_1, \ldots, m_N]$ of matches played by $p$. Each $m_i \in M_p$ is the 4-tuple $(t_1, t_2, t_3, t_4)$ of the player's teammates. Note that each match $m$ in the dataset can be represented as a 4-tuple because we consider only Public and Ranked matches, whose opposing teams are composed of 5 human players each. We can now define, for each teammate $t$ of player $p$ in match $m_i \in M_p$, the corresponding performance weight as:

$w_{p,t,i} = ts_i - ts_{i-1}, \quad (8.1)$

where $ts_i \in TS_p$ is the TrueSkill value of player $p$ after match $m_i \in M_p$. Thus, the weight $w_{p,t,i}$ captures the TrueSkill gain/loss of player $p$ when playing with a given teammate $t$. This step generates a time-varying directed network in which, at each time step (here the temporal dimension is defined by the sequence of matches), we have a set of directed links connecting the players active at that time (i.e., match) to their teammates, with weights based on the fluctuations of the players' TrueSkill levels. Next, we build the overall Short-term Performance Network (SPN) by aggregating the time-varying networks over the matches of each player. This network has a link between two nodes if the corresponding players were teammates at least once in the total temporal span of our dataset. Each link is then characterized by the sum of the previously computed weights.
Thus, given player $p$ and any possible teammate $t$ in the network, their aggregated weight $w_{p,t}$ is equal to

$w_{p,t} = \sum_{i=0}^{N} w_{p,t,i}, \quad (8.2)$

where $w_{p,t,i} = ts_i - ts_{i-1}$ if $t \in m_i$, and 0 otherwise. The resulting network has 87,155 nodes and 4,906,131 directed links with weights $w_{p,t} \in [-0.58, 1.06]$.

It is worth noting that the new TrueSkill value assigned after each match is computed on the basis of both teammates' and opponents' current skill levels. However, the TrueSkill value depends on the outcome of each match, which is shared by every teammate in the winning/losing team. With this system in place, players that do not cooperate in the game, such as players that win without performing any kill or assist, will improve their skill level because of their teammates' effort. Nevertheless, this anomalous behavior is rare (less than 1% of matches are affected) and it is smoothed out by our network model: by aggregating the weights over a long period of time, we balance out these singular instances.

8.4.2 Long-term Performance Network

If skills transfer from player to player by means of co-play, the influence of a teammate on a player should be accounted for in the player's future matches. We therefore introduce a memory-like mechanism to model this form of influence persistence. Here we show how to generate a Long-term Performance Network (LPN) in which the persistence of the influence of a certain teammate is taken into account. To this aim, we modify the weights by accumulating the discounted gain over the subsequent matches of a player as follows. Let us consider player $p$, his/her TrueSkill scores $TS_p$, and his/her temporally ordered sequence of matches $M_p$. As previously introduced, $m_i \in M_p$ corresponds to the 4-tuple $(t_1, t_2, t_3, t_4)$ of the player's teammates in that match. For each teammate $t$ of player $p$ in match $m_i \in M_p$, the long-term performance weight is defined as

$w_{p,t,i} = e^{-(i - i_{p,t})} (ts_i - ts_{i-1}), \quad (8.3)$

where $i_{p,t}$ is the index of the last match in $M_p$ in which player $p$ played with teammate $t$. Note that, if the current match $m_i$ is a match in which $p$ and $t$ play together, then $i_{p,t} = i$. Analogously to the SPN construction, we then aggregate the weights over the temporal sequence of matches; thus, the links in the aggregated network have final weights defined by Eq. (8.2). Conversely to the SPN, the only weights $w_{p,t,i}$ in the LPN that are equal to zero are those corresponding to the matches preceding the first one in which $p$ and $t$ co-play. The final weights of the Long-term Performance Network are $w_{p,t} \in [-0.54, 1.06]$.

As we can notice, the range of weights of the SPN is close to the one found in the LPN. However, these two weight formulations lead not only to different ranges of values but also to a different ranking of the links in the networks. When computing the Kendall's tau coefficient between the rankings of the links in the SPN and LPN, we find that the two networks have a positive correlation ($\tau = 0.77$ with p-value $< 10^{-3}$), but the weights' ranking is changed. As our aim is to generate a recommendation system for each player based on these weights, we further investigate the differences between the performance networks by computing the Kendall's tau coefficient over each player's ranking.
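To summarize the two weighting schemes in code form, the sketch below accumulates short-term and long-term weights from a single player's match history; the data layout and function names are hypothetical, meant only to mirror Eqs. (8.1)-(8.3).

```python
import math
from collections import defaultdict

def coplay_weights(true_skill, matches):
    """true_skill: [ts_-1, ts_0, ..., ts_N] for one player p.
    matches: list of 4-tuples with the teammates of p in each match m_i.
    Returns (short_term, long_term): teammate -> aggregated weight."""
    short_term = defaultdict(float)
    long_term = defaultdict(float)
    last_coplay = {}                 # teammate -> index of last co-played match
    for i, teammates in enumerate(matches):
        gain = true_skill[i + 1] - true_skill[i]       # ts_i - ts_{i-1}, Eq. (8.1)
        for t in teammates:
            last_coplay[t] = i
            short_term[t] += gain                      # aggregation, Eq. (8.2)
        for t, last_i in last_coplay.items():
            long_term[t] += math.exp(-(i - last_i)) * gain   # discounted gain, Eq. (8.3)
    return short_term, long_term
```

Running this routine for every player and adding a directed edge $p \rightarrow t$ with the resulting weight yields the SPN and LPN, respectively.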
Figure 8.3: Kendall's tau coefficient distribution computed by comparing each player's ranking in the short-term and long-term performance networks.

Fig. 8.3 shows the distribution of the Kendall's tau coefficient computed by comparing each player's ranking in the SPN and LPN. In particular, only a small portion of players have the same teammate ranking in both networks, and 87.8% of the remaining players have different rankings for their top-10 teammates. The recommendation system that we design will therefore provide different recommendations depending on which performance network is used. On the one hand, when using the SPN the system will recommend a teammate that leads to an instant skill gain; for example, a teammate who is good at coordinating the team but from whom the player does not necessarily learn how to improve his/her own performance. On the other hand, when using the LPN the system will recommend a teammate that leads to an increasing skill gain over the following matches: even if the instant skill gain with a teammate is not high, the player could learn some effective strategies and increase his/her skill gain in the successive matches.

8.4.3 LCC and network properties

Given a co-play performance network (short-term or long-term), to carry out our performance prediction we take into account only the links in the network that have reliable weights. If two players play together only a few times, our confidence in the corresponding weight is low: for example, if two players are teammates just once, their final weight depends only on that unique instance and might thus lead to biased results. To address this issue, we computed the distribution of the number of occurrences with which a pair of teammates plays together in our network (shown in Fig. 8.4) and set a threshold based on these values. In particular, we decided to retain only pairs that played more than 2 matches together. Finally, as many node embedding methods require a connected network as input [4], we extract the Largest Connected Component (LCC) of the performance network, which will be used for the performance prediction and evaluation. The LCC includes the same number of nodes and links for both the SPN and the LPN: 38,563 nodes and 1,444,290 links. We compare the characteristics of the initial network and its LCC in Tab. 8.1.

Figure 8.4: Distribution of the number of occurrences per link, i.e., the number of times a pair of teammates plays together.

Table 8.1: Comparison of the overall performance networks' characteristics and their LCC. The number of nodes and links is the same for both the Short-term Performance Network (SPN) and the Long-term Performance Network (LPN), while the range of weights varies from one case to the other.

           # nodes   # links     SPN weights      LPN weights
Network    87,155    4,906,131   [-0.58, 1.06]    [-0.54, 1.06]
LCC        38,563    1,444,290   [-0.58, 1.06]    [-0.54, 1.06]
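A minimal networkx sketch of this filtering step might look as follows; the co-play count attribute and the use of weakly connected components for the directed network are assumptions made for illustration.

```python
import networkx as nx

def filter_and_extract_lcc(spn, min_coplay=3):
    """spn: directed performance network whose edges carry an aggregated
    'weight' and a 'coplay' count (number of matches played together)."""
    # Keep only links backed by more than 2 co-played matches.
    reliable = [(u, v) for u, v, d in spn.edges(data=True)
                if d.get("coplay", 0) >= min_coplay]
    filtered = spn.edge_subgraph(reliable).copy()

    # Largest (weakly) connected component of the directed network.
    lcc_nodes = max(nx.weakly_connected_components(filtered), key=len)
    return filtered.subgraph(lcc_nodes).copy()
```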
8.5 Performance Prediction

In the following, we test whether the co-play performance networks have intrinsic structures that allow us to predict the performance of players when matched with unknown teammates. Such a prediction, if possible, could help us recommend teammates to a player in a way that maximizes his/her skill improvement.

8.5.1 Problem Formulation

Consider the co-play performance network $G = (V, E)$ with weighted adjacency matrix $W$. A weighted link $(i, j, w_{ij})$ denotes that player $i$ gets a performance variation of $w_{ij}$ after playing with player $j$. We can formulate the recommendation problem as follows: given an observed instance of a co-play performance network $G = (V, E)$, we want to predict the weight of each unobserved link $(i, j) \notin E$ and use this result to further predict, for each player $i \in V$, the ranking of all other players $j \in V$ ($j \neq i$).

8.5.2 Network Modeling

Does the co-play performance network contain information or patterns that can be indicative of skill gain for unseen pairs of players? If so, how do we model the network structure to find such patterns? Are such patterns best represented via deep neural networks or via more traditional factorization techniques? To answer these questions, we modify a deep neural network autoencoder and test its predictive power against two classes of approaches widely applied in recommendation systems: (a) factorization based [3, 116, 194], and (b) deep neural network based [213, 30, 113]. Note that the deep neural network based approaches to recommendation use different variations of deep autoencoders to learn a low-dimensional manifold that captures the inherent structure of the data. More recently, variational autoencoders have been tested for this task and have been shown to slightly improve the performance over traditional autoencoders [113]. In this chapter, we focus on understanding the importance of applying neural network techniques instead of the factorization models traditionally used in recommendation tasks; subtle variations of the autoencoder architecture to further improve performance are left as future work.

8.5.2.1 Factorization

In a factorization based model for directed networks, the goal is to obtain two low-dimensional matrices $U \in \mathbb{R}^{n \times d}$ and $V \in \mathbb{R}^{n \times d}$, with $d$ hidden dimensions, such that the following function is minimized:

$f(U, V) = \sum_{(i,j) \in E} \left( (w_{ij} - \langle u_i, v_j \rangle)^2 + \frac{\lambda}{2}(\|u_i\|^2 + \|v_j\|^2) \right) \quad (8.4)$

The sum in (8.4) is computed over the observed links to avoid penalizing the unobserved ones, as overfitting to zeros would deteriorate the predictions. Here, $\lambda$ is a regularization parameter that gives preference to simpler models for better generalization.
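A minimal sketch of this objective, minimized with plain gradient descent over the observed links only, is given below; the hyperparameter values are made up for illustration and this is not the exact implementation evaluated here.

```python
import numpy as np

def graph_factorization(edges, n, d=128, lam=1e-2, lr=1e-2, epochs=50, seed=0):
    """edges: list of (i, j, w_ij) observed directed links.
    Returns source and target embeddings U, V of shape (n, d)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, d))
    V = rng.normal(scale=0.1, size=(n, d))
    for _ in range(epochs):
        for i, j, w in edges:
            err = w - U[i] @ V[j]                 # (w_ij - <u_i, v_j>)
            grad_u = -2 * err * V[j] + lam * U[i]  # gradient of Eq. (8.4) wrt u_i
            grad_v = -2 * err * U[i] + lam * V[j]  # gradient of Eq. (8.4) wrt v_j
            U[i] -= lr * grad_u
            V[j] -= lr * grad_v
    return U, V

# Predicted weight of an unobserved link (i, j): U[i] @ V[j]
```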
8.5.2.2 Traditional Autoencoder

Figure 8.5: An example of a deep autoencoder model.

Autoencoders are unsupervised neural networks that minimize the loss between reconstructed and input vectors. A traditional autoencoder is composed of two parts (cf. Figure 8.5): (a) an encoder, which maps the input vector into low-dimensional latent variables; and (b) a decoder, which maps the latent variables to an output vector. The reconstruction loss can be written as

$L = \sum_{i=1}^{n} \|\hat{x}_i - x_i\|_2^2, \quad (8.5)$

where the $x_i$ are the inputs and $\hat{x}_i = f(g(x_i))$; $f(\cdot)$ and $g(\cdot)$ are the decoder and encoder functions, respectively. Deep autoencoders have recently been adapted to the network setting [213, 30, 113]. The algorithm proposed by Wang et al. [213] jointly optimizes the autoencoder reconstruction error and the Laplacian Eigenmaps [14] error to learn representations for undirected networks. However, this "Traditional Autoencoder" equally penalizes observed and unobserved links in the network, and the model adapted to the network setting cannot be applied when the network is directed. Thus, we propose to modify the Traditional Autoencoder model as follows.

8.5.2.3 Teammate Autoencoder

To model directed networks, we propose a modification of the Traditional Autoencoder model that takes into account the adjacency matrix representing the directed network. Moreover, in this formulation we only penalize the observed links in the network, as our aim is to predict the weight and the corresponding ranking of the unobserved links. We write our "Teammate Autoencoder" reconstruction loss as:

$L = \sum_{i=1}^{n} \|(\hat{x}_i - x_i) \odot [a_{i,j}]_{j=1}^{n}\|_2^2, \quad (8.6)$

where $a_{ij} = 1$ if $(i, j) \in E$, and 0 otherwise. Here, $x_i$ represents the $i$-th row of the adjacency matrix and $n$ is the number of nodes in the network. Thus, the model takes each row of the adjacency matrix representing the performance network as input and outputs an embedding for each player such that it can reconstruct the observed edges well. For example, if there are 3 players and player 2 helps improve player 1's performance by a factor of $\alpha$, player 1's row would be $[0, \alpha, 0]$. We train the model by minimizing the above loss function using stochastic gradient descent and calculate the gradients using back-propagation. Minimizing this loss function yields the neural network weights $W$ and the learned representation of the network $Y \in \mathbb{R}^{n \times d}$. The layers in the neural network, the activation function, and the regularization coefficients serve as the hyperparameters of this model. Algorithm 1 summarizes our methodology.
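For illustration, a minimal Keras sketch of this masked reconstruction loss is given below. The layer sizes and the way observed links are identified (as the nonzero entries of the row) are simplifying assumptions; the thesis' Algorithm 1 and its released code may differ.

```python
import tensorflow as tf
from tensorflow import keras

n, d = 38563, 256   # nodes in the LCC, embedding dimension

# Encoder / decoder over a player's row of the weighted adjacency matrix.
x_in = keras.Input(shape=(n,))
h = keras.layers.Dense(512, activation="relu")(x_in)
embedding = keras.layers.Dense(d, activation="relu", name="embedding")(h)
h = keras.layers.Dense(512, activation="relu")(embedding)
x_hat = keras.layers.Dense(n, activation="linear")(h)
autoencoder = keras.Model(x_in, x_hat)

def teammate_loss(x_true, x_pred):
    """Eq. (8.6): only observed links are penalized. As an approximation,
    observed entries are taken to be the nonzero entries of the row."""
    mask = tf.cast(tf.not_equal(x_true, 0.0), tf.float32)
    return tf.reduce_sum(tf.square((x_pred - x_true) * mask), axis=-1)

autoencoder.compile(optimizer="adam", loss=teammate_loss)
# autoencoder.fit(W_rows, W_rows, epochs=50, batch_size=128)
```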
8.5.3 Evaluation Framework

8.5.3.1 Experimental Setting

Figure 8.6: Evaluation framework. The co-play network is divided into training and test networks. The parameters of the models are learned using the training network. We obtain multiple test subnetworks using random walk sampling with random restarts and input the nodes of these subnetworks to the models for prediction. The predicted weights are then evaluated against the test link weights to obtain the various metrics.

To evaluate the performance of the models on the task of teammate recommendation, we use the cross-validation framework illustrated in Fig. 8.6. We randomly "hide" 20% of the weighted links and use the rest of the network to learn the embedding, i.e., the representation, of each player in the network. We then use each player's embedding to predict the weights of the unobserved links. As the number of player pairs is too large, we evaluate the models on multiple samples of the co-play performance networks (similar to [158, 73]) and report the mean and standard deviation of the metrics used. Instead of uniformly sampling the players as done in [158, 73], we use random walks [9] with random restarts to generate sampled networks with degree and weight distributions similar to those of the original network. Fig. 8.7 illustrates these distributions for a sampled network of 1,024 players (nodes). Further, we obtain the optimal hyperparameter values of the models using a grid search over a set of values. For Graph Factorization, we vary the regularization coefficient in powers of 10, $\lambda \in [10^{-5}, 1]$. For the deep neural network based models, we use ReLU as the activation function and choose the neural network structure via an informal search over a set of architectures. We set the $l_1$ and $l_2$ regularization coefficients by performing a grid search on $[10^{-5}, 10^{-1}]$.

Figure 8.7: Distribution of the weights of the network sampled using random walks.

8.5.3.2 Evaluation Metrics

We use Mean Squared Error (MSE), Mean Absolute Normalized Error (MANE), and AvgRec@k as evaluation metrics. MSE evaluates the accuracy of the predicted weights, whereas MANE and AvgRec@k evaluate the ranking obtained by the model. First, we compute MSE, typically used in recommendation systems, to evaluate the error in the prediction of weights. For our problem we use the following formula:

$MSE = \|w_{test} - w_{pred}\|^2, \quad (8.7)$

where $w_{test}$ is the list of weights of the links in the test subnetwork and $w_{pred}$ is the list of weights predicted by the model. Thus, MSE computes how well the model can predict the weights of the network; a lower value implies better prediction. Second, we use AvgRec@k to evaluate the ranking of the weights in the overall network. It is defined as:

$AvgRec@k = \frac{\sum_{i=1}^{k} w_{test}^{index(i)}}{k}, \quad (8.8)$

where $index(i)$ is the index of the $i$-th highest predicted link in the test network. It computes the average gain in performance for the top $k$ recommendations; a higher value implies that the model can make better recommendations. Finally, to test the models' recommendations for each player, we define the Mean Absolute Normalized Error (MANE), which computes the normalized difference between the predicted and actual rankings of the test links among the observed links and averages it over the nodes. Formally, it can be written as

$MANE(i) = \frac{\sum_{j=1}^{|E_i^{test}|} |rank_i^{pred}(j) - rank_i^{test}(j)|}{|E_i^{train}| \, |E_i^{test}|}, \qquad MANE = \frac{\sum_{i=1}^{|V|} MANE(i)}{|V|}, \quad (8.9)$

where $rank_i^{pred}(j)$ represents the rank of the $j$-th vertex in the list of weights predicted for player $i$. A lower MANE value implies that the ranking of the recommended players is similar to the actual ranking according to the test set.
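For illustration, MSE and AvgRec@k can be computed directly from the predicted and held-out weights; the hypothetical numpy sketch below follows the formulas above.

```python
import numpy as np

def mse(w_test, w_pred):
    """Eq. (8.7): squared error between held-out and predicted link weights."""
    return float(np.sum((np.asarray(w_test) - np.asarray(w_pred)) ** 2))

def avg_rec_at_k(w_test, w_pred, k):
    """Eq. (8.8): average true weight of the k links ranked highest by the model."""
    top_k = np.argsort(w_pred)[::-1][:k]
    return float(np.mean(np.asarray(w_test)[top_k]))
```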
8.5.4 Results and Analysis

Figure 8.8: Short-term Performance Network. (a) Mean Squared Error (MSE) gain of the models over the average prediction. (b) Mean Absolute Normalized Error (MANE) gain of the models over the average prediction. (c) AvgRec@k of the models.

In the following, we evaluate the results provided by Graph Factorization, the Traditional Autoencoder, and our Teammate Autoencoder. To this aim, we first analyze the models' performance on both the SPN and the LPN with respect to the MSE measure, computed as in Eq. (8.7), in Fig. 8.8(a) and Fig. 8.9(a), respectively. In this case, we compare the models against an "average" baseline, in which we compute the average performance of the player pairs observed in the training set and use it as the prediction for each hidden teammate link. Fig. 8.8(a) and Fig. 8.9(a) show the variation of the percentage MSE gain (average and standard deviation) as the number of latent dimensions $d$ increases for each model.

Figure 8.9: Long-term Performance Network. (a) Mean Squared Error (MSE) gain of the models over the average prediction. (b) Mean Absolute Normalized Error (MANE) gain of the models over the average prediction. (c) AvgRec@k of the models.

We can observe that the Graph Factorization model generally performs worse than the baseline, with values in [-1.64%, -0.56%] and an average of -1.2% for the SPN, and values in [-1.35%, -0.74%] and an average of -1.05% for the LPN. This suggests that the performance networks of Dota 2 require deep neural networks to capture their underlying structure. However, a traditional model is not enough to outperform the baseline: the Traditional Autoencoder reaches only marginal improvements, with values in [0.0%, 0.55%] and an average gain of 0.18% for the SPN, and values in [0.0%, 0.51%] and an average gain of 0.20% for the LPN. In contrast, our Teammate Autoencoder achieves substantial gains over the baseline across the whole spectrum, and its performance generally increases at higher dimensions (which can retain more structural information). The average MSE gain of the Teammate Autoencoder over the baseline spans from 6.34% to 11.06% for the SPN and from 6.68% to 11.34% for the LPN, with an average gain over all dimensions of 9.00% for the SPN and 9.15% for the LPN. We also computed the MSE averaged over 10 runs with $d = 1{,}024$, shown in Tab. 8.2, which decreases from 4.55 for the baseline prediction to 4.15 for our Teammate Autoencoder on the SPN, and from 4.40 to 3.91 on the LPN.

We then compare the models' performance in providing individual recommendations by analyzing the MANE metric, computed as in Eq. (8.9). Fig. 8.8(b) and Fig. 8.9(b) show the percentage MANE gain for different dimensions computed against the average baseline, for the SPN and the LPN respectively. Analogously to the MSE case, Graph Factorization performs worse than the baseline (values in [-3.34%, -1.48%] with an average gain of -2.37% for the SPN, and values in [-3.78%, -0.78%] with an average gain of -2.79% for the LPN) despite the increase in the number of dimensions. The Traditional Autoencoder achieves marginal gains over the baseline for dimensions higher than 128 ([0.0%, 0.37%] for the SPN and [0.0%, 0.5%] for the LPN), with an average gain over all dimensions of 0.16% for the SPN and 0.19% for the LPN. Our model instead attains a significant percentage gain in individual recommendations over the baseline. For the SPN, it achieves an average percentage MANE gain spanning from 14.81% to 22.78%, with an overall average of 19.50%. For the LPN, the average percentage MANE gain spans from 16.81% to 22.32%, with an overall average of 19.29%. It is worth noting that the performance in this case does not increase monotonically with the number of dimensions, which might imply that for individual recommendations the model overfits at higher dimensions. We report the average MANE values for $d = 1{,}024$ in Tab. 8.2: our model obtains average values of 0.059 and 0.062 for the SPN and LPN respectively, compared to 0.078 for the average baseline in both cases.

Finally, we compare our models against the ideal recommendation in the test subnetwork to understand how close our top recommendations are to the ground truth. To this aim, we report the AvgRec@k metric, which computes the average weight of the top $k$ links recommended by the models, as in Eq. (8.8). In Fig. 8.8(c) and Fig. 8.9(c), we can observe that the Teammate Autoencoder significantly outperforms the other models, for both the SPN and the LPN. The theoretical maximum line shows the AvgRec@k values obtained by selecting the top $k$ recommendations for the entire network using the test set. For the SPN, the link with the highest predicted weight by our model achieves a performance gain of 0.38, as opposed to 0.1 for Graph Factorization; this gain is close to the ideal prediction, which achieves 0.52. For the LPN, our model achieves a performance gain of 0.3, as opposed to 0.1 for Graph Factorization. The performance of our model remains higher for all values of $k$, showing that the ranking of the links produced by our model is close to the ideal ranking. Note that the Traditional Autoencoder yields poor performance on this task, which signifies the importance of the relative weighting of observed and unobserved links.

8.6 Related Work

There is a broad body of research focusing on online games to identify which characteristics influence different facets of human behavior.
8.6 Related Work

There is a broad body of research focusing on online games to identify which characteristics influence different facets of human behavior. On the one hand, this research focuses on the cognitive aspects that are triggered and affected when playing online games, including but not limited to gamer motivations to play [231, 101, 40, 208], learning mechanisms [192, 193], and player performance and acquisition of expertise [187]. On the other hand, players and their performance are classified in terms of in-game specifics, such as combat patterns [227, 52], roles [57, 124, 183], and actions [222, 103, 182].

Aside from these different gaming features, multiplayer online games are especially distinguished from other games by their inherent cooperative design. In such games, players not only have to learn individual strategies, but also to organize and coordinate to reach better results. This intrinsic social aspect has been a focal research topic [53, 137, 96, 208, 186]. In [43], the authors show that multiplayer online games provide an environment in which social interactions among players can evolve into strong friendship relationships. Moreover, the study shows how the social aspect of online gaming is a strong component of players' enjoyment of the game. Another study [172, 173] ranked the different factors that influence player performance in MOBA games. Among these factors, the number of friends was found to play a key role in successful teams. In the present work, we focused on social contacts at a higher level: co-play relations. Teammates, either friends or strangers, can affect other players' styles through communication, by trying to exert influence over others, etc. [117, 237, 123]. Moreover, we leveraged these teammate-related effects on player performance to build a teammate recommendation system for players in Dota 2.

Recommendation systems have been widely studied in the literature, with applications such as movies, music, restaurants and grocery products [125, 210, 66, 122]. The current work on such systems can be broadly categorized into: (i) collaborative filtering [194, 115, 134], (ii) content-based filtering [166, 190], and (iii) hybrid models [25, 102]. Collaborative filtering is based on the premise that users with similar interests in the past will tend to agree in the future as well. Content-based models learn the similarity between users and content descriptions. Hybrid models combine the strengths of both of these systems with varying hybridization strategies. In the specific case of MOBA games, recommendation systems are mainly designed to advise players on the type of character (hero) they impersonate (e.g., Dota Picker, http://dotapicker.com/) [44, 2, 36]. Few works have addressed the problem of recommending teammates in MOBA games. In [209], the authors discuss how to improve matchmaking for players based on the teammates they had in their past history. They focus on the creation and analysis of the properties of different networks in which the links are formed based on different rules, e.g., players that played together in the same match, in the same team, in adversarial teams, etc. These networks are then finally used to design a matchmaking algorithm to improve social cohesion between players. However, the authors focus on different relationships to build their networks and on the strength of network links to design their algorithm, while no information about the actual player performance is taken into account. We here aim at combining both the presence of players in the same team (and the number of times they play together) and the effect that these combinations have on player performance, by looking at the skill gain/loss after the game.
8.7 Conclusions

In this chapter, we set out to study the complex interplay between cooperation, teams and teammate recommendation, and players' performance in online games. Our study tackled three specific problems: (i) understanding the short- and long-term influence of teammates on players' performance; (ii) recommending teammates with the aim of improving players' skills and performance; and (iii) demonstrating a deep neural network that can predict such performance improvements.

We used Dota 2, a popular Multiplayer Online Battle Arena game hosting millions of players and matches every day, as a virtual laboratory to understand performance and the influence of teammates. We used our dataset to build a co-play network of players, with weights representing a teammate's short-term influence on a player's performance. We also developed a variant of this weighting algorithm that incorporates a memory mechanism, implementing the assumption that a player's performance and skill improvements carry over into future games (i.e., long-term influence): influence can be understood as a longitudinal process that can improve or hinder a player's performance over time. With this framework in place, we demonstrated the feasibility of a recommendation system that suggests new teammates who are beneficial for a player to play with in order to improve their individual performance. This system, based on a modified autoencoder model, yields state-of-the-art recommendation accuracy, outperforming graph factorization techniques considered among the best in the recommendation systems literature, and closing the existing gap with the maximum improvement that is theoretically achievable. Our experimental results suggest that skill transfer and performance improvement can be accurately predicted with deep neural networks.

We plan to extend this work in multiple future directions. First, our current framework only takes into account the individual skill of players to recommend teammates that are beneficial to improving a player's performance in the game. However, multiple aspects of the game can play a key role in influencing individual performance, such as the impersonated role, the presence of friends or strangers in the team, the cognitive budget of players, and their personality. Thus, we are planning to extend our current framework to take these aspects of the game into account and to train a model that recommends teammates on the basis of these multiple factors. Second, from a theoretical standpoint, we intend to determine whether our framework can be generalized to generate recommendations and predict team and individual performance in a broader range of scenarios, beyond online games. We will explore whether more sophisticated factorization techniques based on tensors, rather than matrices, can be leveraged within our framework, as such techniques have recently shown promising results in human behavioral modeling [182, 92, 91]. We also plan to demonstrate, from an empirical standpoint, that the recommendations produced by our system can be implemented in real settings. We will carry out randomized controlled trials in lab settings to test whether individual performance in teamwork-based tasks can be improved. One additional direction will be to extend our framework to recommend incentives alongside teammates, to establish whether we can computationally suggest incentive-based strategies to further motivate individuals and improve their performance within teams.
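To complement Algorithm 4 below, here is a minimal PyTorch sketch of the kind of model it describes: each row of the co-play weight matrix W is reconstructed by an autoencoder, and reconstruction errors on observed links are up-weighted relative to unobserved ones (the [a_{i,j}] term in the loss). The layer sizes, the penalty value beta and the training settings are illustrative assumptions, not the exact architecture or hyperparameters used in this chapter.

    import torch
    import torch.nn as nn

    class TeammateAE(nn.Module):
        def __init__(self, n_players, d=128):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(n_players, 512), nn.ReLU(), nn.Linear(512, d))
            self.decoder = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, n_players))

        def forward(self, x):
            y = self.encoder(x)              # d-dimensional player embedding
            return self.decoder(y), y        # reconstructed co-play weights and embedding

    def weighted_loss(x_hat, x, beta=10.0):
        # Up-weight reconstruction errors on observed links (non-zero entries of W),
        # mirroring the [a_ij] weighting in the loss of Algorithm 4.
        a = torch.where(x != 0, torch.full_like(x, beta), torch.ones_like(x))
        return (((x_hat - x) * a) ** 2).sum()

    def train_embeddings(W, d=128, epochs=50, lr=1e-3):
        # W: the |V| x |V| weight matrix returned by CreatePerformanceNetwork in Algorithm 4.
        X = torch.as_tensor(W, dtype=torch.float32)
        model = TeammateAE(X.shape[1], d)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            x_hat, _ = model(X)
            loss = weighted_loss(x_hat, X)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return model.encoder(X).detach()     # final player embeddings Y

In such a sketch, teammate recommendations for player i can be read off by ranking the reconstructed weights x_hat[i] over unobserved teammates, which is the quantity evaluated against the held-out test subnetwork in Section 8.5.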
Algorithm 4: Teammate Autoencoder

Function CreatePerformanceNetwork(Player histories {M_p}, TrueSkill scores {TS_p}, NetworkType in {SPN, LPN}):
    for p = 1 ... P do
        ΔTS_p ← TS_p[1:] − TS_p[:−1]    // compute the TrueSkill gains of p
        for i = 0 ... N do
            if NetworkType is SPN then
                W[p, t] ← W[p, t] + ΔTS_p[i] for each t ∈ M_p[i]    // compute the weight for each (p, t)
            else
                idx ← getLastSeenIndex(p, t)    // get the index of the last match played by p with t
                W[p, t] ← W[p, t] + exp(−(i − idx)) · ΔTS_p[i] for each t ∈ M_p[i]    // compute the weight for each (p, t)
                updateLastSeenIndex(p, t, i)    // update index
    return W

Function TeammateAutoencoder(Player histories {M_p}, TrueSkill scores {TS_p}):
    W ← CreatePerformanceNetwork({M_p}, {TS_p})    // create the performance network using player histories and TrueSkill scores
    θ ← RandomInit()    // initialize the autoencoder weights
    F ← {(w_i)} for each w_i ∈ W    // construct the autoencoder input set from the weight matrix
    for iter = 1 ... I do
        Randomly sample minibatch M from F    // get minibatch
        L ← Σ_{i=1}^{n} ||(x̂_i − x_i) ⊙ [a_{i,j}]_{j=1}^{n}||_2^2    // compute the weighted reconstruction loss
        grad ← ∂L/∂θ    // compute the gradient
        θ ← UpdateGradSGD(θ, grad)    // update the model weights using the gradient
    Y ← EncoderForwardPass(G, θ)    // compute the embedding using the encoder
    return Y

Table 8.2: Average and standard deviation of player performance prediction (MSE) and teammate recommendation (MANE) for d = 1,024 in both SPN and LPN.

                             MSE SPN      MANE SPN       MSE LPN      MANE LPN
Baseline prediction          4.55/0.14    0.078/0.02     4.40/0.14    0.078/0.01
Graph Factorization          4.59/0.17    0.081/0.02     4.45/0.18    0.084/0.021
Traditional Autoencoder      4.54/0.15    0.074/0.01     4.37/0.13    0.075/0.012
Teammate Autoencoder         4.15/0.14    0.059/0.008    3.91/0.10    0.062/0.008

Chapter 9

Conclusion

Graph embedding is a growing field with applications to a variety of problems such as graph reconstruction, link prediction and node classification. In this thesis, I provided contributions to the field of graph embedding in three areas: (i) universality of graph embedding, (ii) dynamic graph embedding, and (iii) embedding graphs with edge attributes. To understand the universality of embedding methods, I studied synthetic and real networks and applied embedding approaches to them on various graph tasks to understand the variance in performance. I further provided a benchmark framework to evaluate a new embedding approach as well as to recommend a method for a given graph. The thesis also delved into dynamic graphs. I introduced two models in this direction: (i) DynGEM, a model to update graph embeddings efficiently for evolving graphs, and (ii) dyngraph2vec, a model to capture network dynamics. Another key contribution of this thesis was to develop an embedding approach that can capture edge attributes, aiding in downstream tasks such as link prediction and node classification. I proposed ELAINE, an approach that takes into account higher-order proximity, social roles and edge attributes, and tested it to show its effect on link prediction and node classification tasks. Finally, I presented an application to teammate recommendation to showcase the utility of graph embedding in gaming scenarios.

Reference List

[1] Lada A Adamic and Eytan Adar. Friends and neighbors on the web. Social networks, 25(3):211-230, 2003.

[2] Atish Agarwala and Michael Pearce. Learning dota 2 team compositions. Technical report, Stanford University, 2014.

[3] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J Smola. Distributed large-scale natural graph factorization.
In Proceedings of the 22nd international conference on World Wide Web, pages 37{48. ACM, 2013. [4] Nesreen K Ahmed, Ryan A Rossi, Rong Zhou, John Boaz Lee, Xiangnan Kong, Theodore L Willke, and Hoda Eldardiry. A framework for generalizing graph-based representation learning methods. arXiv preprint arXiv:1709.04596, 2017. [5] Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed member- ship stochastic blockmodels. J. Mach. Learn. Res., 9:1981{2014, June 2008. [6] Mohammad Al Hasan and Mohammed J Zaki. A survey of link prediction in social networks. In Social network data analytics, pages 243{275. 2011. [7] R eka Albert, Hawoong Jeong, and Albert-L aszl o Barab asi. Error and attack tolerance of complex networks. nature, 406(6794):378{382, 2000. [8] Arik Azran. The rendezvous algorithm: Multiclass semi-supervised learning with markov random walks. In Proceedings of the 24th international conference on Machine learning, pages 49{56, 2007. [9] Lars Backstrom and Jure Leskovec. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 635{644. ACM, 2011. [10] Shumeet Baluja, Rohan Seth, D Sivakumar, Yushi Jing, Jay Yagnik, Shankar Kumar, Deepak Ravichandran, and Mohamed Aly. Video suggestion and discovery for youtube: taking random walks through the view graph. In Proc. 17th int. conference on World Wide Web, pages 895{904, 2008. [11] Albert-L aszl o Barab asi and R eka Albert. Emergence of scaling in random networks. science, 286(5439):509{512, 1999. [12] Victor Battistich, Daniel Solomon, and Kevin Delucchi. Interaction processes and student outcomes in cooperative learning groups. The Elementary School Journal, 94(1):19{32, 1993. [13] Bianca Beersma, John R Hollenbeck, Stephen E Humphrey, Henry Moon, Donald E Conlon, and Daniel R Ilgen. Cooperation, competition, and team performance: Toward a contingency approach. Academy of Management Journal, 46(5):572{590, 2003. 167 [14] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embed- ding and clustering. In NIPS, volume 14, pages 585{591, 2001. [15] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798{1828, 2013. [16] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798{1828, 2013. [17] Smriti Bhagat, Graham Cormode, and S Muthukrishnan. Node classication in social networks. In Social network data analytics, pages 115{148. Springer, 2011. [18] Smriti Bhagat, Irina Rozenbaum, and Graham Cormode. Applying link-based classication to label blogs. In Proceedings of WebKDD: workshop on Web mining and social network analysis, pages 92{101. ACM, 2007. [19] Y-lan Boureau, Yann L Cun, et al. Sparse feature learning for deep belief networks. In Advances in neural information processing systems, pages 1185{1192, 2008. [20] Matthew Brand. Continuous nonlinear dimensionality reduction by kernel eigenmaps. In IJCAI, pages 547{554, 2003. [21] Matthew Brand. Fast low-rank modications of the thin singular value decomposition. Linear algebra and its applications, 415(1):20{30, 2006. 
[22] Bobby-Joe Breitkreutz, Chris Stark, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, Michael Livstone, Rose Oughtred, Daniel H Lackner, J urg B ahler, Valerie Wood, et al. The biogrid interaction database: 2008 update. Nucleic acids research, 36(suppl 1):D637{D640, 2008. [23] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013. [24] Horst Bunke and Kaspar Riesen. Recent advances in graph-based pattern recognition with applications in document analysis. Pattern Recognition, 44(5):1057{1067, 2011. [25] Robin Burke. Hybrid recommender systems: Survey and experiments. User modeling and user-adapted interaction, 12(4):331{370, 2002. [26] Hongyun Cai, Vincent W Zheng, and Kevin Chen-Chuan Chang. A comprehensive survey of graph embedding: Problems, techniques, and applications. IEEE Transactions on Knowledge and Data Engineering, 30(9):1616{1637, 2018. [27] Michel Callon, Jean Pierre Courtial, and Francoise Laville. Co-word analysis as a tool for describing the network of interactions between basic and technological research: The case of polymer chemsitry. Scientometrics, 22(1):155{205, 1991. [28] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 891{900. ACM, 2015. [29] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Grarep: Learning graph representations with global structural information. In KDD15, pages 891{900, 2015. 168 [30] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning graph representations. In Proceedings of the Thirtieth AAAI Conference on Articial Intelligence, pages 1145{1152. AAAI Press, 2016. [31] Deepayan Chakrabarti, Yiping Zhan, and Christos Faloutsos. R-mat: A recursive model for graph mining. In Proceedings of the 2004 SIAM International Conference on Data Mining, pages 442{446. SIAM, 2004. [32] Jonathan Chang and David Blei. Relational topic models for document networks. In Proceedings of the Twelfth International Conference on Articial Intelligence and Statistics, 2009. [33] Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. Heterogeneous network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 119{128. ACM, 2015. [34] Haochen Chen, Bryan Perozzi, Yifan Hu, and Steven Skiena. Harp: Hierarchical representa- tion learning for networks. arXiv preprint arXiv:1706.07845, 2017. [35] Tianqi Chen, Ian Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015. [36] Zhengxing Chen, Truong-Huy D Nguyen, Yuyu Xu, Christopher Amato, Seth Cooper, Yizhou Sun, and Magy Seif El-Nasr. The art of drafting: a team-oriented hero recommendation system for multiplayer online battle arena games. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 200{208. ACM, 2018. [37] Xiang Cheng, Sen Su, Zhongbao Zhang, Hanchi Wang, Fangchun Yang, Yan Luo, and Jie Wang. Virtual network embedding through topology-aware node ranking. ACM SIGCOMM Computer Communication Review, 41(2):38{47, 2011. [38] Marcus D Childress and Ray Braswell. Using massively multiplayer online role-playing games for online learning. Distance Education, 27(2):187{196, 2006. 
[39] Yoon-Sik Cho, Greg Ver Steeg, Emilio Ferrara, and Aram Galstyan. Latent space model for multi-modal social data. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 447{458, Republic and Canton of Geneva, Switzerland, 2016. International World Wide Web Conferences Steering Committee. [40] Dongseong Choi and Jinwoo Kim. Why people continue to play online games: In search of critical design factors to increase customer loyalty to online contents. CyberPsychology & behavior, 7(1):11{24, 2004. [41] Aaron Clauset, Cristopher Moore, and Mark EJ Newman. Hierarchical structure and the prediction of missing links in networks. Nature, 453(7191):98{101, 2008. [42] Elizabeth G Cohen. Restructuring the classroom: Conditions for productive small groups. Review of educational research, 64(1):1{35, 1994. [43] Helena Cole and Mark D Griths. Social interactions in massively multiplayer online role-playing gamers. CyberPsychology & Behavior, 10(4):575{583, 2007. [44] Kevin Conley and Daniel Perry. How does he saw me? a recommendation engine for picking heroes in dota 2. Np, nd Web, 7, 2013. 169 [45] Albert Cui, Howard Chung, and Nicholas Hanson-Holtry. Yasp 3.5 million data dump. -, 2015. [46] Hanjun Dai, Yichen Wang, Rakshit Trivedi, and Le Song. Deep coevolutionary network: Embedding user and item features for recommendation. 2017. [47] Micha el Deerrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural net- works on graphs with fast localized spectral ltering. In Advances in Neural Information Processing Systems, pages 3844{3852, 2016. [48] Morton Deutsch. The eects of cooperation and competition upon group process. Group dynamics, pages 552{576, 1960. [49] Giuseppe Di Battista, Peter Eades, Roberto Tamassia, and Ioannis G Tollis. Algorithms for drawing graphs: an annotated bibliography. Computational Geometry, 4(5):235{282, 1994. [50] Chris HQ Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, and Horst D Simon. A min-max cut algorithm for graph partitioning and data clustering. In International Conference on Data Mining, pages 107{114. IEEE, 2001. [51] Carl Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016. [52] Anders Drachen, Matthew Yancey, John Maguire, Derrek Chu, Iris Yuhui Wang, Tobias Mahlmann, Matthias Schubert, and Diego Klabajan. Skill-based dierences in spatio- temporal team behaviour in defence of the ancients 2 (dota 2). In Games media entertainment (GEM), 2014 IEEE, pages 1{8. IEEE, 2014. [53] Nicolas Ducheneaut, Nicholas Yee, Eric Nickell, and Robert J Moore. Alone together?: exploring the social dynamics of massively multiplayer online games. In Proceedings of the SIGCHI conference on Human Factors in computing systems, pages 407{416. ACM, 2006. [54] Daniel M Dunlavy, Tamara G Kolda, and Evrim Acar. Temporal link prediction using matrix and tensor factorizations. ACM Transactions on Knowledge Discovery from Data (TKDD), 5(2):10, 2011. [55] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Al an Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular ngerprints. In Advances in neural information processing systems, pages 2224{2232, 2015. [56] Peter Eades and Lin Xuemin. How to draw a directed graph. In Visual Languages, 1989., IEEE Workshop on, pages 13{17. IEEE, 1989. [57] Christoph Eggert, Marc Herrlich, Jan Smeddinck, and Rainer Malaka. Classication of player roles in the team-based multi-player game dota 2. 
In International Conference on Entertainment Computing, pages 112{125. Springer, 2015. [58] Arpad E Elo. The rating of chessplayers, past and present. Arco Pub., 1978. [59] Tom as Feder and Rajeev Motwani. Clique partitions, graph compression and speeding- up algorithms. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 123{133, 1991. [60] Emilio Ferrara. Disinformation and social bot operations in the run up to the 2017 french presidential election. First Monday, 22(8), 2017. 170 [61] Daniel R Figueiredo, Leonardo FR Ribeiro, and Pedro HP Saverese. struc2vec: Learning node representations from structural identity. arXiv preprint arXiv:1704.03165, 2017. [62] Francois Fouss, Alain Pirotte, Jean-Michel Renders, and Marco Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on knowledge and data engineering, 19(3), 2007. [63] Jesse Fox, Michael Gilbert, and Wai Yen Tang. Player experiences in a massively multiplayer online game: A diary study of performance, motivation, and social interaction. New Media & Society, 20(11):4056{4073, 2018. [64] Linton C Freeman. Visualizing social networks. Journal of social structure, 1(1):4, 2000. [65] Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeer. Learning probabilistic relational models. In IJCAI, pages 1300{1309, 1999. [66] Yanjie Fu, Bin Liu, Yong Ge, Zijun Yao, and Hui Xiong. User preference learning with multiple information fusion for restaurant recommendation. In Proceedings of the 2014 SIAM International Conference on Data Mining, pages 470{478. SIAM, 2014. [67] Emden R Gansner and Stephen C North. An open graph visualization system and its applications to software engineering. Software Practice and Experience, 30(11):1203{1233, 2000. [68] Johannes Gehrke, Paul Ginsparg, and Jon Kleinberg. Overview of the 2003 kdd cup. ACM SIGKDD Explorations, 5(2), 2003. [69] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectier neural networks. In AISTATS11, page 275, 2011. [70] Long Gong, Yonggang Wen, Zuqing Zhu, and Tony Lee. Toward prot-seeking virtual network embedding algorithm via global resource capacity. In INFOCOM, 2014 Proceedings IEEE, pages 1{9. IEEE, 2014. [71] Palash Goyal, Sujit Rokka Chhetri, and Arquimedes Canedo. dyngraph2vec: Capturing network dynamics using dynamic graph representation learning. Knowledge-Based Systems, 2019. [72] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and perfor- mance: A survey. Knowledge-Based Systems, 151:78{94, 2018. [73] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and perfor- mance: A survey. Knowledge-Based Systems, 2018. [74] Palash Goyal, Homa Hosseinmardi, Emilio Ferrara, and Aram Galstyan. Embedding networks with edge attributes. In Proceedings of the 29th on Hypertext and Social Media, pages 38{42. ACM, 2018. [75] Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. Dyngem: Deep embedding method for dynamic graphs. In IJCAI International Workshop on Representation Learning for Graphs, 2017. [76] Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. Dyngem: Deep embedding method for dynamic graphs. arXiv preprint arXiv:1805.11273, 2018. 171 [77] Palash Goyal, Anna Sapienza, and Emilio Ferrara. Recommending teammates with deep neural networks. In Proceedings of the 29th on Hypertext and Social Media, pages 57{61. ACM, 2018. [78] Aditya Grover and Jure Leskovec. 
node2vec: Scalable feature learning for networks. In Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining, pages 855{864. ACM, 2016. [79] William L Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096, 2016. [80] William L Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017. [81] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017. [82] Xiaofei He and Partha Niyogi. Locality preserving projections. In Advances in neural information processing systems, pages 153{160, 2004. [83] David Heckerman, Chris Meek, and Daphne Koller. Probabilistic entity-relationship models, prms, and plate models. Introduction to statistical relational learning, pages 201{238, 2007. [84] Mikael Hena, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph- structured data. arXiv preprint arXiv:1506.05163, 2015. [85] Keith Henderson, Brian Gallagher, Tina Eliassi-Rad, Hanghang Tong, Sugato Basu, Leman Akoglu, Danai Koutra, Christos Faloutsos, and Lei Li. Rolx: structural role extraction & mining in large graphs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1231{1239. ACM, 2012. [86] Ralf Herbrich, Tom Minka, and Thore Graepel. Trueskill: a bayesian skill rating system. In Advances in neural information processing systems, pages 569{576, 2007. [87] Ivan Herman, Guy Melan con, and M Scott Marshall. Graph visualization and navigation in information visualization: A survey. IEEE Trans on visualization and computer graphics, 6(1):24{43, 2000. [88] Petter Holme and Beom Jun Kim. Growing scale-free networks with tunable clustering. Physical review E, 65(2):026107, 2002. [89] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural networks, 3:551{560, 1990. [90] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic regression, volume 398. John Wiley & Sons, 2013. [91] Homa Hosseinmardi, Amir Ghasemian, Shrikanth Narayanan, Kristina Lerman, and Emilio Ferrara. Tensor embedding: A supervised framework for human behavioral data mining and prediction. arXiv preprint arXiv:1808.10867, 2018. [92] Homa Hosseinmardi, Hsien-Te Kao, Kristina Lerman, and Emilio Ferrara. Discovering hidden structure in high dimensional human behavioral data via tensor factorization. In WSDM Heteronam Workshop, 2018. 172 [93] Weigang Hou, Zhaolong Ning, Lei Guo, and Mohammad S Obaidat. Service degradability supported by forecasting system in optical data center networks. IEEE Systems Journal, 2018. [94] Xiao Huang, Jundong Li, and Xia Hu. Accelerated attributed network embedding. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 633{641. SIAM, 2017. [95] Xiao Huang, Jundong Li, and Xia Hu. Label informed attributed network embedding. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 731{739. ACM, 2017. [96] Matthew Hudson and Paul Cairns. Measuring social presence in team-based digital games. Interacting with Presence: HCI and the Sense of Presence in Computer-mediated Environ- ments, page 83, 2014. [97] Ramon Ferrer i Cancho and Richard V Sol e. The small world of human language. 
Proceedings of the Royal Society of London B: Biological Sciences, 268(1482):2261{2265, 2001. [98] Iaroslav Ispolatov, PL Krapivsky, and A Yuryev. Duplication-divergence model of protein interaction network. Physical review E, 71(6):061911, 2005. [99] Paul Jaccard. Etude comparative de la distribution orale dans une portion des Alpes et du Jura. Impr. Corbaz, 1901. [100] Paul Jaccard. Nouvelles recherches sur la distribution orale. Bull. Soc. Vaud. Sci. Nat., 44:223{270, 1908. [101] Jeroen Jansz and Martin Tanis. Appeal of playing online rst person shooter games. CyberPsychology & Behavior, 10(1):133{136, 2007. [102] Zhenyan Ji, Huaiyu Pi, Wei Wei, Bo Xiong, Marcin Wozniak, and Robertas Damasevicius. Recommendation based on review texts and social communities: A hybrid model. IEEE Access, 2019. [103] Daniel Johnson, Lennart E Nacke, and Peta Wyeth. All about that base: diering player experiences in video game genres and the unique case of moba games. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pages 2265{2274. ACM, 2015. [104] David W Johnson and Roger T Johnson. Cooperation and competition: Theory and research. Interaction Book Company, 1989. [105] David W Johnson, Georey Maruyama, Roger Johnson, Deborah Nelson, and Linda Skon. Eects of cooperative, competitive, and individualistic goal structures on achievement: A meta-analysis. Psychological bulletin, 89(1):47, 1981. [106] Ian T Jollie. Principal component analysis and factor analysis. In Principal component analysis, pages 115{128. Springer, 1986. [107] Dieter Jungnickel and Tilla Schade. Graphs, networks and algorithms. Springer, 2005. [108] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39{43, 1953. 173 [109] Myunghwan Kim and Jure Leskovec. The network completion problem: Inferring missing nodes and edges in networks. In Proceedings of the 2011 SIAM International Conference on Data Mining, pages 47{58. SIAM, 2011. [110] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [111] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. [112] Thomas N Kipf and Max Welling. Semi-supervised classication with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. [113] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016. [114] Bryan Klimt and Yiming Yang. The enron corpus: A new dataset for email classication research. In European Conference on Machine Learning, pages 217{226, 2004. [115] Daniel Kluver, Michael D Ekstrand, and Joseph A Konstan. Rating-based collaborative ltering: algorithms and evaluation. In Social Information Access, pages 344{390. Springer, 2018. [116] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recom- mender systems. Computer, 42(8), 2009. [117] Yubo Kou and Xinning Gui. Playing with strangers: Understanding temporary teams in league of legends. In Proceedings of the rst ACM SIGCHI annual symposium on Computer- human interaction in play, pages 161{169. ACM, 2014. [118] Dmitri Krioukov, Fragkiskos Papadopoulos, Maksim Kitsak, Amin Vahdat, and Mari an Bogun a. Hyperbolic geometry of complex networks. Physical Review E, 82(3):036106, 2010. [119] Joseph B Kruskal and Myron Wish. Multidimensional scaling, volume 11. Sage, 1978. [120] Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 
Statistically signicant detection of linguistic change. In WWW15, pages 625{635. ACM, 2015. [121] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical review E, 78(4):046110, 2008. [122] Richard D Lawrence, George S Almasi, Vladimir Kotlyar, Marisa Viveros, and Sastry S Duri. Personalization of supermarket product recommendations. In Applications of Data Mining to Electronic Commerce, pages 11{32. Springer, 2001. [123] Alex Leavitt, Brian C Keegan, and Joshua Clark. Ping to win?: Non-verbal communication and team performance in competitive online multiplayer games. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 4337{4350. ACM, 2016. [124] Choong-Soo Lee and Ivan Ramler. Investigating the impact of game features and content on champion usage in league of legends. Proceedings of the Foundation of Digital Games, 2015. [125] George Lekakos and Petros Caravelas. A hybrid approach for movie recommendation. Multimedia tools and applications, 36(1):55{70, 2008. [126] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahramani. Kronecker graphs: An approach to modeling networks. Journal of Machine Learning Research, 11(Feb):985{1042, 2010. 174 [127] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densication laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177{187. ACM, 2005. [128] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densication and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):2, 2007. [129] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, 2014. [130] Daniel Levi. Group dynamics for teams. Sage Publications, 2015. [131] Jundong Li, Kewei Cheng, Liang Wu, and Huan Liu. Streaming link prediction on dynamic attributed networks. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 369{377. ACM, 2018. [132] Juzheng Li, Jun Zhu, and Bo Zhang. Discriminative deep random walk for network classi- cation. In ACL (1), 2016. [133] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015. [134] Dawen Liang, Rahul G Krishnan, Matthew D Homan, and Tony Jebara. Variational autoencoders for collaborative ltering. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 689{698. International World Wide Web Conferences Steering Committee, 2018. [135] David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. journal of the Association for Information Science and Technology, 58(7):1019{1031, 2007. [136] Yen-Yu Lin, Tyng-Luh Liu, and Hwann-Tzong Chen. Semantic manifold learning for image retrieval. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 249{258. ACM, 2005. [137] Alexandru Losup, Ruud Van De Bovenkamp, Siqi Shen, Adele Lu Jia, and Fernando Kuipers. Analyzing implicit social networks in multiplayer online games. IEEE Internet Computing, 18(3):36{44, 2014. [138] Linyuan L u and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150{1170, 2011. [139] Qing Lu and Lise Getoor. Link-based classication. 
In ICML, volume 3, pages 496{503, 2003. [140] Dijun Luo, Feiping Nie, Heng Huang, and Chris H Ding. Cauchy graph embedding. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 553{560, 2011. [141] Xiaoke Ma, Penggang Sun, and Yu Wang. Graph regularized nonnegative matrix factorization for temporal link prediction in dynamic networks. Physica A: Statistical mechanics and its applications, 496:121{136, 2018. [142] Laurens van der Maaten and Georey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 9:2579{2605, 2008. 175 [143] Laurens van der Maaten and Georey Hinton. Visualizing high-dimensional data using t-sne. Journal of Machine Learning Research, 9(Nov):2579{2605, 2008. [144] Aleix M Mart nez and Avinash C Kak. Pca versus lda. IEEE transactions on pattern analysis and machine intelligence, 23(2):228{233, 2001. [145] Aleix M Mart nez and Avinash C Kak. Non-negative graph embedding. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 23(2):1{8, 2008. [146] Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naive bayes text classication. In AAAI-98 workshop on learning for text categorization, volume 752, pages 41{48. Citeseer, 1998. [147] Paul A Merolla, John V Arthur, Rodrigo Alvarez-Icaza, Andrew S Cassidy, Jun Sawada, Filipp Akopyan, Bryan L Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668{673, 2014. [148] Benedikt Morschheuser, Juho Hamari, and Alexander Maedche. Cooperation or competition{ when do people contribute more? a eld experiment on gamication of crowdsourcing. International Journal of Human-Computer Studies, 2018. [149] Ramesh M. Nallapati, Amr Ahmed, Eric P. Xing, and William W. Cohen. Joint latent topic models for text and citations. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008. [150] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. graph2vec: Learning distributed representations of graphs. arXiv preprint arXiv:1707.05005, 2017. [151] Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. Graph summarization with bounded error. In Proceedings of the international conference on Management of data, pages 419{432. ACM, 2008. [152] Jennifer Neville and David Jensen. Iterative classication in relational data. In Proc. Workshop on Learning Statistical Models from Relational Data, pages 13{20, 2000. [153] Mark EJ Newman. Clustering and preferential attachment in growing networks. Physical review E, 64(2):025102, 2001. [154] Mark EJ Newman. The structure of scientic collaboration networks. Proceedings of the national academy of sciences, 98(2):404{409, 2001. [155] Mark EJ Newman. A measure of betweenness centrality based on random walks. Social networks, 27(1):39{54, 2005. [156] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical review E, 69(2):026113, 2004. [157] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In Proceedings of the 33rd annual international conference on machine learning. ACM, 2016. [158] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. In Proc. of ACM SIGKDD, pages 1105{1114, 2016. 176 [159] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 
Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1105{1114. ACM, 2016. [160] Christopher C Paige and Michael A Saunders. Towards a generalized singular value decom- position. SIAM Journal on Numerical Analysis, 18(3):398{405, 1981. [161] Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang, and Yang Wang. Tri-party deep network representation. Network, 11(9):12, 2016. [162] Panos M Pardalos and Jue Xue. The maximum clique problem. Journal of global Optimiza- tion, 4(3):301{328, 1994. [163] Hyunsoo Park and Kyung-Joong Kim. Social network analysis of high-level players in multiplayer online battle arena game. In International Conference on Social Informatics, pages 223{226. Springer, 2014. [164] Youngser Park, C Priebe, D Marchette, and Abdou Youssef. Anomaly detection using scan statistics on time series hypergraphs. In Link Analysis, Counterterrorism and Security (LACTS) Conference, page 9, 2009. [165] Georgios A Pavlopoulos, Anna-Lynn Wegener, and Reinhard Schneider. A survey of visual- ization tools for biological network analysis. Biodata mining, 1(1):12, 2008. [166] Michael J Pazzani and Daniel Billsus. Content-based recommendation systems. In The adaptive web, pages 325{341. Springer, 2007. [167] Karl Pearson. Liii. on lines and planes of closest t to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559{572, 1901. [168] Mathew Penrose et al. Random geometric graphs, volume 5. Oxford university press, 2003. [169] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings 20th international conference on Knowledge discovery and data mining, pages 701{710, 2014. [170] Bryan Perozzi, Vivek Kulkarni, and Steven Skiena. Walklets: Multiscale graph embeddings for interpretable network classication. arXiv preprint arXiv:1605.02115, 2016. [171] Dominique C Perrault-Joncas and Marina Meila. Directed graph embedding: an algorithm based on continuous limits of laplacian-type operators. In Advances in Neural Information Processing Systems, pages 990{998, 2011. [172] Nataliia Pobiedina, Julia Neidhardt, Maria del Carmen Calatrava Moreno, and Hannes Werthner. Ranking factors of team success. In Proceedings of the 22nd International Conference on World Wide Web, pages 1185{1194. ACM, 2013. [173] Nataliia Pobiedina, Julia Neidhardt, MC Calatrava Moreno, L Grad-Gyenge, and H Werthner. On successful team formation. Technical report, Technical report, Vienna University of Tech- nology, 2013. Available at http://www. ec. tuwien. ac. at/les/OnSuccessfulTeamFormation. pdf, 2013. [174] Mahmudur Rahman, Tanay Kumar Saha, Mohammad Al Hasan, Kevin S Xu, and Chandan K Reddy. Dylink2vec: Eective feature representation for link prediction in dynamic networks. arXiv preprint arXiv:1804.05755, 2018. 177 [175] Kaspar Riesen, Michel Neuhaus, and Horst Bunke. Graph embedding in vector spaces by means of prototype selection. In International Workshop on Graph-Based Representations in Pattern Recognition, pages 383{393. Springer, 2007. [176] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 833{840, 2011. [177] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465{471, 1978. 
[178] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323{2326, 2000. [179] David E Rumelhart, Georey E Hinton, and Ronald J Williams. Neurocomputing: Founda- tions of research. JA Anderson and E. Rosenfeld, Eds, pages 696{699, 1988. [180] Fatemeh Salehi Rizi, Michael Granitzer, and Konstantin Ziegler. Properties of vector embeddings in social networks. Algorithms, 10(4):109, 2017. [181] Gerard Salton and Michael J McGill. Introduction to modern information retrieval. 1986. [182] Anna Sapienza, Alessandro Bessi, and Emilio Ferrara. Non-negative tensor factorization for human behavioral pattern mining in online games. Information, 9(3):66, 2018. [183] Anna Sapienza, Hao Peng, and Emilio Ferrara. Performance dynamics and success in online games. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 902{909. IEEE, 2017. [184] Anna Sapienza, Yilei Zeng, Alessandro Bessi, Kristina Lerman, and Emilio Ferrara. Individual performance in team-based online games. Royal Society Open Science, 5(6):180329, 2018. [185] Purnamrita Sarkar, Deepayan Chakrabarti, and Michael Jordan. Nonparametric link predic- tion in dynamic networks. arXiv preprint arXiv:1206.6394, 2012. [186] Wolfgang Eugen Schlauch and Katharina Anna Zweig. Social network analysis and gaming: survey of the current state of art. In Joint International Conference on Serious Games, pages 158{169. Springer, 2015. [187] PG Schrader and Michael McCreery. The acquisition of skill and expertise in massively multiplayer online games. Educational Technology Research and Development, 56(5-6):557{ 574, 2008. [188] Blake Shaw and Tony Jebara. Structure preserving embedding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 937{944. ACM, 2009. [189] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on pattern analysis and machine intelligence, 22(8):888{905, 2000. [190] Jiangbo Shu, Xiaoxuan Shen, Hai Liu, Baolin Yi, and Zhaoli Zhang. A content-based recommendation algorithm for learning resources. Multimedia Systems, 24(2):163{173, 2018. [191] Suvrit Sra and Inderjit S Dhillon. Generalized nonnegative matrix approximations with bregman divergences. In Advances in neural information processing systems, pages 283{290, 2006. [192] Constance A Steinkuehler. Learning in massively multiplayer online games. In Proceedings of the 6th international conference on Learning sciences, pages 521{528. International Society of the Learning Sciences, 2004. 178 [193] Constance A Steinkuehler. Cognition and learning in massively multiplayer online games: A critical approach. The University of Wisconsin-Madison, 2005. [194] Xiaoyuan Su and Taghi M Khoshgoftaar. A survey of collaborative ltering techniques. Advances in articial intelligence, 2009:4, 2009. [195] Jimeng Sun, Christos Faloutsos, Spiros Papadimitriou, and Philip S Yu. Graphscope: parameter-free mining of large time-evolving graphs. In KDD07, pages 687{696, 2007. [196] Ilya Sutskever, James Martens, George E Dahl, and Georey E Hinton. On the importance of initialization and momentum in deep learning. In ICML13, pages 1139{1147, 2013. [197] Nitish Talasu, Annapurna Jonnalagadda, S Sai Akshaya Pillai, and Jampani Rahul. A link prediction based approach for recommendation systems. In 2017 international conference on advances in computing, communications and informatics (ICACCI), pages 2059{2062. IEEE, 2017. 
[198] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. Line: Large-scale information network embedding. In Proceedings 24th International Conference on World Wide Web, pages 1067{1077, 2015. [199] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th international conference on Knowledge discovery and data mining, pages 817{826. ACM, 2009. [200] Lei Tang and Huan Liu. Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 1107{1116. ACM, 2009. [201] John M Tauer and Judith M Harackiewicz. The eects of cooperation and competition on intrinsic motivation and performance. Journal of personality and social psychology, 86(6):849, 2004. [202] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319{2323, 2000. [203] Athanasios Theocharidis, Stjin Van Dongen, Anton Enright, and Tom Freeman. Network visualization and analysis of gene expression data using biolayout express3d. Nature protocols, 4:1535{1550, 2009. [204] Andreas Thor, Philip Anderson, Louiqa Raschid, Saket Navlakha, Barna Saha, Samir Khuller, and Xiao-Ning Zhang. Link prediction for annotation graphs using graph summarization. In International Semantic Web Conference, pages 714{729. Springer, 2011. [205] Yuanyuan Tian, Richard A Hankins, and Jignesh M Patel. Ecient aggregation for graph summarization. In Proceedings of the SIGMOD international conference on Management of data, pages 567{580. ACM, 2008. [206] Hannu Toivonen, Fang Zhou, Aleksi Hartikainen, and Atte Hinkka. Compression of weighted graphs. In Proc. 17th international conference on Knowledge discovery and data mining, pages 965{973, 2011. [207] Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, and Maosong Sun. Max-margin deepwalk: Discriminative learning of network representation. In IJCAI, pages 3889{3895, 2016. 179 [208] April Tyack, Peta Wyeth, and Daniel Johnson. The appeal of moba games: What makes people start, stay, and stop. In Proceedings of the 2016 Annual Symposium on Computer- Human Interaction in Play, pages 313{325. ACM, 2016. [209] Ruud Van De Bovenkamp, Siqi Shen, Alexandru Iosup, and Fernando Kuipers. Understanding and recommending play relationships in online social gaming. In Communication Systems and Networks (COMSNETS), 2013 Fifth International Conference on, pages 1{10. IEEE, 2013. [210] Aaron Van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Advances in neural information processing systems, pages 2643{2651, 2013. [211] Charles F Van Loan. Generalizing the singular value decomposition. SIAM Journal on Numerical Analysis, 13(1):76{83, 1976. [212] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Man- zagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371{3408, 2010. [213] Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd International Conference on Knowledge Discovery and Data Mining, pages 1225{ 1234. ACM, 2016. [214] Yuchung J Wang and George Y Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8{19, 1987. [215] Yuchung J Wang and George Y Wong. Stochastic blockmodels for directed graphs. 
Journal of the American Statistical Association, 82(397):8{19, 1987. [216] Stanley Wasserman and Katherine Faust. Social network analysis: Methods and applications, volume 8. Cambridge university press, 1994. [217] Duncan J Watts and Steven H Strogatz. Collective dynamics of small-worldnetworks. nature, 393(6684):440, 1998. [218] Bernard M Waxman. Routing of multipoint connections. IEEE journal on selected areas in communications, 6(9):1617{1622, 1988. [219] Harrison C White, Scott A Boorman, and Ronald L Breiger. Social structure from multiple networks. i. blockmodels of roles and positions. American journal of sociology, 81(4):730{780, 1976. [220] Scott White and Padhraic Smyth. A spectral clustering approach to nding communities in graphs. In Proceedings of the 2005 SIAM international conference on data mining, pages 274{285. SIAM, 2005. [221] John Wright, Yi Ma, Julien Mairal, Guillermo Sapiro, Thomas S Huang, and Shuicheng Yan. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE, 98(6):1031{1044, 2010. [222] Bang Xia, Huiwen Wang, and Ronggang Zhou. What contributes to success in moba games? an empirical study of defense of the ancients 2. Games and Culture, page 1555412017710599, 2017. 180 [223] Huan Xu, Yujiu Yang, Liangwei Wang, and Wenhuang Liu. Node classication in social network via a factor graph model. In Pacic-Asia Conference on Knowledge Discovery and Data Mining, pages 213{224. Springer, 2013. [224] Xiaowei Xu, Nurcan Yuruk, Zhidan Feng, and Thomas AJ Schweiger. Scan: a structural clustering algorithm for networks. In Proceedings 13th international conference on Knowledge discovery and data mining, pages 824{833, 2007. [225] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE transactions on pattern analysis and machine intelligence, 29(1):40{51, 2007. [226] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y Chang. Network representation learning with rich text information. In IJCAI, pages 2111{2117, 2015. [227] Pu Yang, Brent E Harrison, and David L Roberts. Identifying patterns in combat that are predictive of success in moba games. In FDG, 2014. [228] Shuo Yang, Tushar Khot, Kristian Kersting, and Sriraam Natarajan. Learning continuous- time bayesian networks in relational domains: A non-parametric approach. In Thirtieth AAAI Conference on Articial Intelligence, 2016. [229] Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. arXiv preprint arXiv:1603.08861, 2016. [230] Zhilin Yang, Jie Tang, and William W Cohen. Multi-modal bayesian embeddings for learning social knowledge graphs. In IJCAI, pages 2287{2293, 2016. [231] Nick Yee. Motivations for play in online games. CyberPsychology & behavior, 9(6):772{775, 2006. [232] Zhitao Ying, Jiaxuan You, Christopher Morris, Xiang Ren, Will Hamilton, and Jure Leskovec. Hierarchical graph representation learning with dierentiable pooling. In Advances in Neural Information Processing Systems, pages 4800{4810, 2018. [233] Ossama Younis, Marwan Krunz, and Srinivasan Ramasubramanian. Node clustering in wireless sensor networks: recent developments and deployment challenges. IEEE network, 20(3):20{25, 2006. [234] Kai Yu, Wei Chu, Shipeng Yu, Volker Tresp, and Zhao Xu. Stochastic relational models for discriminative link prediction. In NIPS, pages 1553{1560, 2006. [235] Wayne W Zachary. 
An information ow model for con ict and ssion in small groups. Journal of anthropological research, 33(4):452{473, 1977. [236] Menglu Zeng, Wenjian Fang, and Zuqing Zhu. Orchestrating tree-type vnf forwarding graphs in inter-dc elastic optical networks. Journal of Lightwave Technology, 34(14):3330{3341, 2016. [237] Yilei Zeng, Anna Sapienza, and Emilio Ferrara. The in uence of social ties on performance in team-based online games. arXiv preprint arXiv:1812.02272, 2018. [238] Daokun Zhang, Jie Yin, Xingquan Zhu, and Chengqi Zhang. Homophily, structure, and content augmented network representation learning. In Data Mining (ICDM), 2016 IEEE 16th International Conference on, pages 609{618. IEEE, 2016. 181 [239] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, pages 5165{5175, 2018. [240] Ziwei Zhang, Peng Cui, Jian Pei, Xiao Wang, and Wenwu Zhu. Timers: Error-bounded svd restart on dynamic networks. arXiv preprint arXiv:1711.09541, 2017. [241] Le-kui Zhou, Yang Yang, Xiang Ren, Fei Wu, and Yueting Zhuang. Dynamic network embedding by modeling triadic closure process. In AAAI, 2018. [242] Yang Zhou, Hong Cheng, and Jerey Xu Yu. Graph clustering based on structural/attribute similarities. Proceedings of the VLDB Endowment, 2(1):718{729, 2009. [243] Linhong Zhu, Dong Guo, Junming Yin, Greg Ver Steeg, and Aram Galstyan. Scalable temporal latent space inference for link prediction in dynamic social networks. IEEE Transactions on Knowledge and Data Engineering, 28(10):2765{2777, 2016. [244] Linhong Zhu, Dong Guo, Junming Yin, Greg Ver Steeg, and Aram Galstyan. Scalable temporal latent space inference for link prediction in dynamic social networks. IEEE Transactions on Knowledge and Data Engineering, 28(10):2765{2777, 2016. 182