Efficient Graph Learning: Theory and Performance Evaluation

by

Tian Xie

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

August 2022

Copyright 2022 Tian Xie

Acknowledgements

First and foremost, I would like to express my deepest gratitude to my advisor, Prof. C.-C. Jay Kuo, for his guidance, encouragement, and support throughout my Ph.D. study. Prof. Kuo accepted me into his group when I was confused and struggling to find a suitable advisor. Without him, I would never have come this far in my Ph.D. journey. His wisdom in pursuing research rather than chasing hot topics showed me what a real researcher should be. His passion and persistence inspire me to explore the essential nature of problems. His sense of responsibility and his instruction guide how I will conduct myself in my future career. Prof. Kuo is a lifelong role model for me.

I would also like to thank the rest of my committee for the qualifying and defense exams, Prof. Antonio Ortega, Prof. Rajgopal Kannan, Prof. Aiichiro Nakano, Prof. Mahdi Soltanolkotabi, Prof. Xiang Ren, and Prof. Bhaskar Krishnamachari, for their helpful suggestions on my research. In particular, I would like to thank Prof. Kannan for the weekly discussions that lasted for one year and for all the insightful questions and advice. I would also like to thank Prof. Tianshu Sun and Prof. Cyrus Shahabi for their guidance at the beginning of my Ph.D. career.

I would like to thank all the members of the Media Communications Laboratory (MCL): Dr. Bin Wang, Dr. Kaitai Zhang, Dr. Yeji Shen, Dr. Mozhdeh Rouhsedaghat, Zohreh Azizi, Hong-Shuo Chen, Hongyu Fu, Xiou Ge, Pranav Kadam, Xuejing Lei, Vasileios Magoulianitis, Zhanxuan Mei, Mahtab Movahhedrad, Wei Wang, Xinyu Wang, Yifan Wang, Yun Cheng Wang, Chengwei Wei, Yijing Wang, Min Zhang, Ganning Zhao, Qingyang Zhou, Zhiruo Zhou, Yao Zhu, and all others who are not listed here. MCL is like a big family from which I have received great support and encouragement, and I really enjoyed our time together. In particular, I would like to thank my collaborator, Dr. Bin Wang, for his great support in my research. I would also like to thank my collaborator Dr. Chaoyang He, who inspired my graph learning research.

During my Ph.D., I had wonderful internships at Tencent AI Lab and Facebook. I would like to express my gratitude to my mentors and collaborators there: Prof. Junzhou Huang, Dr. Yu Rong, Dr. Tingyang Xu, and Dr. Wenbing Huang at Tencent AI Lab; Jane Hu and Dr. Bingjun Sun at Facebook. I really enjoyed the time working with them, and I learned what excellent researchers do in industry.

Finally, I would like to thank my father Shijie Xie, mother Qingfen Tian, and brother Di (Deshawn) Xie for their eternal love and support. My family is always the backing that encourages me to get through the darkest periods of my life.

Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract

Chapter 1: Introduction
1.1 Significance of the Research
1.2 Contributions of the Research
1.2.1 Layerwise Trained Bipartite Graph Neural Network
1.2.2 Enhanced Label Propagation for Node Classification
1.2.3 New Insights into GraphHop and Its Enhancement
1.3 Organization of the Dissertation

Chapter 2: Research Background
2.1 Machine Learning Tasks on Graphs
2.2 Graph-based Semi-supervised Learning
2.3 Graph Neural Networks
2.4 Generative Adversarial Networks

Chapter 3: Layerwise Trained Bipartite Graph Neural Network
3.1 Introduction
3.2 Problem Formulation and Background
3.3 Proposed L-BGNN Method
3.3.1 System Overview
3.3.2 Interdomain Message Passing
3.3.3 Intradomain Alignment
3.3.4 Layerwise Training
3.3.5 Complexity Analysis
3.3.6 Theoretical Analysis
3.4 Experiments
3.4.1 Experimental Setup
3.4.2 Performance Evaluation
3.4.3 Embedding Visualization
3.4.4 Effect of Intradomain Alignment
3.4.5 Effect of Layerwise Training
3.4.6 Efficiency for Large-scale Bipartite Graphs
3.5 Related Work
3.6 Conclusion

Chapter 4: Enhanced Label Propagation for Node Classification
4.1 Introduction
4.2 Preliminaries
4.2.1 Problem Statement
4.2.2 Assumption
4.2.3 Smoothening Operations
4.3 GraphHop Method
4.3.1 Initialization Stage
4.3.2 Iteration Stage
4.4 Analysis
4.4.1 Convergence Analysis
4.4.2 Complexity Analysis
4.5 Experiments
4.5.1 Datasets
4.5.2 Experimental Settings
4.5.3 Performance Evaluation
4.5.4 Computational Complexity and Memory Requirement
4.5.5 Additional Observations
4.6 Comments on Related Work
4.7 Applications and Improvements
4.8 Conclusion

Chapter 5: New Insights into GraphHop and Its Enhancement
5.1 Introduction
5.2 Preliminaries
5.2.1 Notations and Problem Statement
5.2.2 Transductive Label Propagation
5.2.3 GraphHop
5.3 Understanding GraphHop via a Regularization Framework
5.3.1 Constrained Optimization Framework
5.3.2 Alternate Optimization
5.4 GraphHop++
5.4.1 Enhancement in Classifier Training
5.4.2 Enhancement in Label Embeddings Update
5.5 Experiments
5.5.1 Datasets
5.5.2 Experimental Settings
5.5.3 Performance Evaluation
5.5.4 Convergence Analysis
5.5.5 Complexity and Memory Requirements
5.5.6 Ablation Study
5.5.7 Application to Object Recognition
5.6 Related Work
5.7 Conclusion

Chapter 6: Conclusion and Future Work
6.1 Summary of the Research
6.2 Future Research Directions
6.2.1 Efficient Graph Learning Methods
6.2.2 Learning across Domains

Bibliography

Appendices
C L-BGNN: Proof and Experimental Details
C.1 Proof of Theorem 2
C.2 Hyperparameter Tuning
D GraphHop: Implementation of LR Classifiers
E GraphHop++: Theory and Proposition Proof

List of Tables

3.1 Statistics of eight datasets used in our experiments.
3.2 Performance comparison for the node classification task, where OOM denotes "out of memory".
3.3 Performance comparison for the link prediction task.
3.4 Performance comparison of L-BGNN-Adv against simple domain alignment, where the performance metric is the Micro F1 score.
3.5 Performance comparison of L-BGNN-Adv and L-BGNN-MLP, where the performance metric is the Micro F1 score.
3.6 Performance comparison of two-layer end-to-end training with layerwise training for L-BGNN-Adv, where the performance metrics are the F1 score for Group, and the Micro (first number) and Macro (second number) F1 scores for Cora, CiteSeer, and PubMed.
4.1 Statistics of six representative datasets used in experiments.
4.2 The training, validation, and testing splits used in the experiments for large-scale graphs, where the node numbers and the corresponding percentages (in brackets) are listed.
4.3 Grid search ranges of the hyperparameters.
4.4 Hyperparameters used in the experiments for the largest label rate.
4.5 Classification accuracy (%) for three citation datasets with different label rates. The highest accuracy in each column is highlighted in bold and the top three are underlined.
4.6 Classification accuracy (%) for three large-scale graph datasets, where the column of labeled samples is measured in terms of percentages of the entire dataset and OOM means "out of memory".
4.7 Comparison of training time and GPU memory usage. The average running time per epoch/iteration and the total running time are given outside and inside the parentheses, respectively.
4.8 Comparison of the accuracy performance of GraphHop and its two variants.
5.1 Benchmark dataset properties and statistics.
5.2 Test accuracy for the Cora dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third are underlined.
5.3 Test accuracy for the CiteSeer dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third are underlined.
5.4 Test accuracy for the PubMed dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third are underlined.
5.5 Test accuracy for the Amazon Photo dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third are underlined.
5.6 Test accuracy for the Coauthor CS dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third are underlined.
5.7 Test accuracy with three label rates measured by "mean accuracy (%) ± standard deviation", where the highest mean accuracy is marked in bold and the second and third are underlined.

List of Figures

1.1 Graphs are a more general data structure than images. (a) The original image in the Euclidean space. (b) The shuffled image fails to be recognized. (c) The original graph with a specific ordering. (d) The shuffled ordering results in the same graph.
2.1 Common machine learning tasks on graphs. Different colors denote different labels. (a) Node classification. (b) Link prediction. (c) Recommendation.
2.2 A toy example of the label propagation algorithm on the two-moons pattern. The two colors denote two different classes. The initial labeled points are represented as solid shapes and the model predictions as hollow shapes. The figure is from [184].
2.3 A graph with positive and negative signals shown as red and blue bars, respectively.
2.4 The graph Laplacian eigenvectors associated with different eigenvalues, where λ_0 < λ_1 < λ_50. Clearly, f_0 varies slowly compared to f_50. The figure is from [129].
2.5 Graph Convolutional Network. The figure is from [66].
2.6 An illustration of distribution mapping in GAN. The black dotted line denotes the data distribution, the blue dashed line is the discriminator distribution, and the green solid line is the generator distribution. The figure is from [37].
3.1 A bipartite graph for an e-commerce system, where node entities in two disjoint sets follow different feature distributions and intradomain and interdomain neighbors are connected by red and blue edges, respectively.
3.2 The power-law distribution of node degrees in a user-item network from the Amazon dataset [53].
3.3 An overview of the L-BGNN method. Given inputs X_u and X_v from two domains, we obtain their embeddings H_u and H_v via IDMP and IDA. To aggregate the multi-hop neighbor information, we stack multiple layers to form a deep L-BGNN, which is trained in a layerwise manner.
3.4 Comparison of layerwise training with conventional end-to-end training, where both networks have a two-layer structure. (a) Layerwise training. (b) End-to-end training.
3.5 Performance comparison of the recommendation task for the MovieLens dataset.
3.6 Visualization of embedding results on the Yelp dataset, where colors are used to indicate different business categories. (a) BiNE. (b) HeGAN. (c) metapath2vec. (d) L-BGNN-Adv.
3.7 The training loss of L-BGNN-MLP as a function of the iteration number for the PubMed dataset.
3.8 The training loss of L-BGNN-Adv as a function of the iteration number for the PubMed dataset.
3.9 Performance curves as a function of the layer number for the node classification task, where the x-axis denotes the number of layers and the y-axis is the F1 score.
3.10 Memory cost and training time on the Group dataset. The x-axis in the left figure denotes the wall-clock time in seconds, whereas the y-axis in both figures is the memory cost. The short blue line of L-BGNN and the orange line of AS-GCN indicate that training has finished, whereas the training time of GraphSAGE is too long to be shown.
4.1 The plot of accuracy curves based on lower-frequency (in blue) and higher-frequency (in red) components for citation datasets, respectively. The x-axis denotes the selected top-k lower- or higher-frequency components in percentage, and the top 20% low-frequency region is shaded in green.
4.2 An overview of the proposed GraphHop method, where the left subfigure illustrates the initialization stage, the middle subfigure shows the label aggregation and label update steps in the iteration stage, and the converged result is given in the right subfigure. Red and blue arrows denote the adaptation of the same framework in the initialization stage and iteration stage, respectively. Note that unique classifiers are applied to multi-hop aggregation with different M values shown in the middle.
4.3 Illustration of attribute or label embedding aggregation in GraphHop, where multi-hop embeddings are averaged and then concatenated to form the input to the LR classifier. The example in the figure gives the aggregation of the center green node with M = 2, where the red and blue regions correspond to one-hop and two-hop neighbors, respectively.
4.4 GraphHop's test accuracy curves as a function of the iteration number under different label rates, where the label rate is expressed as the number and the percentage of labeled nodes per class for citation and large-scale graph datasets, respectively. (a) Cora dataset. (b) CiteSeer dataset. (c) PubMed dataset. (d) Reddit dataset. (e) Amazon2M dataset. (f) PPI dataset.
4.5 Training loss curves of the LR classifiers of GraphHop as a function of the epoch number for six datasets. (a) Cora dataset. (b) CiteSeer dataset. (c) PubMed dataset. (d) Reddit dataset. (e) Amazon2M dataset. (f) PPI dataset.
4.6 (a) Comparison of time complexity vs. memory usage of different methods on Reddit, where the lower left corner indicates the desired region of low training complexity and low GPU memory consumption. (b) The trade-off between training time and memory usage obtained by varying the training minibatch size on Reddit for GraphHop, where the green curve indicates the training time and the orange curve indicates the memory usage.
4.7 The convergence curves of GraphHop and Variant II on citation datasets, where the inner figures show the initial 10 iterations.
4.8 The convergence of GraphHop's test accuracy curves with (right) and without (left) residual connections for Reddit, where shaded areas indicate the standard deviation range.
4.9 Plots of the cumulative percentages of nodes in the subset V_1 ∪ V_2 ∪ ... ∪ V_k, where different colors indicate different label rates.
5.1 An overview of the GraphHop++ method, where the left subfigure shows the initialization for the subsequent process while the right subfigure depicts the alternate optimization process. The latter consists of: 1) iterations for the label embeddings update and 2) classifier training for the classifier parameters update.
5.2 Illustration of growing curriculum sets in the alternate optimization process: (a) the original graph, (b) and (c) selected reliable nodes in the curriculum set in the first and second rounds of the optimization process. Two labeled nodes of two classes are colored in blue and red; unlabeled nodes and nodes in the curriculum set are colored in gray and green, respectively. The dotted ellipses show the corresponding curriculum sets of the two labeled nodes.
5.3 Convergence results of the label embeddings for the five benchmarking datasets: (a) Cora, (b) CiteSeer, (c) PubMed, (d) Amazon Photo, and (e) Coauthor CS. The x-axis is the number of alternate optimization rounds and the y-axis is the test accuracy (%). Different curves show the mean accuracy values under different label rates, and the shaded areas represent the standard deviation.
5.4 Convergence results of the LR classifiers for the benchmarking datasets: (a) Cora, (b) CiteSeer, (c) PubMed, (d) Amazon Photo, and (e) Coauthor CS. The x-axis is the number of training epochs and the y-axis is the training loss. The label rate is 20 labels per class.
5.5 Comparison of the computational efficiency of different methods measured by log(seconds) for the Cora, CiteSeer, PubMed, Amazon Photo, and Coauthor CS datasets, where the label rate is 20 labeled samples per class and GraphHop++(20) is the result of GraphHop++ with 20 alternate optimization rounds.
5.6 Comparison of the GPU memory usage of different methods measured by log(megabytes) for the Cora, CiteSeer, PubMed, Amazon Photo, and Coauthor CS datasets, where the label rate is 20 labeled samples per class.
5.7 Performance of GraphHop++ with various hyperparameter settings for the CiteSeer dataset, where the z-axis is the accuracy result and the label rate is one label per class: (a) performance of T and β with fixed α, (b) performance of T and α with fixed β, (c) performance of α and β with fixed T.
5.8 Performance of GraphHop++ with an increasing number of iterations for five datasets: (a) Cora, (b) CiteSeer, (c) PubMed, (d) Amazon Photo, and (e) Coauthor CS, where the x-axis is the number of alternate optimization rounds and the y-axis is the test accuracy (%). The averaged accuracy under one specific number of iterations is represented as dots and the standard deviation as vertical bars. Different lines denote results under different label rates.
5.9 Performance comparison of GraphHop++ and GraphHop-V under different label rates: (a) Cora, (b) CiteSeer, (c) PubMed, (d) Amazon Photo, and (e) Coauthor CS.
5.10 Illustration of several exemplary images of the COIL20 dataset.
6.1 Data from distinct domains can have exactly the same representation. Left: a text and its parse tree; the figure is from [3]. Right: an image, i.e., a grid-like graph.

Abstract

Graphs are generic data representation forms that effectively describe the geometric structures of data domains in various applications, e.g., social, sensor, transportation, communication, and citation networks. Graph learning, which learns knowledge from such graph-structured data, is an important machine learning application on graphs. Examples range from collective classification, such as document classification in academic graphs, to relational learning, such as item recommendation in e-commerce networks. The current state-of-the-art graph learning algorithms, graph neural networks (GNNs), have shown superior performance to traditional methods in tasks such as semi-supervised node classification, link prediction, and representation learning.
Although significant progress has been achieved in numerous graph learning applications, there is still a wide range of problems where either GNNs' applicability has not been explored or is restricted by internal deficiencies of the GNN framework, such as interpretability, scalability, and label efficiency. In this dissertation, we investigate and propose new graph learning techniques from the aspects of graph convolutional networks, graph signal processing, and regularization frameworks, which generalize the GNN application and identify a new path toward solving graph learning problems with an efficiency and effectiveness co-design.

We first extend GNNs to the specific bipartite graph structure for representation learning. The main challenges are that 1) bipartite graphs have two different but correlated feature domains, and 2) unsupervised network embedding and computational efficiency should be considered simultaneously. Accordingly, we propose an efficient layerwise-trained bipartite graph neural network (L-BGNN), which employs a customized message passing on bipartite networks followed by an adversarial domain message alignment. The adversarial training enables L-BGNN to learn node representations in an unsupervised manner without label supervision. In addition, L-BGNN adopts a layerwise training mechanism that can be efficiently generalized to large-scale bipartite graphs without performance deterioration. Extensive experiments on networks of various scales and numerous downstream tasks verify the superior performance of the proposed L-BGNN method.

Then, we propose a node classification algorithm named GraphHop, which is label efficient, with dominant performance at extremely small label rates, and which can be directly generalized to large-scale networks with low memory cost and fast running time. We regard node attributes and labels as signals on graphs, to which designed low-pass filters are respectively applied for signal smoothening. The two signal types are connected through classifier predictions. This separation of signal smoothening and feature-space transformation reduces the memory storage of parameters, enabling generalization to large graphs. In addition, classifiers are introduced that are trained on local neighborhoods and make predictions for all nodes, serving as further smoothening of the label signals. The effective low-pass filtering of the graph signals enhances the capability under extremely limited label rates.

Finally, we derive a different insight into the GraphHop method from a regularization framework. We show that the GraphHop model can be approximately cast as an alternate optimization of a particular regularization function on graphs. Then, based on this variational interpretation, we propose two approaches to address the approximation in the optimization of the GraphHop method. Experiments show that, equipped with these two improvements, our model, called GraphHop++, achieves significantly better performance than the former GraphHop model and state-of-the-art methods on various benchmarking datasets and in an application to object recognition with extremely limited labels.

Chapter 1
Introduction

1.1 Significance of the Research

Graphs are a ubiquitous data structure that captures a set of objects (i.e., nodes) and their relationships (i.e., edges) efficiently.
Many real-world data structures from various domains can be readily abstracted into the same graph topology, e.g., social networks [123], molecular structures [25], recommender systems [127], knowledge graphs [81], and many other applications [65, 153]. Apart from representing structured knowledge, graphs also serve a key role in many machine learning tasks. Graph-structured data are often used as feature information for machine learning applications to make predictions or discover new patterns, such as node classification, link prediction, and community detection.

The critical point in machine learning on graphs is finding a way to encode graph topology information through model learning. For example, in a node classification problem, incorporating the local neighbor structure and the global graph topology is essential for making accurate predictions. Alternatively, in the case of recommendation, nodes sharing similar topological neighbors, e.g., a number of commonly purchased items, often have the same shopping preference, which leads to proper recommendations based on these structural similarities. However, graph learning is a nontrivial and even challenging task. First, the number of nodes in a graph can be variable, which poses a great challenge for traditional machine learning models that can only take fixed-size inputs. Besides, the number of nodes is often large, e.g., millions of people in a single social network, which further challenges model scalability. Second, graphs are permutation-invariant and sit in non-Euclidean (e.g., manifold) domains [185]. For example, image pixels follow a fixed order, as shown in Fig. 1.1a, and randomly shuffling the pixels changes the image entirely (Fig. 1.1b). The same does not hold for graphs. Instead, permuting the node indices may result in the same graph structure, as shown in Figs. 1.1c and 1.1d, which brings additional challenges to distinguishing graphs. Third, a graph contains rich node and edge information beyond the topology, which is extremely hard to extract and learn for the learning tasks.

Figure 1.1: Graphs are a more general data structure than images. (a) The original image in the Euclidean space. (b) The shuffled image fails to be recognized. (c) The original graph with a specific ordering. (d) The shuffled ordering results in the same graph.

Graphs not only can represent concrete entities in the real world but can also reveal the underlying manifold structure of high-dimensional data. Early researchers exploited this property by conducting semi-supervised node classification on graphs generated from the data. The fundamental hypothesis is the manifold assumption in semi-supervised learning, where data points on the same low-dimensional manifold should have the same label [15]. All methods in this field can be summarized under a general regularization framework for variational optimization, where different regularization functions are designed in the label embedding space. Instead of directly solving the optimization problem, the optimum can also be equivalently derived from an iterative process named label propagation (LP) [184, 192], which iteratively propagates label values until convergence. Although these graph-based methods have shown superior performance on semi-supervised node classification tasks and the capability of generalizing to large-scale networks, challenges appear in modern scenarios where graphs often come with rich node and edge information.
In general, node attributes are not, or not effectively, encoded jointly into model learning. The node features are employed only for graph generation, while their distribution with respect to the graph structure is often ignored.

Instead of directly deriving node label predictions, another line of research proposes to learn general node representations from the graph topology, where the encoded low-dimensional feature embeddings are then utilized in downstream machine learning tasks such as node classification and link prediction. The idea is that with concrete and better node representations, higher performance can be achieved in downstream tasks with fewer labeled samples for supervision [7]. One approach to learning node representations, matrix factorization [4, 14], is inspired by classic dimensionality reduction techniques. The edge connection matrix (i.e., the adjacency matrix) is decomposed into the product of two lower-dimensional matrices, which serve as the node representations. Another line of work follows the success of word embedding [95], where a skip-gram model is applied to random walk sequences generated from the graph [113, 133]. Both types of approaches have shown great improvement in the representation capacity for graph topology. However, they still suffer from several problems, such as shallow embeddings [41], difficulties in generalization, and long training times.

Recently, the breakthrough of neural networks and deep learning has drawn great attention to applications on graph-structured data. The pioneering work named graph convolutional network (GCN) [66] proposed a convolutional layer tailored to graph structure. Each layer applies a transformation and propagation to the node embeddings, followed by a nonlinear activation function. The parameters are learned from label supervision by gradient backpropagation, which falls into the common end-to-end training framework. The combination of graph regularization and feature transformation, trained jointly with label supervision, allows GCN and its variants to achieve state-of-the-art performance on semi-supervised node classification, node representation learning, network embedding, etc. [40, 67, 111, 169, 177]. Despite the success of graph neural networks (GNNs) on specific problems, GNNs still either have not found applicability in many other important graph learning problems or are held back by deficiencies inherited from the neural network architecture.

Apart from designing superior methods for graph learning, the training cost should also be considered. Nowadays, the success of deep learning [74] and neural networks comes at the cost of large amounts of labeled data and long training times with large model sizes. There is a waste of resources in human labeling, training time, and memory usage. Efficiency is especially critical in the graph research area due to the enormous size of a single graph. For example, a current social network usually contains millions of users (i.e., nodes) and relations (i.e., edges). Labeling them all, or even a tiny portion, is impractical (compare the image dataset CIFAR-10 [71], which only contains tens of thousands of images). Semi-supervised [15, 135] and unsupervised learning [42] are essential, as they can learn from the ample available unlabeled data to achieve the same performance as supervised learning with less or no label supervision. Therefore, how to design models that do not rely heavily on large amounts of labeled data and, at the same time, remain lightweight in model size is a vital and compelling task.
In this dissertation, we explore and design new graph learning algorithms from the aspects of graph convolutional networks, graph signal processing, and regularization frameworks, which simultaneously achieve state-of-the-art performance and low resource consumption.

1.2 Contributions of the Research

1.2.1 Layerwise Trained Bipartite Graph Neural Network

Directly applying an ordinary GNN to bipartite graphs is nontrivial and suboptimal. A unique challenge is that the two disjoint sets in a bipartite graph often come with different but correlated feature domain distributions, where a simple aggregation operator, such as summation, is incapable of encoding the complex correlation. To address this challenge, we propose an adversarial method [37] to align the different domain distributions with dedicated message passing between the node sets. This adversarial training effectively encodes both the graph topology and the node attributes, enabling node representations to be learned in an unsupervised way without label supervision. In addition, we propose a layerwise training mechanism that efficiently scales our method to large-scale networks without performance degradation in theory. The contributions are summarized below.

• We propose a novel L-BGNN method to effectively deal with domain inconsistency in bipartite graph representation learning; it can be efficiently scaled to large-scale bipartite graphs through a layerwise training mechanism.

• We provide a theoretical analysis of the effect of domain alignment and show that our proposed method can effectively encode the information from the two domains in the final node embeddings.

• We conduct experiments to demonstrate the performance of L-BGNN on several public graph datasets and one large-scale network through a series of downstream tasks. Results show that L-BGNN consistently and significantly outperforms various state-of-the-art methods.

1.2.2 Enhanced Label Propagation for Node Classification

Node attributes and labels can also be regarded as signals on graphs. We first analyze the spectral distributions of these two types of signals, which results in a general assumption about the dataset networks. Accordingly, we propose an enhanced label propagation algorithm that smoothens both attribute and label signals. In particular, the initialization stage conducts attribute signal smoothening and adopts logistic regression (LR) classifiers for label prediction. Then, an iteration stage applies label signal smoothening to the initialized inference, which yields an LP-like mechanism. To strengthen the propagation effect at extremely low label rates, LR classifiers are introduced that train on local neighborhoods and infer the label embeddings for the following iterations. This classifier training enables parameter sharing of the label embeddings and effectively propagates information over longer distances. The simple LR classifier training and propagation design make our method scalable to large-scale networks. We summarize the contributions below.

• We propose an enhanced LP-based method, called GraphHop, which conducts joint attribute and label smoothening on graphs through regression classifiers.

• We provide theoretical justification, with some approximation, for the fact that GraphHop converges faster than traditional LP.

• We empirically verify that the collaborative model design can address the three weaknesses of traditional LP.
• We conduct extensive experiments to validate GraphHop's effectiveness at small label rates and on large-scale graphs against state-of-the-art GCN algorithms. To the best of our knowledge, we are the first to show that an enhanced LP-based method can outperform GCN baselines on several well-known benchmarking datasets.

1.2.3 New Insights into GraphHop and Its Enhancement

Although the superior performance of the GraphHop model can be explained intuitively by the joint smoothening of node attributes and labels, a rigorous mathematical treatment has been lacking. We dive deeper into understanding the GraphHop method by analyzing it from a constrained optimization viewpoint. We show that GraphHop offers an alternate optimization of a specific regularization problem defined on graphs. Based on this interpretation, we propose two ideas to improve GraphHop, one for each of the two alternate subproblems. The contributions are summarized below.

1. We analyze GraphHop theoretically from a variational viewpoint and show that it corresponds to an alternate optimization process that provides a solution to a regularized optimization problem.

2. Based on the theoretical analysis, we propose two enhancements to GraphHop, which lead to an even more powerful semi-supervised solution called GraphHop++.

3. We demonstrate the effectiveness and efficiency of GraphHop++ with extensive experiments on five commonly used datasets, as well as an object recognition task, at extremely low label rates.

1.3 Organization of the Dissertation

The rest of the dissertation is organized as follows. In Chapter 2, we review the background, including graph-based semi-supervised learning, graph convolutional networks, and adversarial learning. In Chapter 3, we propose a representation learning method tailored to bipartite graphs without label supervision, where domain inconsistency and network scale are jointly considered in the model design. In Chapter 4, we propose an enhanced label propagation method named GraphHop for node classification, based on our assumption about the dataset networks. The method achieves state-of-the-art performance compared to GNNs at extremely low label rates and can be directly and efficiently generalized to large-scale graphs. In Chapter 5, we investigate the achievement of GraphHop from the viewpoint of regularization frameworks. Based on this variational interpretation, we propose two enhancements to the approximation in the alternate optimization procedure. Finally, concluding remarks and future research directions are given in Chapter 6.

Chapter 2
Research Background

In this chapter, we give a background introduction related to our research. First, we introduce some common machine learning tasks on graphs. Then, we discuss a general regularization framework and its label propagation (LP) solution in graph-based semi-supervised learning, which provides an efficient method for the node classification task. Next, we introduce the state-of-the-art graph convolutional network (GCN), a generalization of neural networks to graph-structured data based on the analysis of graph signals. Finally, we discuss an important generative method built on the success of neural networks, the generative adversarial network (GAN), which we have adopted in our model design.

2.1 Machine Learning Tasks on Graphs

Graphs exist in very different application domains, e.g., social networks, computer vision, sensor networks, and biological networks, where a generic graph representation can be derived from the structure present in the datasets.
However, different research domains may have distinct objectives for graph learning. For example, in graph signal processing [108], the goals are mainly signal compression, denoising, or the reconstruction of signals on graphs, while in machine learning and data mining, researchers focus more on how to encode the graph data into model learning and make inferences about the corresponding graph properties. Here, we mainly introduce some classical machine learning tasks on graphs that are also related to this dissertation.

• Node classification. In this context, each node represents one data point with which a label is associated. The objective is to predict the label of a given node, as shown in Fig. 2.1a. The graph can be regarded as side information that encodes the relationships among the data examples. Higher accuracy can be achieved by embedding this relational information into model learning.

• Link prediction. Given a graph, one intuitive task is to decide whether two currently unlinked nodes should be linked, as shown in Fig. 2.1b, i.e., to learn the criteria for connections. The basic assumption is that linked nodes share more similarities, e.g., neighborhood structure, attributes, or global positions, than non-linked nodes.

• Recommendation. Apart from predicting connections between two nodes, a ranking of similarities among all nodes is also important in real-world applications, such as movie or item recommendation. Link prediction can be regarded as a special case in which only the top-one prediction is considered for linking. Fig. 2.1c gives one example of recommendation for the two blue-colored nodes.

Figure 2.1: Common machine learning tasks on graphs. Different colors denote different labels. (a) Node classification. (b) Link prediction. (c) Recommendation.

More tasks can be formulated on graphs, e.g., community detection, network similarity, and graph generation. We refer to [3] for more frontier machine learning tasks on graphs.

2.2 Graph-based Semi-supervised Learning

Semi-supervised learning is halfway between supervised and unsupervised learning. In addition to the labeled data used in supervised learning, the algorithm is provided with a portion of unlabeled examples generated from the marginal distribution. In this case, the dataset X = {x_i}_{i∈[n]}, x_i ∈ \mathcal{X}, can be divided into two parts: the points X_l := {x_1, x_2, ..., x_l} with corresponding labels Y_l := {y_1, y_2, ..., y_l}, y_i ∈ \mathcal{Y}, |\mathcal{Y}| = c, and the points X_u := {x_{l+1}, x_{l+2}, ..., x_n}, for which the labels are unknown, so that n = l + u. In most practical settings, the unlabeled samples far outnumber the labeled ones, i.e., u ≫ l. The objective of semi-supervised learning is to learn a function f : \mathcal{X} → \mathcal{Y}, f ∈ \mathcal{F}, given both the labeled and unlabeled data.

The success of semi-supervised learning, i.e., a perfect mapping f : \mathcal{X} → \mathcal{Y}, relies heavily on whether the unlabeled data are relevant for the classification problem. To yield an improvement over supervised learning, the unlabeled samples generated from p(x), x ∈ \mathcal{X}, should carry information that is useful for the inference of p(y|x), y ∈ \mathcal{Y}. Therefore, several general assumptions must hold with respect to the data distribution. In particular, graph-based semi-supervised learning exploits the manifold and smoothness assumptions; that is, the data can be embedded in low-dimensional manifolds through a graph, where data points on the same manifold should have the same label [15, 135].
Here, the embedded graph is often a weighted graph whose edge weights denote the pairwise similarity between two entities; a missing edge corresponds to infinite distance. In most cases, the graph needs to be constructed from the data by manually defining a similarity measurement. However, sometimes the data can be naturally expressed as a graph, as in social and citation networks. Graph-based semi-supervised learning has shown superior performance and efficiency on the semi-supervised node classification task, even for large-scale networks [130].

As discussed above, the application of a graph-based method first requires the construction of a graph over the input data. Let G = (V, E) be a graph whose nodes V = {1, ..., n} represent the data samples and whose edges E denote the similarity distances between them. These distances are given by an adjacency matrix W ∈ R^{n×n} defined by

W_{ij} := \begin{cases} w(e), & \text{if } e = (i,j) \in E, \\ 0, & \text{if } e = (i,j) \notin E, \end{cases}    (2.1)

where w : E → R is the edge weight. There are different metrics for characterizing the similarity distance. For example, in a social network, W_{ij} can be defined as 1 if two users are friends. Another typical choice is the Gaussian kernel of width σ,

W_{ij} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),    (2.2)

where x_i and x_j are node features. Note that W is symmetric with nonnegative entries; for the Gaussian kernel it is, in fact, positive semi-definite.
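As a concrete illustration, the following minimal NumPy sketch builds the Gaussian-kernel adjacency matrix of Eq. (2.2) from a feature matrix. The function name, the toy features, and the choice of σ are illustrative assumptions, not part of the original text.

```python
import numpy as np

def gaussian_kernel_graph(X, sigma=1.0):
    """Dense weighted adjacency with W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), cf. Eq. (2.2)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # drop self-loops so that W only encodes pairwise similarity
    return W

# toy usage: six random 2-D points
X = np.random.rand(6, 2)
W = gaussian_kernel_graph(X, sigma=0.5)
print(W.shape)  # (6, 6), symmetric with nonnegative entries
```

In practice, the dense kernel matrix is usually sparsified (e.g., by keeping only the k nearest neighbors of each node) before being used on large datasets.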
Based on the smoothness assumption, once we have derived the graph from the data, regularization (i.e., smoothness) should be applied to the label value of each node to achieve correct predictions, i.e., local neighborhoods are forced to have the same label predictions. This general idea leads to a common regularization framework into which most graph-based methods can be cast, with different design choices for the graph regularization. Formally, this framework can be expressed as the variational problem

\min_{f} \sum_{i=1}^{n} \ell(f(x_i)) + \mu \sum_{i=1}^{l} \ell(f(x_i), y_i),    (2.3)

where f(x_i) is the prediction function at the i-th node and µ is a positive hyperparameter balancing the two terms. The first term is the graph regularization loss applied to the entire node set, and the second term is the label fitness to the ground-truth labeled samples. There are numerous methods with different design choices for the regularization [102, 130, 180, 184, 189, 190, 192]. Here, we introduce one typical method named LGC [184], since it can also be solved by a label propagation process instead of direct optimization and is related to this dissertation.

We first introduce some notation. A matrix F = (f_1^T, ..., f_n^T)^T ∈ R^{n×c} corresponds to a classification of the dataset X obtained by labeling each sample x_i with the label y_i = argmax_{j≤c} F_{ij}. Define a matrix Y = (y_1^T, ..., y_n^T)^T ∈ R^{n×c} with Y_{ij} = 1 if x_i is labeled as y_i = j and Y_{ij} = 0 otherwise. Then, the optimization problem in LGC can be expressed as

\min_{f} \sum_{i=1}^{n} \ell(f(x_i)) + \mu \sum_{i=1}^{l} \ell(f(x_i), y_i) = \sum_{i,j=1}^{n} W_{ij} \left\| \frac{1}{\sqrt{D_{ii}}} f_i - \frac{1}{\sqrt{D_{jj}}} f_j \right\|^2 + \mu \sum_{i=1}^{n} \| f_i - y_i \|^2 = \mathrm{tr}(F^T \tilde{L} F) + \mu\, \mathrm{tr}\big((F - Y)^T (F - Y)\big),    (2.4)

where D ∈ R^{n×n} is the degree matrix of W and \tilde{L} = D^{-1/2}(D - W)D^{-1/2} ∈ R^{n×n} is the symmetric normalized Laplacian matrix. The first term forces connected nodes to have the same label predictions, and the second term is the Frobenius norm between the predictions and the labels. Note that the regularization and the label fitness are defined directly in the label space. The criterion function is quadratic, so a closed-form solution can be derived by taking the derivative and setting it to zero. Formally, the optimum is

F^{*} = (I + \mu \tilde{L})^{-1} Y,    (2.5)

where I ∈ R^{n×n} is the identity matrix. On the other hand, the same optimum can be derived from an iterative process on the label values. Formally, in each iteration, the label propagation can be described as

F^{(t)} = \alpha \tilde{W} F^{(t-1)} + (1 - \alpha) Y,    (2.6)

where α ∈ (0, 1) is a weighting hyperparameter and \tilde{W} = D^{-1/2} W D^{-1/2} ∈ R^{n×n} is the symmetric normalized adjacency matrix. We can regard the iteration as each node receiving part of its prediction from its neighbors, with the rest contributed by its initial label. An illustration of this propagation process is shown in Fig. 2.2.

Figure 2.2: A toy example of the label propagation algorithm on the two-moons pattern. The two colors denote two different classes. The initial labeled points are represented as solid shapes and the model predictions as hollow shapes. The figure is from [184].

The converged predictions can be derived by taking the limit of Eq. (2.6), resulting in

F^{*} = (1 - \alpha)(I - \alpha \tilde{W})^{-1} Y.    (2.7)

We can see that the result is the same as Eq. (2.5) if we let µ = α/(1 − α).
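A compact NumPy sketch of the standard LP iteration in Eq. (2.6) is given below. It illustrates the classical algorithm reviewed here rather than the methods proposed later in this dissertation; the function name and iteration budget are our own assumptions.

```python
import numpy as np

def label_propagation(W, Y, alpha=0.9, n_iter=100):
    """Iterate F(t) = alpha * W_tilde @ F(t-1) + (1 - alpha) * Y, cf. Eq. (2.6).
    W: (n, n) symmetric nonnegative adjacency matrix.
    Y: (n, c) one-hot label matrix with all-zero rows for unlabeled nodes."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))          # guard against isolated nodes
    W_tilde = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]    # D^{-1/2} W D^{-1/2}
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (W_tilde @ F) + (1.0 - alpha) * Y
    return F  # predicted class of node i: F[i].argmax()

# the limit agrees with the closed form of Eq. (2.7):
# F_star = (1 - alpha) * np.linalg.inv(np.eye(W.shape[0]) - alpha * W_tilde) @ Y
```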
2.3 Graph Neural Networks

Due to the recent success of deep learning and neural networks, there has been an effort to generalize such methods to graph-structured data. In summary, the algorithms along this track can be divided into two categories [22]: 1) those that directly employ the graph structure, i.e., spatial-based methods, and 2) those that design convolutional filters in the graph spectral domain, i.e., spectral-based methods. The emerging graph convolutional network (GCN) shows the equivalence between these two categories and has inspired subsequent model designs.

Different from other types of data, e.g., images, videos, and texts, graphs lie in a non-Euclidean space without a fixed grid structure. This arbitrary form hinders neural network methods from being applied directly to graphs because of the irregular receptive fields it induces. Alternatively, methods are designed that exploit the graph structure to generate applicable receptive fields. For example, some methods explore the graph structure as context information, based on random walks (e.g., DeepWalk [113], node2vec [39]), neighborhoods (e.g., LINE [133], SDNE [140]), or high-order proximity (e.g., DCNN [2], GraRep [14]). A skip-gram model [95] is then applied to the generated context sequences to learn the latent representations. Others directly inspect the neighbors of each node and leverage graph labeling algorithms to preserve a local ordering of the receptive fields [106, 174, 177]. Common convolutional neural networks (CNNs) [36] can then be applied directly to these generated receptive fields for downstream applications.

Although spatial-based methods are effective in graph learning, they are deficient in both model and time complexity; e.g., a dedicated procedure needs to be designed for generating receptive fields. Meanwhile, another line of work, which implements neural networks in the graph spectral domain, has devised an efficient and effective solution. These methods rely on spectral analysis of graph signals, which we briefly introduce in the following.

Graph Signal Processing. Given a graph G = (V, E), a graph signal can be formulated as a function f : V → R defined on the vertices. It can be represented as a vector f ∈ R^N, where the i-th entry of f denotes the function value at the i-th vertex in V, as shown in Fig. 2.3. The processing of graph signals, e.g., Fourier transforms, filtering, and convolution, can be intuitively carried over from classical signal processing.

Figure 2.3: A graph with positive and negative signals shown as red and blue bars, respectively.

Graph Laplacian. We first recall the graph Laplacian matrix defined as L := D − W, which was introduced in Sec. 2.2. Since the graph Laplacian L is a real symmetric matrix, it has a complete set of eigenvectors. The eigendecomposition of the graph Laplacian can thus be written as

L = \Phi \Lambda \Phi^T,    (2.8)

where Φ = (ϕ_1, ..., ϕ_n) are the eigenvectors and Λ = diag(λ_1, ..., λ_n) are the associated eigenvalues. We further assume the eigenvalues are ordered as λ_1 < λ_2 ≤ ... ≤ λ_n := λ_max.

Graph Fourier Transform. The classical Fourier transform is the expansion of a function in terms of the eigenfunctions of the Laplace operator. Analogously, the graph Fourier transform \hat{f} of a graph signal f ∈ R^N is an expansion in terms of the eigenvectors of the graph Laplacian,

\hat{f} = \Phi^T f.    (2.9)

The inverse graph Fourier transform is then defined by

f = \Phi \hat{f}.    (2.10)

Similarly, the eigenvalues Λ carry a notion of frequency, as in the classical setting. Specifically, graph signals projected onto the directions (eigenvectors) associated with smaller eigenvalues are smoother than those associated with larger eigenvalues, i.e., the values of incident nodes connected with large weights vary slowly. One example is given in Fig. 2.4.

Figure 2.4: The graph Laplacian eigenvectors associated with different eigenvalues, where λ_0 < λ_1 < λ_50. Clearly, f_0 varies slowly compared to f_50. The figure is from [129].

Convolution and Spectral Filtering. In classical signal processing, spectral filtering amplifies or attenuates components in the frequency domain, i.e., it is a convolution of the signal and the filter in the time domain. Once the graph Fourier transform is defined, graph spectral filtering follows analogously. Formally, given a graph signal f and a filter g, their convolution is

f \star g = \Phi\big((\Phi^T f) \circ (\Phi^T g)\big) = \Phi\, \mathrm{diag}\big(\hat{g}(\lambda_1), ..., \hat{g}(\lambda_n)\big)\, \Phi^T f = \Phi\, \hat{g}(\Lambda)\, \Phi^T f = \hat{g}(\Phi \Lambda \Phi^T) f = \hat{g}(L) f,    (2.11)

where ◦ denotes the element-wise product. More concepts from classical signal processing, such as translation, modulation, and dilation, can be defined analogously on graphs; they are outside the scope of this dissertation, and we refer to [129] for more detailed information.
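The following short NumPy sketch makes Eqs. (2.8)-(2.11) concrete by computing the graph Fourier transform of a signal and applying an ideal low-pass filter. The `keep` fraction and the function name are illustrative choices, not part of the original text.

```python
import numpy as np

def spectral_lowpass_filter(W, f, keep=0.2):
    """Keep only the components of the signal f on the Laplacian eigenvectors with the
    smallest eigenvalues (the smooth, low-frequency part), cf. Eqs. (2.8)-(2.11)."""
    L = np.diag(W.sum(axis=1)) - W     # combinatorial Laplacian L = D - W
    lam, Phi = np.linalg.eigh(L)       # L = Phi diag(lam) Phi^T, eigenvalues ascending (Eq. (2.8))
    f_hat = Phi.T @ f                  # graph Fourier transform (Eq. (2.9))
    k = max(1, int(keep * len(lam)))
    f_hat[k:] = 0.0                    # attenuate the high-frequency components
    return Phi @ f_hat                 # inverse transform (Eq. (2.10))
```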
In the following sections, we introduce the development of graph convolutional networks, which combine basic graph spectral analysis with the modern neural network paradigm.

Graph Neural Networks. Many useful concepts have been introduced in modern neural network design, e.g., nonlinear activation functions, multiple layers, and backpropagation [36]. The core idea is that we can parameterize the unknown variables through neural networks and derive their optimal values in a data-driven way, i.e., the parameters are learned by gradient backpropagation of a loss function evaluated on data samples. An intuitive way to bring this paradigm to the graph spectral convolution in Eq. (2.11) is to 1) parameterize the unknown filter g, 2) stack multiple identical convolutional layers, 3) include nonlinear activation functions between layers, e.g., ReLU, and 4) design a suitable loss function, e.g., the cross-entropy loss for classification. These steps result in the method of [12].

Spectral CNN [12]. Based on the above analysis, the spectral convolutional layer takes the form

f_l^{(k)} = \sigma\left( \sum_{l'=1}^{p} \Phi_k \hat{G}_{l,l'} \Phi_k^T f_{l'}^{(k-1)} \right),    (2.12)

where F^{(k-1)} = (f_1^{(k-1)}, ..., f_p^{(k-1)}) ∈ R^{n×p} and F^{(k)} = (f_1^{(k)}, ..., f_q^{(k)}) ∈ R^{n×q} represent the input and output graph signals at the k-th layer, respectively, Φ_k = (ϕ_1, ..., ϕ_k) ∈ R^{n×k} contains the first k eigenvectors (i.e., those associated with the k smallest eigenvalues), \hat{G}_{l,l'} = diag(\hat{g}_{l,l',1}, ..., \hat{g}_{l,l',k}) ∈ R^{k×k} is a diagonal matrix of learnable filter coefficients, and σ is the nonlinear activation function. Only the first k eigenvectors are used because signals associated with large eigenvalues are often noisy and carry little information; this can also be regarded as a smoothening operation on the input signals. Although conceptually important, this method still has several deficiencies. First, the spectral filter depends on the eigenvectors of the graph Laplacian, which requires an expensive eigendecomposition, O(n^3), and is basis dependent. Second, the multiplication by the dense matrix Φ_k incurs a high computational cost, O(n^2). Third, the designed filter is not localized in the graph spatial domain, i.e., the parameters in \hat{G}_{l,l'} are all free, without spatial restrictions.

Chebyshev Spectral CNN (ChebNet) [26]. To solve the problems of the Spectral CNN, ChebNet uses a polynomial filter with a Chebyshev expansion of the basis, yielding a localized filter design and efficient computation. Specifically, the localized polynomial filter is expressed as

\hat{g}_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k \Lambda^k,    (2.13)

where θ ∈ R^K are the learnable parameters. Localization is achieved by the fact that, in the vertex domain, this filter results in a linear combination of the input signal within the K-hop neighborhood of each node. To avoid the cost of the eigendecomposition of the graph Laplacian, ChebNet employs an expansion in the Chebyshev polynomial basis as an approximation,

\hat{g}_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\hat{\Lambda}),    (2.14)

where T_k(\hat{\Lambda}) is the Chebyshev polynomial of order k evaluated at the scaled matrix \hat{\Lambda} = 2\Lambda/\lambda_{max} - I. The spectral filter can then be defined as

f_{out} = \hat{g}_\theta(L) f_{in},    (2.15)

where f_{in}, f_{out} ∈ R^n are the input and output signals, respectively. Under this approximation, the time complexity is reduced to O(K|E|), where |E| is the number of edges, and no eigendecomposition is needed.
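To make Eqs. (2.14)-(2.15) concrete, the sketch below applies a K-term Chebyshev polynomial filter to a signal using the standard three-term recursion T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x). It is a plain NumPy illustration under our own naming; λ_max is computed exactly here for simplicity, whereas in practice it is usually bounded or approximated.

```python
import numpy as np

def chebyshev_filter(L, f, theta):
    """Apply g_theta(L) f = sum_k theta_k T_k(L_hat) f with L_hat = 2 L / lambda_max - I,
    cf. Eqs. (2.14)-(2.15), using the Chebyshev recursion instead of an eigendecomposition."""
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()   # exact here; often replaced by a cheap upper bound
    L_hat = 2.0 * L / lam_max - np.eye(n)
    T_prev, T_curr = f, L_hat @ f           # T_0(L_hat) f and T_1(L_hat) f
    out = theta[0] * T_prev
    if len(theta) > 1:
        out = out + theta[1] * T_curr
    for k in range(2, len(theta)):
        T_prev, T_curr = T_curr, 2.0 * L_hat @ T_curr - T_prev
        out = out + theta[k] * T_curr
    return out
```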
Considering signals with $C$ input channels and $F$ filters, Eq. (2.18) can be generalized to
$$F_{out} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} F_{in} \Theta, \qquad (2.19)$$
where $F_{in} \in \mathbb{R}^{N \times C}$ and $F_{out} \in \mathbb{R}^{N \times F}$ are the input and output signals, respectively, and $\Theta \in \mathbb{R}^{C \times F}$ is the parameter matrix. This convolutional operation can be computed more efficiently than the filter in ChebNet. Similarly, one convolutional operation can be regarded as one layer, and multiple layers can be stacked with nonlinear activation functions in between, resulting in the complete graph convolutional network. Note that by setting $K = 2$ in the polynomial basis, one layer of graph convolution can only involve signals from the one-hop neighborhood of each node. Interestingly, if we compare the graph convolutional layer in Eq. (2.19) with the label propagation iteration in Eq. (2.6), we can see that similar regularization (i.e., smoothing) operations are applied to the node representations, i.e., taking the information partially from the neighborhood and partially from the node itself.

Due to its superior performance and efficient running time, there has been an explosion of research in graph learning and its applications in various domains, e.g., knowledge graphs, scene graphs, data mining, molecules, etc. For example, GraphSAGE [40] generalized GCN to large-scale graphs in the inductive setting. GAT [137] introduced the attention mechanism into layer convolution. SGC [148] showed that GCN can be explained as low-pass filtering of the input signals and further simplified it to logistic regression on the propagated features. APPNP [68] further introduced the idea of PageRank [110] to enhance the learning capability. We refer to [149, 152, 183, 185] for a comprehensive review of the latest graph neural network research and applications.

2.4 Generative Adversarial Networks

A machine learning algorithm can be categorized as either a discriminative or a generative approach. The difference is that discriminative models are designed to estimate the conditional probability $p(y|x)$, while generative models directly estimate the data likelihood $p(x,y)$. Some classical machine learning methods, such as the Gaussian mixture model, Bayesian networks, the hidden Markov model, etc., are all instances of generative models. Recently, due to the universal approximation power of neural networks [50], they have been exploited by several methods for modeling complex data distributions, such as Boltzmann machines [121], deep belief networks [48], variational autoencoders [63], etc. In particular, the generative adversarial network (GAN) [37] is the most intriguing method among them.

A generative adversarial network consists of two components, i.e., a generator G and a discriminator D. These two parts can be formulated as a minimax adversarial game, where the generator aims to map data samples from a prior distribution to the data space to serve as fake samples, while the discriminator tries to tell fake samples from real data. In the end, the discriminator is unable to distinguish between real and fake samples, i.e., the generator can perfectly map the prior distribution to the real data distribution (as shown in Fig. 2.6). This final convergent point of both the generator and the discriminator is called the Nash equilibrium.

Figure 2.6: An illustration of distribution mapping in GAN. The black dotted line denotes the data distribution, the blue dashed line is the discriminator distribution, and the green solid line is the generator distribution. The figure is from [37].
Formally, this can be described as optimizing the objective function
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim p_{data}}\big[\log D(x; \theta_D)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z; \theta_G); \theta_D)\big)\big], \qquad (2.20)$$
where $\theta_D$ and $\theta_G$ denote the parameters of the discriminator and generator, respectively. They can be implemented by different neural networks, e.g., multilayer perceptrons or convolutional neural networks. In practice, an iterative training strategy alternates between generator and discriminator updates, and $-\log D(G(\cdot\,; \theta_G); \theta_D)$ is often employed for the generator optimization.

The success of distribution mapping in GANs has given rise to various applications, e.g., image processing [124, 188], adversarial attacks [114, 166], text translation [24, 173], domain adaptation [49, 89], and graph learning [52, 144, 179]. The main idea behind these applications is that we have two or more perspectives of data distributions. GANs can then effectively model the discrepancies and learn the transformation among distributions.

Chapter 3

Layerwise Trained Bipartite Graph Neural Network

In this chapter, we introduce our work on GNN-based graph representation learning tailored for bipartite graphs. In particular, we focus on learning the representations of nodes within a given graph. The high-level idea is that nodes with similar attributes and neighborhood structure should be encoded closely in the embedding space. However, the inconsistent yet corresponding feature domains in bipartite graphs impede the general message passing in GNNs. Besides, the enormous size of real-world bipartite graphs prevents the direct application of GNNs. To this end, we propose a layerwise-trained bipartite graph neural network (L-BGNN), which is effective on the unique bipartite structure and efficient on large-scale networks.

3.1 Introduction

Graphs are often used to capture and represent complex structural relationships among data items in various application domains, such as social network analysis [104, 105, 115, 140], drug discovery [62, 160], visual understanding [138, 156], etc. Among various forms, bipartite graphs are prevalent in data mining applications. A bipartite graph is a graph whose nodes are divided into two disjoint sets (or partitions) and edges only exist between nodes of these two sets. In an e-commerce recommendation system, as shown in Fig. 3.1, the two sets are represented by users and products, respectively, and edges denote users' purchase history [82].

Figure 3.1: A bipartite graph for an e-commerce system, where node entities in two disjoint sets follow different feature distributions and intradomain and interdomain neighbors are connected by red and blue edges, respectively.

Recently, graph representation learning has emerged as a promising direction for performance improvement of downstream tasks [150]. Efficient joint embedding of the bipartite graph structure and the associated semantic information is studied in this work.

Early graph embedding methods explore the graph structure as context information and then apply an adaptive skip-gram model [95] to the generated context sequences to learn the latent representations. Examples include random walks (e.g., DeepWalk [113], node2vec [39]), neighborhoods (e.g., LINE [133], SDNE [140]), and high-order proximity (e.g., DCNN [2], GraRep [14]). These methods preserve graph topology and node relations in the embedding but fail to incorporate the rich semantic information (e.g., node features). A small sketch of this random-walk-plus-skip-gram pipeline is given below.
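To illustrate the "graph structure as context" idea behind methods such as DeepWalk and node2vec, the following sketch generates random-walk sequences on an assumed toy graph; a skip-gram model (e.g., word2vec) would then be trained on these sequences to obtain node embeddings. The graph, walk lengths, and counts are illustrative assumptions.

```python
# A minimal sketch (assumed toy example, not DeepWalk's reference code) of
# random-walk context generation for skip-gram-based graph embedding.
import numpy as np

rng = np.random.default_rng(1)
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # toy undirected graph as adjacency lists

def random_walk(start, length):
    """Unbiased random walk of a given length starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(int(rng.choice(adj[walk[-1]])))
    return walk

# Several walks per node serve as "sentences"; nodes play the role of words.
walks = [random_walk(v, length=6) for v in adj for _ in range(3)]
print(walks[:3])
```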
With the recent advancement of deep learning, graph neural networks (GNNs) [66, 122] have made tremendous progress in learning structure- and semantics-preserving representations. In general, GNNs recursively update the features of each node by aggregating its neighbors through message passing. As a result, both graph topology and node features are captured. Despite the effectiveness and prevalence of GNNs, they are suboptimal and not tailored to bipartite graphs for the following two reasons.

First, features of nodes in each partition of a bipartite graph may follow different distributions. For example, users and products in Fig. 3.1 have completely different attributes. Representing the bipartite graph as a homogeneous one, as is typically done in GNNs, fails to exploit the extra knowledge from the two distinct domains. One may convert two-hop neighbors to one-hop connections in the same domain. Yet, this approach does not exploit feature correlations across the two sets. Alternatively, one may treat the bipartite graph as a heterogeneous one and utilize methods such as metapath2vec [27], HIN2Vec [30], or HeGAN [52] to embed the different entities. These heterogeneous embedding methods either neglect the rich semantic information in learning [27, 30, 52] or simply project different entities into the same feature space [54, 146, 169]. Besides, they are mainly based on generated meta-path relationships [27, 52], which are limited in bipartite graphs. For example, the bipartite graph in Fig. 3.1 contains only one type of meta-path, namely, user-item-user (UIU). The lack of sufficient meta-path types degrades their performance.

Second, when dealing with large-scale bipartite graphs, it is difficult to obtain a large number of labeled examples, and the computational complexity is high. These are serious problems in practical applications. For example, an e-commerce bipartite graph contains billions of users, and labeling them is expensive and impractical. Although semi-supervised embedding methods [66, 79] can be used to alleviate this problem, it is still challenging when the number of labeled data is significantly less than that of unlabeled data. Unsupervised methods have also been proposed to address the lack of labels in graph embedding [55, 84]. However, they are either not applicable to bipartite graphs or not scalable to large-scale graphs. It was reported in [32] that node degrees in bipartite graphs often follow a power-law distribution, as shown in Fig. 3.2, where most nodes only have a few neighbors. The neighborhood sampling method, which is adopted by GNNs [19, 40, 57] to alleviate the scaling problem, tends to yield either a high bias or a high variance due to incomplete message passing in each layer [34].

Figure 3.2: The power-law distribution of node degrees in a user-item network from the Amazon dataset [53].

To address the above-mentioned challenges, we propose a layerwise-trained bipartite graph neural network (L-BGNN) embedding method in this work. For the first challenge, L-BGNN has two operations, interdomain message passing (IDMP) and intradomain alignment (IDA), to deal with domain inconsistency effectively. The IDMP operation facilitates message passing between the two domains through connected edges. Inspired by distribution matching in adversarial learning [37, 111], the IDA operation minimizes the divergence between the information passed from the opposite domain and that of the intrinsic domain in an adversarial manner.
For the second challenge, L-BGNN can be efficiently trained using a layerwise method so that it maintains a low memory cost while capturing higher-order graph structural information. The adoption of layerwise training removes the need to store all intermediate activation maps of neural layers, as required by conventional end-to-end training. Additionally, the domain shift in shallower layers (i.e., the discrepancy between the two domain features that exists in the early training phase) is not passed to deeper layers in layerwise training. In contrast, this type of error accumulates as the layer depth increases in conventional end-to-end training.

This work has the following contributions.
1. We propose a novel L-BGNN method to deal with domain inconsistency effectively, and it can be efficiently scaled to large-scale bipartite graphs through a layerwise training mechanism.
2. We provide a theoretical analysis of the effect of domain alignment and show that the proposed L-BGNN model can effectively encode the information from two domains in the final graph embedding.
3. We conduct experiments to demonstrate the performance of L-BGNN on several public graph datasets and one large-scale network through a series of downstream tasks. Results show that L-BGNN consistently and significantly outperforms various state-of-the-art methods.

The rest of this chapter is organized as follows. Some background information is given in Sec. 3.2. The L-BGNN method is presented in Sec. 3.3. Experimental results are shown in Sec. 3.4. Related work is reviewed in Sec. 3.5. Concluding remarks are given in Sec. 3.6.

3.2 Problem Formulation and Background

The problem of bipartite graph embedding is formulated and background on the generative adversarial network (GAN) is introduced in this section.

Problem formulation. Let $G = (U, V, E)$ be a bipartite graph, where $U$ and $V$ denote the sets of nodes in the two partitions, and $u_i$ and $v_j$ denote the $i$-th and the $j$-th node in $U$ and $V$, respectively, where $i = 1, 2, \cdots, M$ and $j = 1, 2, \cdots, N$. Edges only exist between nodes in distinct partitions. They are denoted by $e_{ij} \in E$, describing the connection between node $u_i$ and node $v_j$. The graph connections can be represented by incidence matrices, namely, $B_u \in \mathbb{R}^{M \times N}$ for set $U$ and $B_v \in \mathbb{R}^{N \times M}$ for set $V$. For each entry in the incidence matrix $B_u$ (or $B_v$), $B_{u,ij} = 1$ (or $B_{v,ij} = 1$) indicates $e_{ij} \in E$; otherwise, $B_{u,ij} = 0$ (or $B_{v,ij} = 0$). Besides graph connections, each node has its feature information. Feature matrix $X_u \in \mathbb{R}^{M \times P}$ contains the vector $x_{u,i} \in \mathbb{R}^{P}$, which is the feature vector of node $u_i$ in set $U$. Feature matrix $X_v \in \mathbb{R}^{N \times Q}$ is defined analogously for the features of node $v_j$ in set $V$. The goal of bipartite graph embedding is to learn a mapping function that projects each node to a point in a low-dimensional space while preserving the topological structure and handling the different features in the two domains.

Generative adversarial networks. Our work is inspired by the success of GANs in distribution mapping. A GAN can be regarded as a minimax game between two players, i.e., a generator G and a discriminator D, in the following manner:
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim p_{data}}\big[\log D(x; \theta_D)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z; \theta_G); \theta_D)\big)\big], \qquad (3.1)$$
where $\theta_D$ and $\theta_G$ denote the parameters of the discriminator and the generator, respectively. The generator aims to map data examples from a prior distribution to the data space to serve as fake examples, while the discriminator tries to tell fake examples from real data. A minimal sketch of the alternating training procedure is given below.
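The following PyTorch sketch shows the alternating minimax training of Eq. (3.1) with small MLPs and a toy Gaussian "real" distribution. It is an illustrative assumption, not the L-BGNN code or the original GAN implementation; the network sizes, learning rates, and data are made up for demonstration.

```python
# A minimal sketch (assumed toy setup) of the alternating GAN training in Eq. (3.1).
import torch
import torch.nn as nn

torch.manual_seed(0)
data_dim, noise_dim = 2, 8

G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(200):
    real = torch.randn(64, data_dim) * 0.5 + 2.0          # toy "real" samples
    fake = G(torch.randn(64, noise_dim))

    # Discriminator step: maximize log D(real) + log(1 - D(fake)).
    opt_D.zero_grad()
    loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    loss_D.backward()
    opt_D.step()

    # Generator step: the non-saturating loss -log D(fake) mentioned in the text.
    opt_G.zero_grad()
    loss_G = bce(D(fake), torch.ones(64, 1))
    loss_G.backward()
    opt_G.step()
```

The same alternating pattern, with the discriminator classifying source versus target embeddings instead of real versus fake images, underlies the adversarial alignment introduced in Sec. 3.3.3.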
Figure 3.3: An overview of the L-BGNN method. Given inputs $X_u$ and $X_v$ from two domains, we obtain their embeddings $H_u$ and $H_v$ via IDMP and IDA. To aggregate multi-hop neighbor information, we stack multiple layers to form a deep L-BGNN, which is trained in a layerwise mechanism.

3.3 Proposed L-BGNN Method

We begin with an overview of the L-BGNN framework in Sec. 3.3.1. It is followed by elaborations on IDMP and IDA in Secs. 3.3.2 and 3.3.3, respectively. Then, the layerwise training mechanism is proposed in Sec. 3.3.4, and the time complexity of the algorithm is analyzed in Sec. 3.3.5. At the end, we analyze how the embedding results benefit from the adversarial distribution alignment in Sec. 3.3.6.

3.3.1 System Overview

The objective of bipartite graph representation learning is to find the embedding representations $H_u$ and $H_v$ for nodes in sets $U$ and $V$, respectively. Let $f_{emb}$ be a bipartite graph embedding model with parameters $\theta$. We can express $H_u$ and $H_v$ as
$$(H_u, H_v) = f_{emb}(X_u, B_u, X_v, B_v; \theta). \qquad (3.2)$$
The proposed architecture of $f_{emb}$ is shown in Fig. 3.3. It contains three key designs: 1) IDMP, 2) IDA, and 3) layerwise training.

The IDMP operation is denoted by blue blocks in Fig. 3.3. Its goal is to aggregate the information from both domains through connected edges. It can be written as
$$H_{v \to u} = f_u(X_v, B_u; \theta_u), \qquad H_{u \to v} = f_v(X_u, B_v; \theta_v), \qquad (3.3)$$
where $f_u$ and $f_v$ are the IDMP functions for the two domains, respectively, and $H_{v \to u}$ (resp. $H_{u \to v}$) represents the aggregated message passed from $V$ (resp. $U$) to $U$ (resp. $V$).

The IDA operation is represented by orange blocks in Fig. 3.3. Once the aggregated features from the other domain, $H_{v \to u}$ and $H_{u \to v}$, are obtained, we use IDA to fuse the two distinct features into a single representation. It can be expressed as
$$\mathcal{L}_u = \mathcal{L}_{adv}(H_{v \to u}, X_u), \qquad \mathcal{L}_v = \mathcal{L}_{adv}(H_{u \to v}, X_v), \qquad (3.4)$$
where $\mathcal{L}_{adv}$ is an unsupervised adversarial loss. Once the convergence criterion is met, embeddings are inferred from Eq. (3.3).

To encode higher-order neighbor information, the message passing layer (i.e., the graph convolutional layer) in GCN methods [40, 66] is stacked contiguously with a task-specific supervision layer at the end. This training mechanism is called end-to-end training. Yet, there are two problems in applying it to bipartite graphs. First, domain inconsistency hinders message passing between the two domains. Second, the end-to-end training procedure suffers from large memory usage [151]. The latter makes large-scale graph training a big challenge. To address this, we propose a layerwise training method that conducts one message passing step (i.e., IDMP) and one adversarial training step (i.e., IDA) in a layerwise fashion.

3.3.2 Interdomain Message Passing

We first introduce the message passing operation tailored for bipartite graphs. The adjacency matrix of a bipartite graph can be written as
$$A = \begin{pmatrix} 0_u & B_u \\ B_v & 0_v \end{pmatrix}, \qquad (3.5)$$
where $B_u$ and $B_v$ are the incidence matrices for sets $U$ and $V$, respectively, and the diagonal blocks are all zeros. For stability, we normalize $B_u$ as $\hat{B}_u = D_u^{-1} B_u$, where $D_u$ is the degree matrix of $B_u$. The same normalization is applied to $B_v$; a short sketch of this normalization and one aggregation step is given below.
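The following NumPy sketch illustrates the incidence matrices in Eq. (3.5), their row normalization $\hat{B}_u = D_u^{-1} B_u$, and one cross-domain aggregation of the kind used by the IDMP update defined in Eq. (3.6) next. The tiny graph and feature dimensions are assumed for illustration; they are not taken from the experiments.

```python
# A minimal sketch (assumed toy example) of bipartite incidence normalization
# and one cross-domain aggregation step, mirroring Eqs. (3.5)-(3.6).
import numpy as np

rng = np.random.default_rng(0)
M, N, P, Q = 4, 3, 5, 6                 # |U|, |V|, feature dims of the two domains

B_u = np.array([[1, 0, 0],              # U-to-V incidence (every node has a neighbor)
                [1, 1, 0],
                [0, 1, 1],
                [0, 0, 1]], dtype=float)
B_v = B_u.T                             # V-to-U incidence

B_u_hat = B_u / B_u.sum(axis=1, keepdims=True)   # D_u^{-1} B_u (row-normalized)
B_v_hat = B_v / B_v.sum(axis=1, keepdims=True)   # D_v^{-1} B_v

X_u = rng.standard_normal((M, P))
X_v = rng.standard_normal((N, Q))

H_v_to_u = B_u_hat @ X_v                # each u-node averages its V-side neighbors' features
H_u_to_v = B_v_hat @ X_u
print(H_v_to_u.shape, H_u_to_v.shape)   # (M, Q), (N, P)
```

In the full model, these averaged messages are further transformed by the learnable weight matrices and an activation function, as specified in Eq. (3.6).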
Then, the IDMP process is defined as
$$H^{(k)}_{v \to u} = \sigma\big(\hat{B}_u H^{(k)}_v W^{(k)}_u\big), \qquad H^{(k)}_{u \to v} = \sigma\big(\hat{B}_v H^{(k)}_u W^{(k)}_v\big), \qquad (3.6)$$
where $\sigma$ denotes an activation function (e.g., ReLU), $W^{(k)}_u$ and $W^{(k)}_v$ are weight matrices for the two sets, respectively, $H^{(k)}_{v \to u}$ (resp. $H^{(k)}_{u \to v}$) are hidden features of nodes in set $U$ (resp. $V$) passed from the features in $V$ (resp. $U$), and $k$ indicates the layer index. Note that the inputs are the outputs from the previous training layer in layerwise training. The inputs are the raw graph feature matrices only when $k = 0$, i.e., $H^{(0)}_u = X_u$ and $H^{(0)}_v = X_v$.

There are two distinctions between IDMP and the conventional message passing layers in GCNs [40]. First, IDMP only performs aggregation over a node's neighborhood without involving the node itself, while conventional graph convolutional layers usually include a self-loop in the aggregation. Second, IDMP has distinct weight matrices, each of which depends only on the inputs from its respective set. The dimensionality of each weight matrix is determined by its input features accordingly.

3.3.3 Intradomain Alignment

The aggregated neighborhood messages of a node are combined with those of the node itself by averaging [66], maximization, or concatenation [40] in each graph convolutional layer. However, due to domain inconsistency in bipartite graphs, these simple combinations are invalid or suboptimal. To address this problem, we design two combination schemes to align $H^{(k)}_u$ with $H^{(k)}_{v \to u}$ after IDMP. We call them IDA. In the following, we introduce IDA from the perspective of set $U$; the same idea applies to set $V$ as well.

Adversarial Alignment

The first alignment scheme employs adversarial learning [31, 37]. A discriminator is trained to discriminate between vectors sampled from $H^{(k)}_{v \to u}$ and $H^{(k)}_u$. Meanwhile, IDMP behaves as a generator that tries to fool the discriminator about which domain the embeddings come from. In short, this is a two-player minimax game, where the discriminator aims to maximize its ability to identify the two distinct feature representations, and IDMP aims to prevent the discriminator from doing so by making the encoded representation $H^{(k)}_{v \to u}$ (source) as close to $H^{(k)}_u$ (target) as possible. Eventually, they reach the Nash equilibrium, where the discriminator can no longer separate the source domain from the target domain. Then, the feature distribution of set $U$ (target) is aligned with that of set $V$ (source) in an effective way [179]. This type of alignment is referred to as L-BGNN-Adv.

The loss function of L-BGNN-Adv can be derived as follows.

Discriminator's objective. The probability that an input feature vector $h^{(k)}$ is from the source domain $H^{(k)}_{v \to u}$ is denoted by $P_{\theta_D, \theta_G}(source = 1 \,|\, h)$. Conversely, the probability that an input feature vector $h^{(k)}$ is from the target domain $H^{(k)}_u$ is denoted by $P_{\theta_D, \theta_G}(source = 0 \,|\, h)$. Then, the discriminator loss function can be expressed as
$$\mathcal{L}_D(\theta_D \,|\, \theta_G) = -\frac{1}{M} \sum_{i=1}^{M} \log P_{\theta_D, \theta_G}(source = 0 \,|\, h^{(k)}_{u,i}) - \frac{1}{M} \sum_{i=1}^{M} \log P_{\theta_D, \theta_G}(source = 1 \,|\, h^{(k)}_{v \to u,i}). \qquad (3.7)$$
We adopt a neural network as the discriminator to classify the feature embeddings between the source and target domains due to its capability [72]. The parameters $\theta_D$ of the discriminator can be optimized by minimizing $\mathcal{L}_D$.

Generator's objective. In the generative setting, IDMP is trained to align the encoded representation $H^{(k)}_{v \to u}$ with $H^{(k)}_u$ so that the discriminator is unable to distinguish them.
Thus, the generator loss function takes the form
$$\mathcal{L}_G(\theta_G \,|\, \theta_D) = -\frac{1}{M} \sum_{i=1}^{M} \log P_{\theta_D, \theta_G}(source = 0 \,|\, h^{(k)}_{v \to u,i}). \qquad (3.8)$$
The parameters $\theta_G$ of the generator can be optimized by minimizing $\mathcal{L}_G$.

MLP Alignment

Another intuitive approach is to directly minimize the distance between the source and the target embedding spaces. Specifically, we first apply a multi-layer perceptron (MLP) to the aggregated message to transform the embeddings into the same dimensionality as the target domain. Then, a linear mapping is conducted to minimize the distance between the transformed source and the target embedding spaces. The loss function for set $U$ can be written as
$$\mathcal{L}_u = \big\| \mathrm{MLP}(H^{(k)}_{v \to u}) - H^{(k)}_u \big\|_F^2, \qquad (3.9)$$
where $\|\cdot\|_F$ denotes the Frobenius norm, and the loss is symmetric for set $V$. Although this approach is straightforward, we can use it as a reference to judge whether adversarial learning aligns the two feature domains more effectively. This type of alignment is referred to as L-BGNN-MLP.

3.3.4 Layerwise Training

L-BGNN adopts a layerwise training procedure for the efficient training of large-scale bipartite graphs. Fig. 3.4 compares the layerwise training process with the conventional end-to-end training process. In layerwise training, we regard one layer as the training of one BGNN block (i.e., IDMP and IDA). The messages are first passed by IDMP, followed by IDA for domain alignment. The parameters of one layer are trained by optimizing the adversarial losses given in Eqs. (3.7) and (3.8).

Figure 3.4: Comparison of layerwise training with conventional end-to-end training, where both networks have a two-layer structure. (a) Layerwise training. (b) End-to-end training.

Once the generator and the discriminator reach the Nash equilibrium, embeddings are inferred from the generators as
$$H^{(k+1)}_u = H^{(k)}_{v \to u}, \qquad H^{(k+1)}_v = H^{(k)}_{u \to v}. \qquad (3.10)$$
Then, they are used as the input to the subsequent layerwise training. We can stack multiple layers to build a deeper model. The embedding outputs of the last layer are then used for downstream tasks. In contrast, in the conventional end-to-end training process shown in Fig. 3.4b, messages propagate through multiple layers, and parameters are updated simultaneously. In the following, we argue that layerwise training is more memory-efficient and demands less training time. Additionally, L-BGNN can still capture information from multi-hop neighbors by stacking multiple BGNN blocks.

Only one layer of training is alive. This property gives L-BGNN a significant advantage in lowering the memory cost. As shown in Fig. 3.4a, each layer in layerwise training takes the embeddings from the previous layer as its training input. Thus, the memory used in the previous layer can be released. By keeping only one layer alive throughout the entire training process, the memory cost of L-BGNN stays low. Note that the representations in hidden layers need to be cached in end-to-end training, so memory grows with the number of network layers.

Avoid exponential neighborhood expansion. Training deeper neural networks for large-scale graphs results in an exponential neighborhood expansion problem [18, 23].
However, the neighbor expansion problem vanishes in layerwise training, since only messages from one-hop neighbors are involved in each layer and only one layer of training is alive during the entire training process. Additionally, we can further reduce the memory cost and training time by applying minibatch training in each layer. We rewrite the message passing and domain alignment of set $U$ in Eqs. (3.4) and (3.6) for the $i$th node in the $k$th layer as
$$h^{(k)}_{v \to u,i} = \sigma\Big(\big(\sum_{j \in N(i)} \hat{b}_{ij}\, h^{(k)}_{v,j}\big) W^{(k)}_u\Big), \qquad (3.11)$$
$$\mathcal{L}_u = \frac{1}{M} \sum_{i=1}^{M} \mathcal{L}_{adv}\big(h^{(k)}_{v \to u,i}, h^{(k)}_u\big), \qquad (3.12)$$
where set $V$ is defined symmetrically. Based on Eqs. (3.11) and (3.12), we can feed the embeddings in minibatches and optimize the minimax game in each layer accordingly.

Be more robust in training. Layerwise training is more robust in the sense that it can be trained with less hyper-parameter tuning. The discrepancy between the two domain feature distributions degrades the early training phase [85, 117, 159], and this kind of error accumulates as the layer depth increases in end-to-end training. In contrast, this kind of error does not accumulate in layerwise training.

Offer comparable statistical performance. Based on the proof in [163], we have the following theorem on the representation capacity of layerwise training.

Theorem 1. Let $\mathcal{A} = \mathcal{R} \circ \mathcal{L}^{(L)} \circ \cdots \circ \mathcal{L}^{(1)}$ be a GNN where a) each layer mapping $\mathcal{L}^{(l)}$, $l = 1, \ldots, L$, is injective, and b) the readout mapping $\mathcal{R}$ is injective. If $\mathcal{A}$ can be trained in an end-to-end fashion, then $\mathcal{A}$ can also be layerwise-trained with the result achieving the same capacity. Otherwise, if the layer mapping is not guaranteed to be injective but it can distinguish different $h^{(k-1)}_i$, then the capacity of the network is monotonically non-decreasing with deeper layers.

In plain words, if a network architecture is powerful enough through conventional end-to-end training, the same representation power can be achieved through layerwise training.

To train L-BGNN, we adopt the iterative optimization technique that is widely used in the training of GANs. For each layer at each iteration, we alternate the training between the generator and the discriminator. That is, we first fix $\theta_G$ of the generator and generate examples through message passing in Eq. (3.6) to optimize $\theta_D$ through Eq. (3.7), so as to improve the power of the discriminator. Then, we fix $\theta_D$ and optimize $\theta_G$ through Eq. (3.8) to yield better domain alignment. This process is repeated until the equilibrium is reached. The training of L-BGNN is summarized in Algorithm 1.

Algorithm 1 L-BGNN
1: Input: Graph $G = (U, V, E)$; node features $\{X_u, X_v\}$
2: Output: Node representations $Z_u$ and $Z_v$
3: $H^{(0)}_u \leftarrow X_u$; $H^{(0)}_v \leftarrow X_v$
4: for $k = 0, 1, \ldots, K$ do
5:   while not converged do
6:     Sample minibatches $(h^{(k)}_u, h^{(k)}_v)$ from $(H^{(k)}_u, H^{(k)}_v)$
7:     for each minibatch do
8:       IDA($h^{(k)}_u$, IDMP($h^{(k)}_v$))  ▷ Eqs. (3.11) and (3.12)
9:       IDA($h^{(k)}_v$, IDMP($h^{(k)}_u$))  ▷ Eqs. (3.11) and (3.12)
10:    end for
11:  end while
12:  Infer $\{H^{(k+1)}_u, H^{(k+1)}_v\}$ for the $(k+1)$th layer training  ▷ Eqs. (3.6) and (3.10)
13:  Release the memory of $\{$IDMP, IDA, $H^{(k)}_u$, $H^{(k)}_v\}$
14: end for
15: $Z_u \leftarrow H^{(K)}_u$; $Z_v \leftarrow H^{(K)}_v$

3.3.5 Complexity Analysis

The computational complexity of L-BGNN can be analyzed by studying its feature propagation mechanism. Suppose that the number of nodes in the network is $(M + N)$, the minibatch size is $B$, the number of layers is $L$, the number of edges is $\|A\|_0$, and all feature dimensions equal $D$ for simplicity. The complexity of $\hat{B}_u H^{(k)}_v$ in Eq.
(3.6) is $\frac{\|A\|_0}{M+N} BD$ for each minibatch, where $\frac{\|A\|_0}{M+N} B$ denotes the average number of edges in each minibatch. The complexity of the multiplication with the weight matrix $W^{(k)}_u$ is simply $BD^2$. The IDA stage is implemented with a multi-layer perceptron, so the time complexity of its feature propagation is a constant multiple of $BD^2$, where we have ignored the constant coefficients. Therefore, the overall time complexity of L-BGNN for $L$ layers is $O\big(L \frac{\|A\|_0}{M+N} BD + LBD^2\big)$. Note that the complexity of each iteration of our model is linear in the minibatch size $B$, indicating the efficiency and scalability of L-BGNN.

3.3.6 Theoretical Analysis

In this subsection, we provide a theoretical analysis of the relation between the domain alignment effect and the representation model in each domain. The derivation is conducted on set $U$ as an illustration; the same process applies to set $V$ as well. We denote the label of a node by $y$ and its representation vector by $h$. There are two domains, $H^{(k)}_u$ (target) and $H^{(k)}_{v \to u}$ (source), from the output of IDMP at the $k$th layer. Model $M$ outputs the conditional probability $\hat{P}(y|h; \theta)$ of node label $y$ based on its embedding vector $h$ and model parameters $\theta$. The loss function of model $M$ in domain $H^{(k)}_u$ can be written as
$$\mathcal{L}_{M,u} = \mathbb{E}\big(D(\hat{P}_u(y|h; \theta), P(y|h))\big), \qquad (3.13)$$
where $P(y|h)$ is the ground truth, $D(P_1, P_2)$ measures the distance between two distributions, and $\hat{P}_u(y|h; \theta)$ denotes model $M$ trained on representations in the target domain. We can rewrite the expectation as
$$\mathcal{L}_{M,u} = \sum_{h} p_u(h) \cdot D(\hat{P}_u(y|h; \theta), P(y|h)), \qquad (3.14)$$
where $p_u(h)$ is the distribution of the target embedding space. Similarly, the loss function of model $M$ trained on the source domain representations, i.e., $H^{(k)}_{v \to u}$, can be written as
$$\mathcal{L}_{M,v \to u} = \mathbb{E}\big(D(\hat{P}_{v \to u}(y|h; \theta), P(y|h))\big) = \sum_{h} p_{v \to u}(h) \cdot D(\hat{P}_{v \to u}(y|h; \theta), P(y|h)), \qquad (3.15)$$
where $p_{v \to u}(h)$ is the distribution of the source embedding space. Then, the following theorem can be proved.

Theorem 2. If the following two inequalities
$$\frac{|p_u(h) - p_{v \to u}(h)|}{p_{v \to u}(h)} < \epsilon, \quad \forall h \in H, \qquad (3.16)$$
$$D(\hat{P}_u(y|h; \theta), \hat{P}_{v \to u}(y|h; \theta)) < d, \quad \forall h \in H, \qquad (3.17)$$
are met, the following inequality also holds:
$$\mathcal{L}_{M,u} - \mathcal{L}_{M,v \to u} \leq \epsilon\, \mathcal{L}_{M,v \to u} + d. \qquad (3.18)$$

Its proof is given in Appendix C.1. The above theorem ensures that an embedding algorithm can support domain alignment better when
1. the two distributions of the embedding space are close, i.e., $p_u(h)$ and $p_{v \to u}(h)$ are indistinguishable, which leads to a smaller upper bound $\epsilon$ in Eq. (3.16);
2. the classifier makes the same prediction on the two embedding spaces, i.e., $\hat{P}_u(y|h; \theta)$ and $\hat{P}_{v \to u}(y|h; \theta)$ produce the same result with a smaller upper bound $d$ in Eq. (3.17).

The first condition can be achieved by adversarial distribution alignment. The second condition can be met via the direct embedding space alignment given in Eq. (3.10). Due to the layerwise training in L-BGNN, the new embedding of the target domain is inferred from the source domain and utilized as the input to succeeding training. Consequently, embedding space alignment helps keep $\hat{P}_u(y|h; \theta)$ and $\hat{P}_{v \to u}(y|h; \theta)$ close almost everywhere.

3.4 Experiments

We conduct experiments with the following three objectives:
• providing performance benchmarks on graph representation between L-BGNN and state-of-the-art methods;
• verifying the effectiveness of domain alignment and layerwise training;
• evaluating the efficiency of L-BGNN by studying memory and time complexities in the embedding of a large-scale bipartite network.
3.4.1 Experimental Setup

We perform experiments on both real-world and synthetic datasets covering a wide range of sizes and characteristics.

Real-world datasets. We examine four public benchmark datasets from real-world applications: Movielens [77] (user and movie), DBLP [32] (author and paper), Yelp [53] (user and business), and Amazon [53] (user and item). The original networks contain more than two node types, so two main entities are selected to form the bipartite graph, as summarized in Table 3.1. We perform the recommendation task on Movielens and the link prediction task on the other three. Besides, we experiment on a large-scale real-world social network collected from a company. To preserve anonymity, we use Group to represent this dataset. Nodes in the network represent users and communities, respectively, and both are described by dense off-the-shelf feature vectors. Edges only exist between distinct entities, which results in a bipartite graph. A small portion of the nodes is labeled with two classes for evaluation, since labeling every node is impractical for a network of such a scale. A node classification task is conducted on the Group dataset. We give a detailed description of the various tasks in their respective experiments.

Synthetic datasets. We synthesize three bipartite graphs by modifying three citation networks: Cora, CiteSeer, and PubMed [66]. To generate bipartite graphs, we divide the nodes of each class into two equal-sized subsets. Then, we remove all edges within the same set and all isolated nodes. Besides, uniform subsampling is conducted on the node attributes to simulate distinct distributions. The resulting network sizes are given in Table 3.1. The same node classification task is conducted on these synthetic datasets.

Table 3.1: Statistics of the eight datasets used in our experiments.

Datasets    #Nodes U   #Nodes V   #Edges      #Feat U   #Feat V   #Classes
Group       619,030    90,044     991,734     8         16        2
Cora        734        877        1,802       1,433     1,000     7
CiteSeer    613        510        1,000       3,703     3,000     6
PubMed      13,424     3,435      18,782      400       500       3
Movielens   6,040      3,706      1,000,209   N.A.      N.A.
DBLP        6,001      1,308      29,254      N.A.      N.A.
Yelp        2,614      1,286      30,837      N.A.      N.A.
Amazon      6,170      2,753      195,790     N.A.      N.A.

Graph preprocessing. In real-world applications, node features are usually difficult to obtain; e.g., the original Movielens, DBLP, Yelp, and Amazon datasets do not contain node attributes. Although constant or identity features can be applied, such a formulation would make messages from different nodes difficult to differentiate [161]. To address this problem, we propose to use the shifted PPMI matrix $X$ [76] as initial features.¹ The shifted PPMI matrix has elements of the form
$$X_{ij} = \max\Big\{\log\Big(\frac{M_{ij}}{\sum_k M_{kj}}\Big) - \log(\beta),\ 0\Big\}, \qquad (3.19)$$
where $\beta$ is set to $1/N$ and
$$M = \hat{A} + \hat{A}^2 + \cdots + \hat{A}^t, \qquad (3.20)$$
and where $\hat{A}$ is the random-walk normalized adjacency matrix. The row vector $x_i^T$ in $X$ is the feature vector representing the context information of node $v_i$ in the graph. In practice, we set $t = 2$ to balance efficiency and performance.

¹ It is also possible to pretrain other scalable methods to obtain input features, e.g., DeepWalk or metapath2vec.

Benchmarking methods. We compare L-BGNN with several unsupervised graph embedding methods on several downstream tasks. Specifically, they can be classified into three categories: homogeneous, heterogeneous, and bipartite network embedding. We also conduct a model efficiency evaluation of L-BGNN against one supervised graph embedding method. These methods are described below.
• node2vec [39]. It is an extension of word2vec [96] to graphs.
It performs biased random walks on networks and employs the skip-gram model on the generated sequences.
• ANRL [182]. It proposes a neighbor-enhancement auto-encoder and an attribute-aware skip-gram model to jointly encode each node's attribute information and the structural correlations between its direct and indirect neighbors.
• VGAE [67]. It learns latent node representations based on a variational auto-encoder, where GCNs are used as the encoder and edge existence is predicted by the decoder.
• GraphSAGE-MEAN [40, 66]. It is an extension of GCNs [66] to the unsupervised scenario with a skip-gram-type loss. Besides, sampling is adopted in each minibatch training step to deal with large-scale graphs.
• HeGAN [52]. It is an adversarial learning method on heterogeneous networks that makes the generator produce realistic relation triplets in an adversarial way.
• FeatWalk [59]. It is a scalable framework for heterogeneous feature embedding problems, where particular random walks are performed directly on the feature matrix and a skip-gram model is used for node embeddings.
• metapath2vec [27]. It adopts random walks associated with meta-paths to preserve semantics in heterogeneous graph embedding.
• HIN2Vec [30]. It jointly learns node and meta-path representations in heterogeneous networks by carrying out multiple prediction tasks based on a set of relationships.
• BiNE [32]. It adopts a representation learning method specifically for bipartite graphs by performing biased random walks to capture the higher-order structural relations exclusive to bipartite graphs.
• BiANE [58]. It is an attributed bipartite network embedding method that models the intra- and inter-partition proximity simultaneously and adopts a latent correlation training approach to preserve attribute and structure proximity.
• AS-GCN [57]. It is a supervised graph embedding method that addresses GCN scalability, where adaptive sampling is employed in each forward pass to reduce the graph size.

We implement each method based on its released code and follow the same settings. Since feature distributions are inconsistent across the two sets of bipartite graphs, the direct application of homogeneous network embedding methods (i.e., ANRL, VGAE, GraphSAGE, and AS-GCN) is not suitable. For a fair comparison, we align the attributes from the two partitions by padding them with zeros to a fixed length. This results in a homogeneous feature dimension, on which those methods can be run accordingly. For methods that require meta-paths (i.e., metapath2vec, HIN2Vec, and BiANE), we use "IUI" (item-user-item) in the experiments, which is symmetric to "UIU" in bipartite graphs. For methods that support identity features as input (i.e., VGAE, GraphSAGE, and HeGAN), we run experiments on both the original attributes and random identity features and report the best results. We evaluate each benchmarking method five times on the downstream tasks and report the best performance.

Our L-BGNN method is implemented in PyTorch [112]. For each layerwise training, we use the Adam optimizer with a learning rate of 0.001 and adopt dropout and weight decay to prevent overfitting. A minibatch size of 512 is used for all datasets. We train in epochs until the generator and the discriminator reach the equilibrium and then proceed to the next layer's training. More detailed implementations and parameter settings can be found in Appendix C.2.
Experiments are conducted on a machine with 8 NVIDIA Tesla V100 cards, a 10-core Intel Xeon CPU at 2.40 GHz, and 100 GB of RAM.

Table 3.2: Performance comparison for the node classification task, where OOM denotes "out of memory".

                    Group      Cora                 CiteSeer             PubMed
Methods             Accuracy   Micro F1  Macro F1   Micro F1  Macro F1   Micro F1  Macro F1
node2vec            0.577      0.682     0.663      0.569     0.476      0.700     0.685
ANRL                OOM        0.797     0.785      0.748     0.678      0.390     0.187
VGAE                OOM        0.782     0.754      0.732     0.645      0.823     0.828
GraphSAGE-MEAN      0.580      0.823     0.801      0.748     0.665      0.838     0.843
HeGAN               > 2 days   0.189     0.167      0.228     0.224      0.396     0.298
FeatWalk            0.514      0.237     0.164      0.276     0.213      0.408     0.306
metapath2vec        OOM        0.628     0.596      0.577     0.466      0.609     0.576
HIN2Vec             0.518      0.635     0.593      0.561     0.514      0.630     0.599
BiNE                > 2 days   0.372     0.077      0.293     0.186      0.391     0.297
BiANE               > 2 days   0.791     0.773      0.739     0.654      0.793     0.799
L-BGNN-Adv          0.622      0.859     0.831      0.768     0.698      0.857     0.860

Table 3.3: Performance comparison for the link prediction task.

                    DBLP                          Yelp                          Amazon
Methods             Accuracy  Macro F1  AUC       Accuracy  Macro F1  AUC       Accuracy  Macro F1  AUC
node2vec            0.679     0.669     0.764     0.666     0.666     0.720     0.665     0.665     0.719
ANRL                0.791     0.790     0.853     0.840     0.839     0.890     0.717     0.717     0.795
VGAE                0.553     0.455     0.701     0.760     0.749     0.908     0.623     0.608     0.726
GraphSAGE-MEAN      0.736     0.735     0.760     0.707     0.707     0.712     0.702     0.702     0.704
HeGAN               0.571     0.569     0.610     0.516     0.516     0.520     0.519     0.518     0.525
FeatWalk            0.560     0.553     0.567     0.562     0.554     0.579     0.624     0.623     0.695
metapath2vec        0.822     0.820     0.903     0.846     0.846     0.923     0.741     0.741     0.815
HIN2Vec             0.855     0.854     0.926     0.752     0.744     0.877     0.678     0.672     0.767
BiNE                0.600     0.560     0.780     0.603     0.593     0.660     0.568     0.515     0.713
BiANE               0.858     0.858     0.927     0.857     0.856     0.918     0.739     0.736     0.820
L-BGNN-Adv          0.861     0.860     0.931     0.882     0.882     0.932     0.749     0.749     0.827

3.4.2 Performance Evaluation

We compare the performance of L-BGNN and the benchmarking methods on various downstream tasks in this subsection. For AS-GCN, which is a supervised learning method, we only examine its efficiency on large-scale graphs in Sec. 3.4.6.

Node Classification

We use 80% of the labeled nodes to train a logistic regression classifier on the embedding results and test the classifier on the remaining 20% of nodes. We select four datasets, Group, Cora, CiteSeer, and PubMed, for evaluation. Since the task is multi-class, we adopt Micro F1 (which equals accuracy for the Group dataset) and Macro F1 for evaluation on the test set.

Experimental results are shown in Table 3.2. We see from the table that L-BGNN consistently outperforms all benchmarking methods, showing its effectiveness in learning representations for bipartite graphs. Note the limited performance of pure random-walk-based methods (node2vec, metapath2vec, HIN2Vec, and BiNE) on the three citation datasets. This is because they fail to leverage the rich attribute information in model learning, which is crucial for these datasets, as indicated in previous research [151]. Compared with GraphSAGE, VGAE, ANRL, and BiANE, the superior performance of L-BGNN, which encodes both attributes and the network structure into representation learning, demonstrates a greater ability to fuse initial features and topological information in bipartite graphs. It is worthwhile to point out that another adversarial-training-based method, HeGAN, has limited performance on all three citation datasets. It appears that this model is not robust and is suboptimal in handling bipartite graphs. For the large-scale Group bipartite graph, L-BGNN achieves a performance gain of 7% over the second-best method.
Note that most of the benchmarking methods fail to execute on this network due to either running out of memory or excessively long running times. This is because they either load the entire graph into memory (ANRL, VGAE, and HeGAN) or perform random walks over all nodes (metapath2vec, BiNE, and BiANE), which is not feasible for such large-scale networks. In addition, FeatWalk and HIN2Vec cannot learn reasonable representations due to their insufficient sampling processes on large-scale graphs.

Link Prediction

We predict the author-paper links for DBLP, the user-business links for Yelp, and the user-item links for Amazon. We randomly mask 20% of such edges from the original graph, and they serve as ground-truth positives in the test set. The residual network is used to learn node representations, and then a logistic classifier is used to perform link prediction. To train the logistic classifier, we randomly sample the same number of disconnected node pairs from the original network as negative instances. For the link prediction task, we adopt accuracy, Macro F1, and the area under the curve (AUC) for evaluation.

The results are shown in Table 3.3. Again, L-BGNN consistently and significantly outperforms all benchmarking methods. Since there is no auxiliary attribute information in these three datasets, random-walk-based methods (ANRL, metapath2vec, HIN2Vec, and BiANE) that carefully utilize the topological information are able to achieve better performance. Nevertheless, the superior performance of L-BGNN demonstrates its effectiveness in encoding the bipartite graph topology. The inferior performance of FeatWalk in both tasks shows its weakness in generalizing to bipartite graphs. HeGAN again performs poorly on these datasets, indicating a model deficiency in its adversarial training.

Figure 3.5: Performance comparison on the recommendation task for the Movielens dataset.

Recommendation

We use the Movielens dataset for the recommendation task, which recommends to each user a ranked list of unseen movies that may be of his/her interest. We adopt the leave-one-out method [45] to evaluate the performance, i.e., the last movie of each user is held out for testing while the remaining ones are used for node representation learning. A logistic classifier is trained based on the embedding results, and the top-K movies with the highest predicted probability are selected for each target user. We evaluate the recommendation performance with the Hit Ratio at rank k (HR@k) and the Normalized Discounted Cumulative Gain at rank k (NDCG@k). Three benchmarks from the link prediction task are used. We see from Fig. 3.5 that L-BGNN consistently outperforms the others for all K values on both evaluation metrics. This shows the effectiveness of our model under rank-based metrics for bipartite graphs.

3.4.3 Embedding Visualization

To gain more insights into the node representations obtained by L-BGNN, we show the t-SNE plot [90] in Fig. 3.6, where the vector embeddings of business nodes in the Yelp dataset are illustrated. Different colors denote different business categories.

Figure 3.6: Visualization of embedding results on the Yelp dataset, where colors are used to indicate different business categories. (a) BiNE. (b) HeGAN. (c) metapath2vec. (d) L-BGNN-Adv.

We see from the plot that BiNE and HeGAN cannot effectively identify different business categories, since their embeddings are uniformly distributed in the embedding space. In contrast, metapath2vec and L-BGNN can separate nodes of different business categories. The clusters of L-BGNN are more compact than those of metapath2vec.

Table 3.4: Performance comparison of L-BGNN-Adv against simple domain alignment, where the performance metric is the Micro F1 score.

                        Group    Cora     CiteSeer  PubMed
Raw features            0.501    0.693    0.663     0.754
Aggregated features     0.555    0.825    0.612     0.757
Concatenated features   0.556    0.797    0.637     0.770
L-BGNN-Adv              0.622    0.859    0.768     0.857

Table 3.5: Performance comparison of L-BGNN-Adv and L-BGNN-MLP, where the performance metric is the Micro F1 score.

                Group    Cora     CiteSeer  PubMed
L-BGNN-MLP      0.582    0.784    0.756     0.846
L-BGNN-Adv      0.622    0.859    0.768     0.857

3.4.4 Effect of Intradomain Alignment

We show how L-BGNN benefits from IDA by comparing L-BGNN-Adv, L-BGNN-MLP, and three benchmarking schemes with simple domain alignment on the node classification task in this subsection.

L-BGNN-Adv vs. other alignment schemes. We compare L-BGNN-Adv with three different domain alignment schemes. The first one exploits raw features only, without considering the graph topology. The second one utilizes aggregated features from the opposite set through edge connections. The third one concatenates the feature vectors from the previous two, which can be viewed as domain fusion. A logistic regression classifier is then trained for each scheme. We report their performance as well as that of L-BGNN-Adv in Table 3.4. As shown in the table, L-BGNN-Adv outperforms all three schemes. Apparently, IDA with adversarial alignment leads to an improved representation of a bipartite graph.

L-BGNN-Adv vs. L-BGNN-MLP. To further demonstrate the effectiveness of adversarial alignment, we replace the adversarial alignment with the MLP-based alignment discussed in Sec. 3.3.3. We compare their performance on the node classification task in Table 3.5. We see that L-BGNN-Adv outperforms L-BGNN-MLP on all datasets. This can be attributed to the nonlinear mapping of distributions, i.e., the Jensen-Shannon divergence in adversarial learning, whereas a linear distribution mapping is adopted in L-BGNN-MLP, as given in Eq. (3.9). We further show the training loss of L-BGNN-MLP as a function of the iteration number for the PubMed dataset in Fig. 3.7. Since PubMed is a medium-size dataset with a balanced degree distribution, it is a good dataset for illustration. The loss function converges after around 500 iterations.

Figure 3.7: The training loss of L-BGNN-MLP as a function of the iteration number for the PubMed dataset.

3.4.5 Effect of Layerwise Training

We demonstrate the advantage of the layerwise training design by comparing it with end-to-end training. First, we show the learning curves of layerwise adversarial training in Fig. 3.8. For the training at each layer, the generator and the discriminator play the minimax game. After about 10 epochs, the losses of both converge and the equilibrium is achieved. This demonstrates the effectiveness of adversarial domain alignment. Then, embeddings can be inferred through the generators to initialize the training of the next layer.

Figure 3.8: The training loss of L-BGNN-Adv as a function of the iteration number for the PubMed dataset.

Next, we compare the layerwise training with the conventional end-to-end training for bipartite graphs. To do so, we design an end-to-end architecture with multiple successive IDMP operations and one IDA at the end. Instead of training at each layer, messages are passed through multiple IDMP operations and the losses are calculated at the final IDA operation. The gradients are then backpropagated for parameter updates. Between consecutive IDMP operations, we concatenate their embedding outputs with those of the node itself and use them as the input to the next IDMP operation; namely,
$$H^{(k+1)}_u = \mathrm{CONCAT}(H^{(k)}_{v \to u}, H^{(k)}_u), \qquad H^{(k+1)}_v = \mathrm{CONCAT}(H^{(k)}_{u \to v}, H^{(k)}_v). \qquad (3.21)$$

Figure 3.9: Performance curves as a function of the layer number for the node classification task, where the x-axis denotes the number of layers and the y-axis is the F1 score.

Table 3.6: Performance comparison of two-layer end-to-end training with layerwise training for L-BGNN-Adv, where the performance metrics are the F1 score for Group and the Micro/Macro F1 scores (first/second number) for Cora, CiteSeer, and PubMed.

            End-to-end training   Layerwise training
Group       OOM                   0.622
Cora        0.837/0.809           0.859/0.831
CiteSeer    0.685/0.642           0.768/0.698
PubMed      0.859/0.859           0.857/0.860

In our experiment, we implement the end-to-end training using two IDMP operations followed by one IDA. The experimental results are shown in Table 3.6, where the layerwise training is better than the end-to-end training. Furthermore, the layerwise training requires less memory and has a faster convergence rate. These properties are crucial for large-scale bipartite graphs. Note that the end-to-end training cannot be applied to the large-scale Group network.

Finally, we investigate the effect of the layer number on classification performance and show the results in Fig. 3.9. As the layer number increases, the performance rises first, becomes stable next, and drops slowly at the end. Specifically, the performance increases significantly after two layers of layerwise training. This behavior has two major implications. The increase in the layer number can incorporate multi-hop neighbor information in the embeddings to yield a better representation in the beginning. However, the performance degrades as the layer number continues to increase, which can be attributed to oversmoothing [78, 97]. That is, node embeddings become less distinguishable due to excessive alignment. Still, the performance drop of L-BGNN is smaller than that of GCNs [66], since the layerwise training can alleviate overfitting and error accumulation in deeper networks.

3.4.6 Efficiency for Large-scale Bipartite Graphs

Figure 3.10: Memory cost and training time on the Group dataset. The x-axis of the left figure denotes the wall-clock time in seconds, and the y-axis of both figures is the memory cost. The short blue line of L-BGNN and the orange line of AS-GCN indicate that training has finished, whereas the training time of GraphSAGE is too long to be shown.

It was claimed in Sec. 3.3.4 that the adoption of layerwise training can reduce the memory cost and accelerate training. This is verified by the experimental results for the large-scale Group network given in Fig. 3.10. We see that L-BGNN runs about four times faster and with lower memory consumption than AS-GCN, which is a method specifically designed for applying GCNs to large-scale networks. These results can be explained as follows.
• There is no need for L-BGNN to load the entire network into memory for training. Instead, only one minibatch is required for each iteration.
In contrast, all other methods need to fit the whole graph or partial graphs into memory first [40, 57], which tends to induce exponential neighbor expansion [23].
• L-BGNN does not require sampling at each convolutional layer and thus avoids an additional time-consuming step.
• The unsupervised adversarial loss in L-BGNN does not require further computational resources. In contrast, the unsupervised loss in GraphSAGE [40] is based on the skip-gram model, where extra steps (e.g., random walks, negative sampling, etc.) are required. These often come at the price of longer training time and larger memory.

3.5 Related Work

Our work is related to unsupervised graph embedding. According to the network structure, such methods can be categorized into two types: homogeneous and heterogeneous graph embedding methods.

For homogeneous graph embedding, early models are learned from the network topology, i.e., the adjacency matrix. Matrix factorization methods decompose the processed adjacency matrix to obtain graph embeddings [86, 103]. Koren et al. [70] model the user-item relationship through singular value decomposition (SVD) and factorize the adjacency matrix into latent components. GraRep [14] employs the PPMI matrix to capture higher-order information in the preprocessing step. Other work explores the network structure through random walks and applies the skip-gram model to the generated sequences to learn the representation. DeepWalk and node2vec are typical examples, whose main difference lies in the random walk process. Meanwhile, GNNs (e.g., GCN [66]), which have emerged recently, learn graph representations in an end-to-end fashion with task-specific supervision. The adaptation of GNNs to unsupervised graph embedding is to align embeddings with the graph structure. For example, GraphSAGE [40] leverages random walks and uses the skip-gram objective to force nodes with similar neighborhood structure to be close in the embedding space. GraphGAN [144] models the neighbor distribution of each node through adversarial training. Although these methods are powerful in capturing structural relations in homogeneous graphs, it is nontrivial to generalize them to heterogeneous networks, especially bipartite graphs.

Several methods [27, 30, 125, 132] have been proposed specifically for heterogeneous graph embedding. One approach is to extend homogeneous graph embedding methods through meta-paths so as to preserve the semantic information in the final embeddings. For example, SHNE [170] adopts a framework that jointly models various edge relations and node semantic information in heterogeneous networks. Although bipartite graphs can be regarded as a special class of heterogeneous networks, their unique pairwise-disjoint structure makes a general embedding method suboptimal. Recently, network embedding methods targeting bipartite graphs have been studied in [32, 77, 101, 176]. Bi-HGNN [77] learns embeddings by capturing the community structure of users to yield better recommendation results. However, most previous methods fail to capture the complex feature correlations between partitions and simply treat them as a concatenation of feature vectors.

Generative adversarial networks (GANs) [37] achieve superior performance in many problems [80, 118, 165]. A GAN contains a generator and a discriminator, which compete with each other in an adversarial way to improve the outcome. There are many applications based on this adversarial learning process. Our work is closely related to the use of regularization in representation learning [93, 111, 155, 179].
3.6 Conclusion

An unsupervised node representation learning framework, called L-BGNN, was proposed for bipartite graph embedding in this work. L-BGNN is domain-consistent, unsupervised, and scalable. It uses IDMP as the generator and IDA as the discriminator. For a given node, the discriminator can tell whether a node feature distribution is real or fake, while the generator produces a fake distribution from the opposite domain to mimic the real one. To scale the training to large graphs, L-BGNN adopts a layerwise training mechanism. It can sample minibatches for training in each layer independently. This layerwise training reduces computational and memory costs effectively without degrading performance. Extensive experiments were conducted to demonstrate the effectiveness and efficiency of L-BGNN on several downstream tasks.

Chapter 4

Enhanced Label Propagation for Node Classification

In this chapter, we study a particular graph learning problem named node classification, i.e., predicting labels for unlabeled nodes. Example applications include predicting document categories in a citation network or image classes in a generated manifold graph. In particular, we consider node attributes and labels as graph signals. Based on the spectral analysis of graph signals and the basic smoothness hypothesis, we propose an assumption that holds generally across datasets. This assumption motivates us to design effective low-pass filters for the two signal types and to connect them efficiently through logistic regression, a framework we call GraphHop. Experiments show that our GraphHop method achieves state-of-the-art performance for node classification with low memory cost and fast running time.

4.1 Introduction

The success of deep learning and neural networks [74] often comes at the price of a large number of labeled data. Semi-supervised learning is an important paradigm that leverages a large amount of unlabeled data to address this limitation. The need for semi-supervised learning has arisen in many machine learning problems, with wide applications in computer vision, natural language processing, and graph-based modeling, where labeled data are expensive to obtain and a large amount of unlabeled data exists.

Many efforts have been made in applying neural networks to the semi-supervised node classification problem for graph-structured data in recent years. For example, the pioneering graph convolutional network (GCN) proposed in [66] achieves state-of-the-art node classification performance on citation networks. GCNs conduct layerwise embedding propagation, starting from the node attributes, through a graph with its node labels as supervision. Although GCNs [40, 137, 163, 168] offer impressive results, they still have several drawbacks that limit their capacity. First, the large number of parameters and the non-linearity in the GCN model demand a considerable number of labeled samples in model training. Li et al. [78] attempted to alleviate this problem by introducing co-training or self-training. However, it is still restricted to the GCN framework without significant change. Second, GCNs fail to exploit the label dependence with respect to the graph structure in model learning. Some methods [88, 116, 178] tried to encode label dependence through a generative model. However, a particular inductive bias is assumed, and complicated optimization algorithms are needed.
Third, the end-to-end training through gradient backpropagation makes GCNs difficult to scale to large graphs [19, 40]. Node embeddings of the entire graph have to be stored at all intermediate layers. Besides, the rapid expansion of the neighborhood size in deeper layers makes minibatch training challenging [23]. There is a trade-off between training efficiency and a larger receptive field size [148]. Finally, non-linear activation hinders the understanding of GCNs, and the superior performance of GCNs remains a mystery. Some researchers [78, 79, 148] attempted to interpret GCNs by dropping the non-linear terms and, as a result, a low-pass filter becomes the means of propagating embeddings. It is still not sufficient to explain the whole GCN algorithm.

Prior to the development of GCNs, another line of research based on label propagation (LP) [16, 184, 190] demonstrated great adaptability [158, 171, 180], scalability [145, 172], and efficiency [28, 181] for the semi-supervised node classification problem. In particular, LP exploits the geometry of data entities induced by labeled and unlabeled samples. With few labeled nodes, LP iteratively aggregates label embeddings from neighbors and propagates them throughout the graph to provide labels for all nodes. This iterative process can be viewed as low-pass filtering, where smooth graph signals are extracted given the noisy input [79, 129, 143]. Although LP methods can address the pitfalls in GCNs, their performance is inferior to that of GCN methods on several common benchmarking datasets (e.g., citation datasets).

There could be three reasons to explain the poorer performance of LP methods. First, both attribute smoothening and label smoothening should be considered jointly in model learning. Typically, the propagation (i.e., smoothening operation) is only conducted on label embeddings while node attributes are ignored. Although some work [61, 171, 180] incorporated the attribute information by introducing it as an extra regularization term under the regularization framework, the effect through optimization is somewhat indirect. Still, smoothened attribute information w.r.t. the graph structure is not examined. Second, the label propagation in LP methods is typically implemented by a simple lookup table for unlabeled nodes [184, 190, 192]. Since label embedding parameters are not shared between nodes along iterations, it tends to have a slow convergence rate with inferior performance due to the limited regularization effect between label embeddings. Finally, only the information of one-hop neighbors is propagated at each iteration. Embeddings from multi-hop neighbors are not exploited [130].

In this chapter, we propose an enhanced label propagation scheme, called GraphHop, to tackle those weaknesses in LP. In particular, we treat attributes and labels as two distinct but correlated signals, where correct labels can be predicted from the smoothened attributes. They are assumed to be locally smooth on graphs. GraphHop integrates these two signal types with a two-stage training framework. That is, it conducts attribute smoothening in the initialization stage and label smoothening in the iteration stage as sketched below.

In the initialization stage, smoothened node attributes and labeled samples are used to train a regression classifier that predicts each node's label embedding. The dimension of the label embedding vector is the same as the class number. Its element indicates the probability of belonging to a class.
The label embeddings serve as the starting point of label propagation in the iteration stage.

In the iteration stage, one iteration consists of the following two steps. 1) Label aggregation: For each node, a simple average is first taken over the iterative label embeddings of its neighborhood nodes with the same hop number. Then, it concatenates these averages derived from multi-hop neighbors and itself, which serves as the input to the subsequent label update step. 2) Label update: Regression classifiers are used to smoothen label embeddings. Different classifiers are assigned and trained independently for different hops of aggregated representations so that different hops of information can be processed independently. The new label embeddings are calculated by averaging the predictions from multiple classifiers, so that the embedding parameters are shared between nodes. Since this iterative procedure exploits multi-hop neighbor information together, we name the proposed method "GraphHop".

We conduct extensive experiments on benchmarking datasets of various scales with particularly small label rate settings. The contributions of this work can be summarized below.
1. We propose an enhanced label propagation-based method, called GraphHop, which conducts joint attribute and label smoothening on graphs through regression classifiers.
2. We provide a theoretical justification, with some approximation, for the fact that GraphHop converges faster than the traditional LP.
3. We empirically verify that the collaborative model design can address the three weaknesses of the traditional LP.
4. We conduct extensive experiments to validate GraphHop's effectiveness concerning small label rates and large-scale graphs against state-of-the-art GCN algorithms.
To the best of our knowledge, we are the first to show that an enhanced LP-based method can outperform GCN baselines on several well-known benchmarking datasets. Our codes are publicly available at https://github.com/TianXieUSC/GraphHop.

The rest of this chapter is organized as follows. Some preliminaries are introduced in Sec. 4.2. The GraphHop method is presented in Sec. 4.3. Theoretical analysis is conducted to estimate the required iteration number of the iterative algorithm in Sec. 4.4. Extensive experiments are conducted to show the state-of-the-art performance with low memory usage in Sec. 4.5. Comments on related work, applications and further improvements are made in Secs. 4.6 and 4.7, respectively. Finally, concluding remarks are given, and future research directions are pointed out in Sec. 4.8.

4.2 Preliminaries

In this section, we first formulate the transductive semi-supervised learning problem on graphs in Sec. 4.2.1. Then, we present the smoothness assumption leading to our model design, which holds for most graph learning methods, and provide empirical evidence on the benchmarking datasets in Sec. 4.2.2. Finally, we review three commonly used graph signal processing tools in Sec. 4.2.3.

Figure 4.1: The plot of accuracy curves based on lower frequency (in blue) and higher frequency (in red) components for citation datasets, respectively. The x-axis denotes the selected top-k lower or higher frequency components in percentage, and the top 20% low-frequency region is shaded in green.
4.2.1 Problem Statement

An undirected graph can be represented by a triple $G = (\mathcal{V}, \mathbf{A}, \mathbf{X})$, where $\mathcal{V}$ denotes a set of $n$ nodes, $\mathbf{A} \in \mathbb{R}^{n \times n}$ is the adjacency matrix between nodes, and $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the attribute matrix whose $d$-dimensional row vector is the attribute vector associated with each node. In the setting of semi-supervised classification, each labeled node belongs to one class $y \in \mathcal{C}$, where $\mathcal{C} = \{1, \cdots, c\}$. Nodes are divided into labeled and unlabeled sets with their indices denoted by $\mathcal{L} = \{1, \cdots, l\}$ and $\mathcal{U} = \{l+1, \cdots, n\}$, respectively. Let $\mathcal{H}$ be the set of $n \times c$ matrices with nonnegative entries. Matrix $\mathbf{H} = (\mathbf{h}_1^T, \cdots, \mathbf{h}_l^T, \mathbf{h}_{l+1}^T, \cdots, \mathbf{h}_n^T)^T \in \mathcal{H}$ is the global label embedding matrix for all nodes. It can be divided into $\mathbf{H} = (\mathbf{H}_l^T, \mathbf{H}_u^T)^T$, where $\mathbf{H}_l$ and $\mathbf{H}_u$ represent labeled and unlabeled samples, respectively. For the $i$th row of $\mathbf{H}$, vector $\mathbf{h}_i$ is a probability vector satisfying $\sum_{j=1}^{c} h_{ij} = 1$, whose element $h_{ij}$ is the probability of belonging to class $j$. To classify unlabeled node $i$ in the convergence stage, we assign it the label of the most likely class:
$$ y_i = \arg\max_{j \in \mathcal{C}} h_{ij}. \quad (4.1) $$
The goal of transductive graph learning is to infer the labels of unlabeled nodes (i.e., matrix $\mathbf{H}_u$), given attribute matrix $\mathbf{X}$, adjacency matrix $\mathbf{A}$, and labeled samples. In the extremely small label rate case, the number of samples in the labeled set is much smaller than that in the unlabeled set; namely, $|\mathcal{L}| \ll |\mathcal{U}|$.

4.2.2 Assumption

The attribute matrix and the label embedding matrix can be treated as signals on graphs. For example, the $d$-dimensional column vectors of attribute matrix $\mathbf{X}$ can be viewed as a $d$-channel graph signal. Semi-supervised learning assumes that data points close in a high-density region should have similar attributes and produce similar outputs. This property can be stated qualitatively below.

Assumption 1. The node attribute and label signals are smooth functions on graphs, where correct labels can be predicted from the smoothened attributes.

The smoothness of attributes and labels can be verified by evaluating their means and variances as a function of the length of the shortest path between every two nodes, which is usually measured in terms of the hop count. Here, we propose an alternative approach to verify the smooth label assumption (the study of the smooth attribute assumption on the benchmarking datasets can be found in [79, 107, 148]) and conduct experiments on the Cora, CiteSeer, and PubMed datasets (see Sec. 4.5 for more details) using the following procedure.

Step 1. Compute the graph Fourier basis $\mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \cdots, \mathbf{q}_n]$ and its corresponding frequency matrix $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_n)$ from the graph Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is the diagonal matrix of vertex degrees defined as $D_{ii} = \sum_{j=1}^{n} A_{ij}$. Specifically, the frequencies in $\Lambda$ are arranged in ascending order: $\lambda_1 < \lambda_2 \le \lambda_3 \le \cdots \le \lambda_n$. Similarly, we can define the reverse-ordered frequency matrix, $\Lambda_r = \mathrm{diag}(\lambda_n, \lambda_{n-1}, \cdots, \lambda_1)$, and find the corresponding basis matrix, $\mathbf{Q}_r = [\mathbf{q}_n, \mathbf{q}_{n-1}, \cdots, \mathbf{q}_1]$. It is well known that the basis signals of the graph Laplacian associated with lower frequencies (i.e., smaller eigenvalues) are smoother on the graph [191].

Step 2. Compute the top-$k$ lowest or highest frequency components of labels via
$$ \tilde{\mathbf{Y}} = \mathbf{Q}[:k]^T \mathbf{Y} \quad \text{or} \quad \tilde{\mathbf{Y}}_r = \mathbf{Q}_r[:k]^T \mathbf{Y}, $$
where each row of $\mathbf{Y} \in \mathbb{R}^{n \times c}$ denotes the one-hot embedding of a node label. Then, we reconstruct labels for all nodes via
$$ \hat{\mathbf{Y}} = \mathbf{Q}[:k] \tilde{\mathbf{Y}} \quad \text{or} \quad \hat{\mathbf{Y}}_r = \mathbf{Q}_r[:k] \tilde{\mathbf{Y}}_r, $$
and make a prediction based on the reconstructed label embedding matrix, where the class of node $i$ is decided by Eq. (4.1).
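As a concrete illustration of Steps 1 and 2, the following minimal NumPy sketch (our own illustration, not part of the released GraphHop code) projects the one-hot labels onto the $k$ lowest- or highest-frequency Laplacian components and measures the accuracy of the reconstructed predictions. It assumes a graph small enough for a dense eigendecomposition, with `adj` an n x n adjacency array, `y_onehot` the n x c one-hot label matrix, and `labels` the ground-truth class indices.

```python
# Sketch of the label-smoothness check in Steps 1-2 (dense Laplacian eigendecomposition).
import numpy as np

def frequency_reconstruction_accuracy(adj, y_onehot, labels, k):
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                                   # combinatorial Laplacian L = D - A
    _, q = np.linalg.eigh(lap)                        # eigenvectors in ascending eigenvalue order
    q_low, q_high = q[:, :k], q[:, ::-1][:, :k]       # k lowest- / highest-frequency basis vectors
    acc = {}
    for name, basis in [("low", q_low), ("high", q_high)]:
        y_hat = basis @ (basis.T @ y_onehot)          # project onto k components and reconstruct
        acc[name] = float((y_hat.argmax(axis=1) == labels).mean())
    return acc

# Sweeping k and plotting acc["low"] and acc["high"] reproduces curves of the type in Fig. 4.1.
```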
The above computation yields two classification results for each node: one using smooth components ($\hat{\mathbf{Y}}$) and the other using highly fluctuating components ($\hat{\mathbf{Y}}_r$). We incrementally add the number of Laplacian frequency components, reconstruct label embeddings for classification, and show two accuracy curves for each dataset in Fig. 4.1. The one using low-frequency components is in blue, and the one using high-frequency components is in red. The blue curve rises quickly while the red curve increases slowly in all three cases, supporting our assumption that the graphs' label signals are smooth. Interestingly, the performance does drop when adding more high-frequency components for PubMed.

4.2.3 Smoothening Operations

Based on Assumption 1, we expect that a smoothening operation on node attributes and labels helps achieve better classification results. There are several common ways to achieve the smoothening effect, which can be categorized into the following three types.
1. Average attributes or label embeddings of neighboring nodes. It is well known that the average operation behaves like a low-pass filter [148]. Another reason for taking the average is that nodes have different numbers of $m$-hop neighbors, denoted by $\Omega_m$, where $m = 1, 2, \cdots$. By averaging the attributes/labels of nodes with the same hop distance, we can consider the impact of $m$-hop neighbors for all nodes uniformly.
2. Use regression for label prediction. We need to predict labels for a great majority of nodes based on the attributes of all nodes and the labels of a few nodes. One way for prediction is to train a regressor. Regression (regardless of a linear or a logistic regressor) is fundamentally a smoothening regularization operation since it adopts a fixed but smaller model size to fit a large number of observations.
3. Incorporate a smooth regularization term in an objective function. One can introduce a quadratic penalty term and, at the same time, remain consistent with the initial labeling. The results can be equivalently derived from iterations in LP [16, 184] as elaborated below.

The smoothening regularization framework can be formally defined as the minimization of the objective function
$$ J(\mathbf{H}) = \underbrace{\|\mathbf{H} - \mathbf{Y}\|_F^2}_{\text{least-square penalty}} + \mu \underbrace{\mathrm{Tr}(\mathbf{H}^\top \tilde{\mathbf{L}} \mathbf{H})}_{\text{Laplacian regularization}}, \quad (4.2) $$
where $\mathbf{Y} = (\mathbf{Y}_l^T, \mathbf{Y}_u^T)^T$ consists of the one-hot label embedding matrix $\mathbf{Y}_l$ of labeled nodes and zeros $\mathbf{Y}_u$ of unlabeled nodes, $\|\cdot\|_F$ is the Frobenius norm, $\mu$ is a parameter controlling the degree of the regularization, and $\tilde{\mathbf{L}}$ is the symmetric normalized graph Laplacian matrix. The optimal embedding matrix $\mathbf{H}$ in Eq. (4.2) can be derived equivalently by applying an embedding propagation process. The update at the $t$th iteration is derived as
$$ \mathbf{H}^{(t+1)} = \alpha \tilde{\mathbf{A}} \mathbf{H}^{(t)} + (1-\alpha)\mathbf{H}^{(0)}, \quad (4.3) $$
where $\tilde{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}} \mathbf{A} \mathbf{D}^{-\frac{1}{2}}$ is the symmetric normalized adjacency matrix, and $\mathbf{H}^{(0)}_l = \mathbf{Y}_l$ for labeled samples while it is zero for unlabeled samples. The update rule can be described as replacing the current label embeddings with their neighbors' averages through a lookup table plus the initially labeled samples. As shown in Eqs. (4.2) and (4.3), we see that the LP method exploits smoothening operations of the first and the third types, respectively.
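For concreteness, the iterative update in Eq. (4.3) can be written as the following minimal NumPy/SciPy sketch; it is our own illustration, assuming `adj` is a symmetric scipy.sparse adjacency matrix and `y_onehot` contains one-hot rows for labeled nodes and zeros elsewhere. The convergence tolerance and the default alpha are illustrative choices rather than values used in the thesis.

```python
# Vanilla label propagation, i.e., iterating Eq. (4.3) until convergence.
import numpy as np
import scipy.sparse as sp

def label_propagation(adj, y_onehot, alpha=0.9, max_iter=1000, tol=1e-6):
    deg = np.asarray(adj.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(np.where(deg > 0, deg, 1.0) ** -0.5)
    a_norm = d_inv_sqrt @ adj @ d_inv_sqrt            # symmetric normalization D^-1/2 A D^-1/2
    h = y_onehot.copy()                               # H^(0)
    for _ in range(max_iter):
        h_new = alpha * (a_norm @ h) + (1 - alpha) * y_onehot   # update of Eq. (4.3)
        if np.abs(h_new - h).max() < tol:
            break
        h = h_new
    return h                                          # approaches the fixed point of the update
```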
A closed-form solution to Eqs. (4.2) and (4.3) can be equivalently cast in the form of
$$ \mathbf{H} = (1-\alpha)(\mathbf{I} - \alpha \tilde{\mathbf{A}})^{-1} \mathbf{H}^{(0)}. \quad (4.4) $$

Based on the propagation process in Eq. (4.3), we observe several limitations that restrict the learning capability of embedding propagation and update:
• The node attributes, especially the smoothened node attributes, are not exploited.
• Label embeddings are not shared between nodes so that their correlations are not considered.
• Only label embeddings of one-hop neighbors are propagated at each iteration. Therefore, the multi-hop relationships are omitted in model learning.

Figure 4.2: An overview of the proposed GraphHop method, where the left subfigure illustrates the initialization stage, the middle subfigure shows the label aggregation and label update steps in the iteration stage, and the converged result is given in the right subfigure. Red and blue arrows denote the adaptation of the same framework in the initialization stage and the iteration stage, respectively. Note that unique classifiers are applied to multi-hop aggregations with different M values, as shown in the middle.

4.3 GraphHop Method

The GraphHop method is introduced in this section to target the deficiencies mentioned above. It is an iterative algorithm that consists of an initialization stage and an iteration stage, as shown in Fig. 4.2. The initialization stage predicts the initial label embeddings of all nodes based on the smoothened node attributes and labeled nodes via regression. The predicted label embeddings then serve as the starting point for the subsequent iterations. The iteration stage conducts label aggregation and update to smoothen the label signals. The attribute information is not needed in the iteration stage. GraphHop has the following features to address the limitations of the traditional LP methods as described in the last section.
• The smoothened attributes are extracted and used to train regressors in the initialization stage, where the predicted label embeddings are further smoothened in the subsequent iterations.
• Another set of regression classifiers is trained on all nodes to smoothen the label embeddings, which plays the role of embedding parameter sharing.
• Mixed multi-hop neighborhood representations serve as the input to each regression classifier independently. The predictions from different hop neighbors are then aggregated to update the label embedding of each node.

4.3.1 Initialization Stage

We attempt to smoothen the node attribute input and provide initial label embeddings for all nodes in the initialization stage. Therefore, we design the first smoothening operation and apply it to the attribute signals. Mathematically, for node $j$ and its $m$th-hop neighbors, we have the following averaged attribute representations
$$ \mathbf{x}_{j,m} = \frac{\sum_{l \in \Omega_m(j)} \mathbf{x}_l}{|\Omega_m(j)|}, \quad m = 0, 1, \cdots, M, \quad (4.5) $$
where $\Omega_m(j)$ denotes the $m$-hop neighbors of node $j$ and $m = 0$ indicates node $j$ itself. The averaged attribute vectors from different hops (i.e., different $m$) are concatenated, which yields a higher-order neighborhood representation (see Figs. 4.2 and 4.3).
By considering all the nodes, we can express this as
$$ \mathbf{X}_M = \big\Vert_{0 \le m \le M}\, \tilde{\mathbf{A}}^m \mathbf{X}, \quad (4.6) $$
where $\Vert$ denotes column-wise concatenation, $\tilde{\mathbf{A}}^m$ is the normalized $m$-hop adjacency matrix, and $\mathbf{X}_M \in \mathbb{R}^{n \times d(M+1)}$ is the smoothened attribute matrix. A larger $M$ value generates a stronger signal smoothening result [79, 148].

To derive the initial label embeddings, we leverage a logistic regression (LR) classifier trained on this smoothened attribute matrix with labeled samples as supervision. The objective function can be written as
$$ \mathcal{L} = \frac{1}{|\mathcal{L}|} \sum_{y \in \mathcal{L},\, \mathbf{x}_M \in \mathbf{X}_{M,l}} H\big(y,\, p_{model}(y \mid \mathbf{x}_M; \theta)\big), \quad (4.7) $$
where $H(p, q)$ is the cross-entropy loss, $p_{model}$ is the LR classifier, and $\theta$ is the set of model parameters. Combinations of attributes from various distances (i.e., different $M$) possess distinct relationships and should be considered independently [1, 69]. To encode multi-hop correlations, a unique classifier is adopted for each hop of aggregation (see Fig. 4.2) to effectively embed multi-hop attribute information. Afterward, the label embeddings for all nodes are updated by averaging the predictions of all the converged LR classifiers. Formally, this can be written as
$$ \mathbf{H}^{(0)} = \frac{1}{K} \sum_{M=1}^{K} p_{model}(Y \mid \mathbf{X}_M; \theta), \quad (4.8) $$
where $K$ is the number of classifiers trained on different hops of aggregation, and $Y$ is the distribution over class labels. The intuition of leveraging the classifier predictions as the initial label embeddings is justified by Assumption 1. That is, nodes in spatial proximity should have similar attributes and produce similar labels.

4.3.2 Iteration Stage

The label embeddings predicted in the initialization stage are further processed in the iteration stage. Each iteration contains two steps, label aggregation and label update, to smoothen the initial label embeddings. They are introduced as follows.

Figure 4.3: Illustration of attribute or label embedding aggregation in GraphHop, where multi-hop embeddings are averaged and then concatenated to form the input to the LR classifier. The example in the figure gives the aggregation of the center green node with M = 2, where the red and blue regions correspond to one-hop and two-hop neighbors, respectively.

Label Aggregation

The label aggregation step is the same as the attribute aggregation mechanism stated in Sec. 4.3.1, except that it is performed on the iterative label embeddings rather than the node attributes. By modifying Eq. (4.6) slightly, we have the following label aggregation formula
$$ \mathbf{H}_M^{(t-1)} = \big\Vert_{0 \le m \le M}\, \tilde{\mathbf{A}}^m \mathbf{H}^{(t-1)}, \quad t = 1, 2, \cdots, \quad (4.9) $$
where $t$ is the iteration index and $\mathbf{H}_M^{(t-1)} \in \mathbb{R}^{n \times c(M+1)}$ is the smoothened label embedding matrix. The hop number $M$ controls the size of the neighborhoods. A larger $M$ value results in larger memory and computational complexity. However, even with one-hop connections (i.e., $M = 1$), strong smoothening results can be achieved due to the iteration procedure. This is not available in the model of [148] (see Eq. (4.16)). In the experiments, we focus on the case with $M = 2$, as shown in Fig. 4.2.

So far, the label embedding parameters are not shared between nodes; each node embedding is unique and derived independently from its neighborhood. To incorporate the correlations between nodes, we propose to adopt regression classifiers as regularization between nodes and their aggregated results in the label update step, so that new label embeddings are generated globally based on all nodes.
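The aggregation in Eqs. (4.6) and (4.9) amounts to repeated multiplication with the normalized adjacency matrix followed by column-wise concatenation. A minimal sketch (our own, with illustrative names) is given below; `a_norm` is assumed to be a normalized scipy.sparse adjacency matrix and `signal` either the attribute matrix X or the current label embedding matrix H^(t-1) as a NumPy array.

```python
# Multi-hop aggregation of Eqs. (4.6) and (4.9): average over m-hop neighborhoods
# (approximated by powers of the normalized adjacency) and concatenate for m = 0,...,M.
import numpy as np

def aggregate(a_norm, signal, num_hops=2):
    blocks, cur = [signal], signal
    for _ in range(num_hops):
        cur = a_norm @ cur                     # one more hop of neighborhood averaging
        blocks.append(cur)
    return np.concatenate(blocks, axis=1)      # shape: n x (num_hops + 1) * (signal width)
```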
Label Update

We perform smoothening and regularization on the aggregated label embeddings in the label update step. An individual LR classifier is adopted for each hop of aggregation (see Fig. 4.2). Thus, multi-hop label representations are processed independently. However, unlike the initialization stage, where classifiers are trained only on labeled samples, each unlabeled node is now associated with a label embedding (i.e., $\mathbf{H}_u^{(t-1)}$) from the previous iteration. This embedding encodes the confidence of the current predictions. It can serve as a pseudo-label and contribute to classifier training. The implementation details of the LR classifier in the iteration step are elaborated in Appendix D. In short, the label embeddings of unlabeled nodes are also used as supervision and added to the objective function (see Eq. (D.5)). Once all classifiers converge, new label embeddings are averaged from the predictions and used as the input to the next iteration. Formally, this can be expressed as
$$ \mathbf{H}^{(t)} = \frac{1}{K} \sum_{M=1}^{K} p_{model}\big(Y \mid \mathbf{H}_M^{(t-1)}; \theta\big), \quad (4.10) $$
where $K$ is the number of classifiers applied to the $M$-hop aggregations. Note that there are other possible choices for the predictor in Eqs. (4.8) and (4.10), as discussed in [37, 63, 109]. The LR classifier is chosen for its efficiency, since minibatch training can be easily conducted (see Algorithm 2), and for its small model size with a probabilistic output. Finally, the GraphHop method is summarized by the pseudo-codes in Algorithm 2.

Algorithm 2 GraphHop
1: Input: graph A, attributes X, label vectors Y_l
2: Output: label vectors H_u
3: Initialization:
4: X_M ← calculate Eq. (4.6)
5: while not converged do
6:   for each minibatch do
7:     Compute g ← ∇L(X_{M,l}, Y_l; θ) in Eq. (4.7)
8:     Conduct Adam update using gradient estimator g
9:   end for
10: end while
11: H^(0) ← calculate Eq. (4.8)
12: Iteration:
13: for iteration ∈ [1, ..., max_iter] do
14:   H_M^(t-1) ← calculate Eq. (4.9)    ▷ label aggregation
15:   while not converged do
16:     for each minibatch do
17:       Compute g ← ∇L(H_M^(t-1), Y_l, Sharpen(H_u^(t-1)); θ) in Eq. (D.5)
18:       Conduct Adam update using gradient estimator g
19:     end for
20:   end while
21:   H^(t) ← calculate Eq. (4.10)    ▷ label update
22: end for
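As a rough illustration of the inner minibatch loops in Algorithm 2 (lines 5-10 and 15-20), the hedged PyTorch sketch below trains a single logistic-regression classifier with Adam on aggregated inputs. It simplifies the setup above: the per-hop classifiers, early stopping, and the sharpened pseudo-label objective of Eq. (D.5) are omitted, and `inputs`/`targets` are assumed to be tensors built from X_M (or H_M^(t-1)) and the available (pseudo-)labels.

```python
# Minibatch Adam training of an LR classifier, mirroring lines 5-10 / 15-20 of Algorithm 2.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_lr_classifier(inputs, targets, num_classes, epochs=100, batch_size=512):
    model = nn.Linear(inputs.shape[1], num_classes)      # logistic regression = linear layer + softmax loss
    opt = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-5)
    loss_fn = nn.CrossEntropyLoss()
    loader = DataLoader(TensorDataset(inputs, targets), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# The label embeddings of all nodes are then the softmax outputs, as in Eqs. (4.8) and (4.10):
#   h = torch.softmax(model(all_inputs), dim=1)
```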
4.4 Analysis

4.4.1 Convergence Analysis

In the iteration stage, we analyze the relationship between the lower bound of the iteration number, the label rate, and the hop number. The higher-order (larger-hop) features of nodes can be efficiently approximated within $m$-hop neighborhoods, as illustrated in [175]. Here, we call the approximation of an unlabeled node sufficient if at least one initially labeled sample is captured in its $m$-hop surroundings after iterations, and insufficient otherwise. If all unlabeled nodes are sufficient, we call the iteration sufficient. We define the shortest distance between nodes $u$ and $v$ in the graph as $\mathrm{dist}(u, v)$. Given the initially labeled samples, we partition the node set $\mathcal{V}$ into several subsets $\mathcal{V}_0, \cdots, \mathcal{V}_i, \cdots, \mathcal{V}_I$, where $\mathcal{V}_0$ denotes the set of initially labeled nodes, $\mathcal{V}_i = \{u \mid \min_{v \in \mathcal{V}_l} \mathrm{dist}(u, v) = i\}$ is the subset of unlabeled nodes that have the same minimum distance (i.e., of $i$ hops) to one of the labeled nodes, and $I$ is the maximum distance between an unlabeled node and a labeled one. Since each node aggregates $M$-hop neighbor embeddings in one iteration (Eq. (4.9)), we have the following lemma.

Lemma 1. Given the maximum hop $k = \max\{M\}$ covered in each label propagation step, the node predictions in $\mathcal{V}_i$ will be insufficient until the propagation of iteration $i/k$ is finished.

Based on Lemma 1, we can derive the number of sufficient nodes after $t$ iterations.

Theorem 3. Let $G$ be a graph with $n$ nodes and maximum node degree $d$ ($d \neq 1$). The number of initially labeled nodes is $j$. Then, after $t$ iterations, at most $\min\{n,\, jd\,\frac{d^{kt}-1}{d-1}\}$ nodes are sufficient.

Proof. Since $|\mathcal{L}| \ll |\mathcal{U}|$ in semi-supervised learning, we ignore the difference between the number of unlabeled nodes (i.e., $|\mathcal{U}|$) and the number of all nodes in the graph (i.e., $|\mathcal{L}| + |\mathcal{U}|$). According to Lemma 1, after $t$ iterations, the nodes in subsets $\mathcal{V}_1, \mathcal{V}_2, \cdots, \mathcal{V}_{kt}$ are sufficient. Then, we have
$$ |\mathcal{V}_1 \cup \mathcal{V}_2 \cup \cdots \cup \mathcal{V}_{kt}| = |\mathcal{V}_1| + |\mathcal{V}_2| + \cdots + |\mathcal{V}_{kt}| \le jd + jd^2 + \cdots + jd^{kt} = jd\,\frac{d^{kt}-1}{d-1}. \quad (4.11) $$
The first equality holds because $\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_{kt}$ are mutually disjoint. Each node has only one unique minimum distance to the labeled set $\mathcal{V}_0$, so it can only be assigned to one specific subset. The inequality is due to the use of the maximum degree $d$ for every node. Apparently, $|\mathcal{V}_1 \cup \mathcal{V}_2 \cup \cdots \cup \mathcal{V}_{kt}| \le n$. Thus, we get
$$ |\mathcal{V}_1 \cup \mathcal{V}_2 \cup \ldots \cup \mathcal{V}_{kt}| \le \min\Big\{n,\, jd\,\frac{d^{kt}-1}{d-1}\Big\}, \quad (4.12) $$
and the theorem is proved.

The following corollary is then immediate.

Corollary 1. The predictions of all unlabeled nodes on graph $G$ will be sufficient with $t$ iterations, where
$$ t \in \Omega\Big(\frac{1}{k}\log_d\Big(1 + \frac{n(d-1)}{jd}\Big)\Big). \quad (4.13) $$

Proof. According to Theorem 3, at most $\min\{n,\, jd\,\frac{d^{kt}-1}{d-1}\}$ nodes are sufficient after $t$ iterations. To ensure that all nodes on graph $G$ are sufficient, we let $jd\,\frac{d^{kt}-1}{d-1} \ge n$, so that
$$ t \ge \frac{1}{k}\log_d\Big(1 + \frac{n(d-1)}{jd}\Big). \quad (4.14) $$

Corollary 1 shows that the number of iterations required for sufficiency is inversely related to the maximum hop number $k$. Increasing $k$ decreases the required number of iterations. The initial label rate $j$ and the graph density $d$ also influence the iteration number. Nevertheless, their effects are minor due to the logarithm function. Notably, in a large-scale graph where $j \ll n$, changing the label rate has negligible influence on the iteration number. The same behavior is shown in the experimental section. In practice, we observe that GraphHop converges in a few iterations (usually 10) since few iterations are required to achieve sufficiency.
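To get a rough sense of the bound in Eq. (4.14), consider purely hypothetical values (not taken from any dataset in this work): $n = 10^4$ nodes, maximum degree $d = 10$, $j = 100$ labeled nodes, and hop number $k = 2$. Then
$$ t \ge \frac{1}{k}\log_d\Big(1 + \frac{n(d-1)}{jd}\Big) = \frac{1}{2}\log_{10}\Big(1 + \frac{10^4 \cdot 9}{100 \cdot 10}\Big) = \frac{1}{2}\log_{10} 91 \approx 0.98, $$
so a single iteration would already achieve sufficiency under these assumed numbers, which is consistent with the small iteration counts observed in practice.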
4.4.2 Complexity Analysis

The time and memory complexities of GraphHop are significantly lower than those of GCNs for the following reasons. First, there is only one set of node embeddings in GraphHop, while there are multiple layer embeddings in GCNs. Second, minibatch training is straightforward in GraphHop, which is difficult for GCNs due to the neighbor expansion problem [23]. Note that LR classifiers can be directly trained using the aggregation in Eq. (4.9) (resp. (4.6)). Thus, minibatches can be easily applied to matrix $\mathbf{H}_M^{(t-1)}$ (resp. $\mathbf{X}_M$) (see lines 6 and 16 in Algorithm 2). Suppose that, during training, the number of minibatches is $N$, the number of nodes is $n$, the number of iterations is $t$, the size of one minibatch is $b$, and the number of classes is $c$. Then, the time complexity of one minibatch propagation can be computed as
$$ O\Big(n\,\|\tilde{\mathbf{A}}_M\|_0 + t\,\frac{\|\tilde{\mathbf{A}}_M\|_0}{N}\,c + t\,b\,c^2\Big), $$
where the first term is from the computation of multi-hop neighbors, the second term comes from label aggregation, and the third from label update. Note that we can eliminate the first term by considering one-hop neighbors (i.e., $M = 1$) only. The memory complexity is $O(bc + c^2)$, which represents the embeddings of one minibatch and the parameters of the classifiers. We ignore the storage of the adjacency matrix since it is the same for all algorithms. Note that the memory cost is fixed, independent of the iteration number $t$, and scales linearly with the minibatch size $b$.

4.5 Experiments

We conduct experiments to evaluate the performance of GraphHop on multiple datasets and tasks. Datasets used in the experiments are described in Sec. 4.5.1. Experimental settings are discussed in Sec. 4.5.2. Then, the performance of GraphHop is compared with state-of-the-art methods on small- and large-scale graphs in Sec. 4.5.3. Finally, ablation studies are given in Sec. 4.5.5.

4.5.1 Datasets

We evaluate the performance of GraphHop on six representative graph datasets as shown in Table 4.1. Cora, CiteSeer, and PubMed [66] are three citation networks. Nodes are papers, while edges are citation links in these graphs. The task is to predict the category of each paper. PPI and Reddit [40] are two large-scale networks. PPI is a multi-label dataset, where each node denotes one protein with multiple labels from the gene ontology sets (121 in total). Amazon2M [23] is by far the largest publicly available graph dataset, with over 2 million nodes and 61 million edges obtained from Amazon co-purchasing networks. The raw node features are bag-of-words vectors extracted from product descriptions. We use principal component analysis (PCA) [51] to reduce their dimension to 100. Also, we use the top-level category as the class label for each node.

Table 4.1: Statistics of six representative datasets used in experiments.
Dataset   | Vertices  | Edges      | Classes | Feature Dims
Cora      | 2,708     | 5,429      | 7       | 1,433
CiteSeer  | 3,327     | 4,732      | 6       | 3,703
PubMed    | 19,717    | 44,338     | 3       | 500
PPI       | 56,944    | 1,612,348  | 121     | 50
Reddit    | 231,443   | 11,606,919 | 41      | 602
Amazon2M  | 2,449,029 | 61,859,140 | 47      | 100

Table 4.2: The training, validation, and testing splits used in the experiments for large-scale graphs, where the node numbers and the corresponding percentages (in brackets) are listed.
Dataset   | Train                                            | Validation
PPI       | 569 (1%), 1139 (2%), 2847 (5%), 5694 (10%)       | 5000
Reddit    | 2296 (1%), 4592 (2%), 11562 (5%), 22888 (10%)    | 5000
Amazon2M  | 19606 (1%), 35259 (2%), 77700 (3%), 139556 (5%)  | 50000

Table 4.3: Grid search ranges of the hyperparameters.
T: {0.1, 1, 10, 100}
α: {0.01, 0.1, 1, 10, 100}
β: {0, 0.1, 1, 10}

Table 4.4: Hyperparameters used in the experiments for the largest label rate.
Dataset   | T   | α   | β
Cora      | 0.1 | 10  | 1
CiteSeer  | 0.1 | 10  | 1
PubMed    | 0.1 | 1   | 1
Reddit    | 1   | 1   | 0
PPI       | 1   | 1   | 1
Amazon2M  | 1   | 100 | 100

4.5.2 Experimental Settings

We evaluate GraphHop and several benchmarking methods on the semi-supervised node classification task in a transductive setting at several small label rates. For the citation datasets, we first conduct experiments following the conventional train/validation/test split (i.e., 20 labels per class in the training set). Next, we train models at meager label rates (i.e., 1, 2, 4, 8, and 16 labeled samples per class). For the three large-scale networks PPI, Reddit, and Amazon2M, the original data splits target inductive learning scenarios, which do not fit our purpose. To tailor them to the transductive semi-supervised setting, we adopt fewer labeled training samples. Specifically, for Reddit and Amazon2M, we randomly pick the same number of samples in each class at multiple label rates for training. For the multi-label PPI dataset, we simply select a small portion of samples randomly for training. A fixed number of the remaining samples is selected for validation, while the rest is used for testing. The complete data splits are summarized in Table 4.2. For simplicity, we use the percentages of training samples to indicate different data splits when reporting performance results.
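As an illustration of the split protocol in Table 4.2, the hypothetical helper below samples the same number of labeled nodes per class, holds out a fixed-size validation set, and treats the rest as test nodes. The function and variable names are ours and not taken from the released GraphHop code; the multi-label PPI split would require a different sampling routine.

```python
# Per-class labeled-node sampling for a transductive split (train / validation / test indices).
import numpy as np

def make_split(labels, per_class, val_size, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        train_idx.extend(rng.choice(idx, size=per_class, replace=False))
    train_idx = np.array(train_idx)
    rest = rng.permutation(np.setdiff1d(np.arange(len(labels)), train_idx))
    return train_idx, rest[:val_size], rest[val_size:]
```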
We implement GraphHop in PyTorch [112]. For the LR classifiers in the initialization stage and the iteration stage, we use the same Adam optimizer with a learning rate of 0.01 and a weight decay of $5 \times 10^{-5}$. The minibatch size is fixed to 512 for the citation networks but kept adaptive for the large-scale graphs, since experiments show a trade-off between efficiency and memory cost for different minibatch sizes in the latter case. The number of training epochs is set to 1000 with an early stopping criterion that stops classifier training and moves on to the next iteration. We set the maximum number of iterations to 100 for the citation datasets and 200 for the large-scale networks. As shown in the experiments, these numbers are large enough for GraphHop to converge. For the hyperparameters T, α, and β, we perform a grid search over the parameter space based on the validation results. The hyperparameter tuning ranges and their final values are listed in Tables 4.3 and 4.4, respectively. Note that hyperparameters are tuned for different label rates; we show their values for the largest label rates in Table 4.4. All experiments were conducted on a machine with an NVIDIA Tesla P100 GPU (16 GB memory), a 10-core Intel Xeon CPU (2.40 GHz), and 100 GB of RAM.

4.5.3 Performance Evaluation

Table 4.5: Classification accuracy (%) for three citation datasets with different label rates. The highest accuracy in each column is highlighted in bold and the top three are underlined.
                    Cora (labels per class)             CiteSeer (labels per class)         PubMed (labels per class)
                    1    2    4    8    16   20         1    2    4    8    16   20         1    2    4    8    16   20
LP                  51.5 56.0 61.5 63.4 65.8 67.3       30.1 33.6 38.2 40.6 43.4 44.8       55.7 58.8 62.7 64.4 65.8 66.4
LNP                 39.9 42.8 51.8 60.2 66.7 69.2       17.6 28.4 32.9 37.9 44.7 46.7       45.2 62.5 64.7 67.2 63.7 64.9
Special LP          50.5 51.1 60.8 64.4 68.4 70.2       20.4 34.3 34.1 41.0 46.2 47.6       62.3 66.5 66.5 66.5 67.5 69.6
Centered Kernel     25.0 26.2 38.9 51.4 50.2 53.2       28.6 31.8 29.1 36.3 42.1 38.5       41.7 52.9 46.1 49.0 49.9 51.0
WNLL                12.7 38.8 60.2 69.9 70.6 70.9       7.6  7.5  37.5 47.6 52.6 47.9       68.5 70.2 69.6 70.3 70.7 71.3
Poisson             12.7 44.2 61.2 69.5 70.4 72.3       7.6  7.5  37.3 47.8 54.6 49.2       67.2 68.7 66.4 65.7 71.3 72.2
DeepWalk            40.4 47.1 56.6 62.4 67.7 69.9       28.3 31.5 36.4 40.1 43.8 45.6       -    -    -    -    -    -
LINE                49.4 56.0 63.0 67.3 72.6 74.0       28.0 31.4 36.4 40.6 45.8 48.5       -    -    -    -    -    -
DGI                 55.3 63.1 71.8 74.5 77.2 77.9       46.1 52.7 61.7 65.6 68.2 68.7       68.3 68.4 69.9 71.2 74.8 76.7
Graph2Gauss         54.5 61.3 69.5 72.4 74.8 75.8       45.1 50.8 58.4 61.7 64.4 65.7       67.2 66.0 68.7 67.6 69.4 69.6
GCN                 42.4 52.0 65.0 72.2 78.4 80.2       36.4 43.4 53.9 60.4 67.5 68.8       41.3 48.1 59.3 67.4 74.5 77.8
GAT                 41.8 51.8 66.4 73.6 77.8 79.6       32.8 40.7 51.8 57.9 64.5 68.2       57.6 70.0 69.8 71.7 75.6 76.5
Co-training GCN     53.1 59.4 68.0 73.5 78.9 78.7       36.7 42.9 52.0 57.9 62.5 65.9       55.1 59.9 66.9 71.3 75.7 77.9
Self-training GCN   40.6 52.3 67.5 73.8 77.3 79.1       34.6 42.3 54.4 63.1 68.3 69.1       49.7 56.2 65.0 68.9 73.6 76.5
GraphHop            59.8 58.2 69.3 76.3 79.7 81.0       48.4 55.0 55.1 60.4 66.7 70.3       69.3 70.9 71.1 71.9 75.0 77.2

Table 4.6: Classification accuracy (%) for three large-scale graph datasets, where the column of labeled samples is measured in terms of percentages of the entire dataset and OOM means "out of memory".
             Reddit (% labeled)        Amazon2M (% labeled)      PPI (% labeled)
             1    2    5    10         1    2    3    5          1    2    5    10
FastGCN      78.6 80.2 86.7 87.7       OOM  OOM  OOM  OOM        -    -    -    -
Cluster-GCN  92.0 92.7 93.7 94.0       -    -    68.8 73.8       -    -    -    -
L-GCN        89.2 90.7 92.0 92.8       36.9 47.6 59.4 65.8       68.3 69.8 69.2 69.1
GraphHop     93.4 94.1 94.7 95.0       51.4 59.8 66.3 68.2       73.8 73.4 73.7 74.1

We conduct performance benchmarking between GraphHop and several state-of-the-art methods and compare their results for small- and large-scale datasets below.

Citation Networks

The state-of-the-art methods used for performance benchmarking are grouped into three categories:
• LP-based methods: LP [184], LNP [142], Special LP [102], Centered Kernel [91], WNLL [128], and Poisson [13].
• Unsupervised methods: DeepWalk [113], LINE [133], DGI [136], and Graph2Gauss [10].
• Semi-supervised methods: GCN [66], GAT [137], Co-training GCN, and Self-training GCN [78].

Results on the three citation datasets are summarized in Table 4.5, where label rates are chosen to be one, two, four, eight, 16, and 20 labeled nodes per class. Each column shows the classification accuracy (%) for GraphHop and the benchmarking methods under a given dataset and label rate. Overall, GraphHop performs the best, especially for cases with extremely small label rates. The reason is that its adoption of label aggregation and label update steps in the iteration stage yields a smooth distribution of label embeddings on the graph for prediction, which relies less on label supervision. Similarly, other embedding-based methods (e.g., DGI and Graph2Gauss) also outperform GCN variants for cases with very few labels, since their methods are designed to encode the graph structure into embeddings in an unsupervised way. When the label rate goes higher, GCN variants perform better than unsupervised models. Li et al. [78] and Sun et al. [131] showed the limitation of GCNs in few-label cases and proposed a co-training or self-training mechanism to handle this problem. Still, GraphHop outperforms their methods at various label rates. The limited performance of LP-based methods is due to their failure to exploit node attribute information effectively in model learning.

Large-Scale Graphs

To demonstrate the scalability of GraphHop, we apply it to three large-scale graph datasets. Only one-hop label embeddings are propagated to save time and memory at each iteration. We compare GraphHop with three state-of-the-art GCN variants designed for large-scale graphs: FastGCN [19], Cluster-GCN [23], and L-GCN [163]. We tried our best to adopt their released codes and follow the same settings as described in their papers for the three benchmarking methods. However, their original models deal with supervised learning in an inductive setting. For a fair comparison, we modify their model input to conduct the training on the entire graph in the transductive setting. The classification accuracy results for node label rates equal to 1%, 2%, 5%, and 10% of the entire graph are summarized in Table 4.6. For the Reddit dataset, we can get results for all methods. We see from the table that GraphHop performs the best and Cluster-GCN the second best in all cases. For the very large Amazon2M dataset, FastGCN exceeds the memory while Cluster-GCN fails to converge when the label rates are 1% and 2%. Furthermore, it takes Cluster-GCN a long time for one epoch of training, and we report its results after 20 epochs. There is a significant performance gap between GraphHop and L-GCN in all cases for Amazon2M.
For the PPI multi-label dataset, both FastGCN and Cluster-GCN fail to converge in the transductive semi-supervised setting. As the PPI label rate increases, the performance stays about the same. This is probably because a small label rate (less than 10%) is not large enough to cover all cases in a multi-label scenario.

4.5.4 Computational Complexity and Memory Requirement

Since GraphHop is an iterative algorithm, we study test accuracy curves as a function of the iteration number for all six datasets with different label rates in Fig. 4.4. We have two main observations. First, for the Cora, CiteSeer, and PubMed datasets, the test accuracy curves converge in about 10, 5, and 4 iterations, respectively. Although fewer labeled nodes tend to demand more iterations to achieve convergence, the impact on convergence behavior is minor, which is consistent with Corollary 1. Second, for Reddit and Amazon2M, the test accuracy curves achieve their peaks in a few iterations, drop, and converge. The latter is due to over-smoothening of label signals, as explained in Sec. 4.5.5. Repeatedly applying smoothening operators results in the convergence of feature vectors to the same values [78], which can be alleviated by residual connections [44]. In practice, we can find the optimal iteration number for each dataset by observing the convergence behavior based on the validation data.

Figure 4.4: GraphHop's test accuracy curves as a function of the iteration number under different label rates, where the label rate is expressed as the number and the percentage of labeled nodes per class for citation and large-scale graph datasets, respectively. (a) Cora dataset. (b) CiteSeer dataset. (c) PubMed dataset. (d) Reddit dataset. (e) Amazon2M dataset. (f) PPI dataset.

Furthermore, we need to train an LR classifier for label embedding initialization and propagation through iterative optimization. To examine this training behavior, we show the training loss curves of the LR classifiers as a function of the epoch number for all six datasets in Fig. 4.5. We see that all LR classifiers converge relatively fast. This is due to the initialization stage, where the smoothness of the label embeddings has been imposed consistently through the classifier predictions.

Figure 4.5: Training loss curves of the LR classifiers of GraphHop as a function of the epoch number for six datasets. (a) Cora dataset. (b) CiteSeer dataset. (c) PubMed dataset. (d) Reddit dataset. (e) Amazon2M dataset. (f) PPI dataset.

To demonstrate the efficiency and scalability of GraphHop, we compare the training time and memory usage of several methods in Table 4.7. Here, we focus on benchmarking models that can handle large-scale graphs, such as GraphSAGE [40], Cluster-GCN [23], L-GCN [163], and FastGCN [19]. For the smaller citation networks, we adopt their original codes and run them in the supervised setting. For the large-scale graphs (i.e., Reddit, PPI, and Amazon2M), we follow the process discussed earlier and set the label rate to the largest value. We measure the averaged running time per epoch (or per iteration for GraphHop) and the total training time in seconds. Early stopping is adopted, where we record the time when the performance on the validation set drops continuously for five iterations. For memory usage, we only consider the GPU memory, measured by torch.cuda.memory_allocated() for PyTorch and tf.contrib.memory_stats.MaxBytesInUse() for TensorFlow. Generally speaking, GraphHop can achieve fast training with low memory usage.
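For reproducibility of such measurements, a minimal sketch of one way to record per-epoch time and peak GPU memory in PyTorch is shown below. The thesis reports memory via torch.cuda.memory_allocated(); the use of reset_peak_memory_stats(), synchronize(), and perf_counter() here is our illustrative bookkeeping rather than the original measurement script, and it assumes a CUDA device is available.

```python
# Record wall-clock time and peak GPU memory for one training epoch/iteration.
import time
import torch

def timed_epoch(train_one_epoch):
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    train_one_epoch()                        # any callable that runs one epoch or iteration
    torch.cuda.synchronize()                 # make sure all GPU work has finished
    elapsed = time.perf_counter() - start
    peak_mb = torch.cuda.max_memory_allocated() / 2**20
    return elapsed, peak_mb
```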
Although L-GCN has the lowest memory usage, all of its parameters are fixed without validation applied (validation data are counted in the memory consumption for our method and the other baselines), so the comparison may not be entirely fair. To shed light on training complexity and memory usage, we plot the time complexity versus memory usage of different methods on Reddit in Fig. 4.6a based on the data in Table 4.7. The lower-left corner of this figure indicates the desired region with low training complexity and low GPU memory consumption. Furthermore, we can balance the training time and memory usage by changing the minibatch size, as shown in Fig. 4.6b for GraphHop. By increasing the minibatch size, the memory consumption increases in exchange for a lower training time.

Table 4.7: Comparison of training time and GPU memory usage. The averaged running time per epoch/iteration and the total running time are given outside and inside the parentheses, respectively.
          GraphSAGE            Cluster-GCN           L-GCN                FastGCN             GraphHop
          Time        Memory   Time         Memory   Time        Memory   Time        Memory  Time        Memory
Cora      0.033(7)s   902 MB   0.142(39)s   546 MB   0.004(0.7)s   3 MB   0.023(3)s   21 MB   0.010(36)s   26 MB
CiteSeer  0.059(12)s  2288 MB  0.175(56)s   723 MB   0.009(1.4)s   5 MB   0.055(4)s   74 MB   0.018(56)s   68 MB
PubMed    0.022(5)s   418 MB   0.483(148)s  808 MB   0.012(1.9)s   4 MB   0.214(7)s   81 MB   0.058(78)s    9 MB
Reddit    5.7(1135)s  1755 MB  3.3(483)s    186 MB   2.3(365)s     2 MB   13.7(469)s  1690 MB 11.3(237)s  104 MB
Amazon2M  44.6(913)s  2167 MB  1010(10251)s 73 MB    3.4(549)s     3 MB   OOM         OOM     69.3(762)s  302 MB
PPI       1.3(26)s    110 MB   -            -        0.099(16)s   14 MB   -           -       0.077(46)s   21 MB

Figure 4.6: (a) Comparison of time complexity vs. memory usage of different methods on Reddit, where the lower-left corner indicates the desired region that has low training complexity and low GPU memory consumption. (b) The trade-off between training time and memory usage obtained by varying the training minibatch size on Reddit for GraphHop, where the green curve indicates the training time and the orange curve indicates the memory usage.

4.5.5 Additional Observations

Ablation Study

Table 4.8: Comparison of the accuracy performance of GraphHop and its three variants.
            Cora  CiteSeer  PubMed
Variant I   74.9  64.0      75.3
Variant II  67.3  44.8      66.4
Variant III 79.3  68.6      76.7
GraphHop    81.0  70.3      77.2

To illustrate the effectiveness of GraphHop against the three weaknesses of traditional LP (i.e., failing to encode smoothened attribute information, lacking embedding parameter sharing, and propagating only one-hop neighbors), we explore three variants of GraphHop below.
• Variant I: GraphHop with the initialization stage only. It utilizes the center's and neighbors' node features for label prediction without any label embedding propagation.
• Variant II: GraphHop without the initialization stage and without LR classifiers between iterations, which is the same as vanilla LP in Eq. (4.3). It propagates label embeddings without leveraging any attribute information.
• Variant III: GraphHop with the initialization stage and vanilla LP, where the label embeddings in LP are initialized from the classifiers' predictions. It encodes smoothened attribute information with continuous label smoothening but without LR classifiers for parameter sharing.
We compare the test accuracy of all three variants against GraphHop under the same label rate in Table 4.8. Clearly, the collective model design of GraphHop outperforms the other three.
Note that Variant III is the most similar to GraphHop, and it yields the closest accuracy performance. Still, the LR classifiers that serve as smoothening regularization at each iteration can boost the performance. In addition, we show test accuracy curves as a function of the iteration number for Variant II and GraphHop in Fig. 4.7. We see that GraphHop achieves higher accuracy and converges faster than Variant II. These results indicate that a good initialization of label embeddings and multi-hop neighbor aggregation result in faster convergence.

Figure 4.7: The convergence curves of GraphHop and Variant II on the citation datasets, where the inner figures show the initial 10 iterations.

Figure 4.8: The convergence of GraphHop's test accuracy curves with (right) and without (left) residual connections for Reddit, where shaded areas indicate the standard deviation range.

Over-Smoothening Problem

We study the over-smoothening problem by examining the convergence behavior of GraphHop on Reddit in Fig. 4.8. The left subfigure shows the averaged convergence curves of GraphHop on Reddit at multiple label rates. The curves drop after five iterations and converge at around 50 iterations. We argue that this phenomenon is due to the over-smoothening of label embeddings. Generally speaking, correlations of embeddings are valid only locally. This is especially true for large-scale graphs. Adding uncorrelated information from long-distance hops tends to have a negative impact. To verify this claim, we conduct experiments on a variant of GraphHop whose label embeddings are updated in the form of
$$ \mathbf{H}^{(t)} = (1-\tau)\,\frac{1}{K}\sum_{M=1}^{K} p_{model}\big(Y \mid \mathbf{H}_M^{(t-1)}; \theta\big) + \tau\,\mathbf{H}^{(t-1)}, \quad (4.15) $$
where the parameter $\tau \in (0, 1)$ is used to control the update speed. Eq. (4.15) is also known as a residual connection [44]. A large $\tau$ value enables the model to preserve more information from the previous iteration and slows down the smoothening speed. We report the results with $\tau = 0.9$ in the right subfigure while keeping the other settings the same as in the left one. We observe more stable curves with slower performance degradation.

Fast Convergence

We explain why only a few iterations can achieve convergence. The reason is that a small number of iterations already reaches sufficiency in the sense of Corollary 1. Without loss of generality, we choose the Cora and CiteSeer datasets for illustration. We calculate the number of nodes in each subset $\mathcal{V}_i$ in Eq. (4.11) up to $k$ hops. The cumulative results are shown in Fig. 4.9. Both subfigures indicate that all nodes in the graph can be reached from labeled samples within fewer than ten hops. Fast propagation of embeddings yields efficient label smoothening and fast convergence. Also, we see from Fig. 4.9 that higher label rates reach sufficiency in fewer iterations than lower label rates. Besides, the label update step also contributes to faster convergence due to the further smoothening of the embeddings.

Figure 4.9: Plots of the cumulative percentages of nodes in the subset $\mathcal{V}_1 \cup \mathcal{V}_2 \cup \ldots \cup \mathcal{V}_k$, where different colors indicate different label rates.
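The cumulative coverage in Fig. 4.9 can be computed with a multi-source breadth-first search from the labeled set, as in the small sketch below (our own illustration; `adj_list` is assumed to be a dict mapping each node to its neighbor list).

```python
# Fraction of nodes within k hops of the labeled set, i.e., |V_1 ∪ ... ∪ V_k| / n as in Fig. 4.9.
from collections import deque

def cumulative_coverage(adj_list, labeled, num_nodes, max_hops=10):
    dist = {v: 0 for v in labeled}
    queue = deque(labeled)
    while queue:                                   # multi-source BFS from all labeled seeds
        u = queue.popleft()
        for v in adj_list[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [sum(d <= k for d in dist.values()) / num_nodes for k in range(max_hops + 1)]
```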
4.6 Comments on Related Work

Graph-based Semi-supervised Learning. There is rich literature on semi-supervised learning [15], including generative models [64], the transductive support vector machine [8], entropy regularization [38], manifold learning [6], and graph-based methods [35, 60, 184]. Our discussion is restricted to graph-related work. Most semi-supervised graph-based methods are built on the manifold assumption [15], where nearby nodes are close in the data manifold and, as a result, tend to have the same labels. Early research penalizes non-smoothness along the edges of a graph with the Markov random field [189], Laplacian eigenmaps [5], spectral kernels [17], and context-based methods [39, 113]. Their main difference lies in the choice of regularization. The quadratic penalty term applied to nearby nodes to enforce label consistency with the data geometry is the most popular one. The optimization result is shown to be equivalent to LP [16]. Traditional graph-based methods are non-parametric, discriminative, and transductive, making them lightweight with good classification performance. To further improve the performance, methods have been developed by combining graph-based regularization with other entities to yield one joint learning framework. Instead of constructing the graph in advance, the adjacency weights can be learned adaptively through the optimization process [61, 171, 180, 181]. Rather than regularizing label embeddings, this idea can be extended to attributes [147] and even to hidden layers or auxiliary embeddings in neural networks. Manifold regularization [6] and Planetoid [157] generalize the Laplacian regularizer with a supervised classifier that imposes stronger constraints on model learning. The work in [60] tries to generalize neural networks to transductive learning with the help of LP. However, its focus is mainly on image classification instead of node classification on graphs.

Graph Convolutional Networks. Inspired by the recent success of convolutional neural networks (CNNs) [73, 74, 119, 141] on images and videos, a series of efforts have been made to generalize convolutional filters from grid-structured domains to non-Euclidean domains [2, 12, 46] with theoretical support from graph signal processing [129]. The space spanned by the eigenvectors of the graph Laplacian can be regarded as a generalization of the Fourier basis. By following this idea, a deep neural architecture was formulated in [12, 46] to employ the Fourier transform as a projection onto the eigenbasis of the graph Laplacian. Furthermore, to overcome the expensive eigendecomposition, recurrent Chebyshev polynomials were proposed in [26] as an efficient filter approximation. GCN [66] further simplified it by only considering the first-order approximation of the Chebyshev polynomials. GCN has inspired quite a lot of follow-up work, e.g., [43, 136, 149].

With the combination of embedding propagation and non-linear activation, GCNs offer impressive results on the semi-supervised classification problem. Later, it was explained in [78, 79, 107, 148] that the success of GCNs is due to a low-pass filtering operation performed on node attributes. Specifically, [79, 107, 148] have shown that the powerful feature extraction ability behind the graph convolutional operation in GCNs comes from a low-pass filter applied to the feature matrix to extract only smooth signals for prediction. This simplified graph convolution can be formulated as an LR classifier on the aggregated features [148]. That is, we have
$$ \mathbf{H} = p_{model}(Y \mid \mathbf{S}^k \mathbf{X}; \theta), \quad (4.16) $$
where $p_{model}(\cdot)$ is the LR classifier, $\mathbf{S}^k$ is the $k$-th power of the normalized adjacency matrix $\mathbf{S}$, $\theta$ denotes the classifier parameters, and $\mathbf{H}$ is the label embedding output as defined previously. This sheds light on a simple filter design on the graph feature matrix.
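A compact sketch of the simplified graph convolution in Eq. (4.16) is given below; it is our own illustration of the idea in [148], not the reference implementation. `s_norm` is assumed to be a normalized scipy.sparse adjacency matrix, `feats` an n x d NumPy array, and `labels`/`labeled_idx` the class labels and labeled-node indices.

```python
# Simplified graph convolution of Eq. (4.16): S^k X followed by a logistic-regression classifier.
from sklearn.linear_model import LogisticRegression

def simplified_gcn(s_norm, feats, labels, labeled_idx, k=2):
    x = feats
    for _ in range(k):
        x = s_norm @ x                                   # apply the k-th power of S to the features
    clf = LogisticRegression(max_iter=1000).fit(x[labeled_idx], labels[labeled_idx])
    return clf.predict_proba(x)                          # label embeddings H for all nodes
```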
However, the intensive computation of matrix powers restricts the long-range correlations that can be exploited from each node. Besides, there is no restriction on the smoothness of the predicted label embeddings, which limits the generalization to extremely small label rate cases.

Despite the strength of GCNs, they are still limited in three aspects: 1) effective joint modeling of node attributes and labels; 2) ease of scalability to large graphs; 3) the requirement of a considerable amount of labels for training. For the first point, unlabeled samples are not integrated into model training but only into inference. Several algorithms have been proposed to tackle this deficiency. Zhang et al. [178] employed a Bayesian approach by modeling the graph structure, node attributes, and labels as a joint probability and inferring the unlabeled samples by calculating the posterior distribution. On the other hand, Qu et al. [116] used the conditional random field to embed the correlation between labeled and unlabeled samples, with GCNs for feature extraction. In practice, these methods are costly and hampered by local minima during optimization. Another line of methods employs self-training techniques to generate pseudo-labels for unlabeled samples and use them throughout training [78, 131, 164]. However, they do not utilize the correlation between labeled and unlabeled samples effectively and often suffer from label error feedback [75].

For the second point, the main issue of GCNs is the need to load the entire graph and intermediate node embeddings into memory, which makes the generalization to large graphs especially difficult. Unlike images in computer vision or sentences in natural language processing, a single graph can be very large while its nodes are connected without natural segmentation. The layer-wise convolutional operation introduces an exponential expansion of neighborhood sizes [23], which hinders GCNs from minibatch training. Sampling-based strategies (e.g., GraphSAGE [40] and FastGCN [19]) have been proposed to overcome this problem. They attempt to reduce the neighborhood size during aggregation. Alternatively, some methods [23, 168] directly sample one or more subgraphs and perform subgraph-level training. Recently, You et al. [163] proposed a layer-wise training algorithm for GCNs, called L-GCN. The idea is that, instead of training multiple GCN layers at once, the gradient update and parameter convergence are performed in a layer-wise fashion. Nevertheless, L-GCN requires a large amount of training data (i.e., heavily supervised learning), and its performance degrades dramatically when only a few labeled samples are available in training.

The limitations from the first and second points, together with the nonlinearity, result in the last aspect, i.e., the requirement of a considerable number of training labels. The large number of parameters and non-linear activation terms induce GCNs to rely heavily on label supervision, where the joint dependence of node labels is often ignored. These weaken GCNs' performance at extremely small label rates.

4.7 Applications and Improvements

Since GraphHop is an enhancement of the classical LP, any application that can be formulated as graph-based semi-supervised classification could be considered. For example, the underlying manifold in image classification can be formulated as a graph, where labels are smoothly distributed. Then, GraphHop can be applied with only a few labeled images. GraphHop can also be applied to other vision tasks, e.g., face recognition [180], object recognition [171], video semantic recognition [87], and human activity recognition [20].
The effective encoding of node attributes in model learning makes GraphHop suitable for data with ample or high-quality features. This is confirmed by the high accuracy results on networks (e.g., citation networks, co-purchasing networks, and social networks) with rich attribute information. Another application is classification on large-scale graphs. Real-world networks keep getting bigger. GraphHop can be directly applied to large-scale graphs with low memory cost, fast running time, and high performance.

Although GraphHop achieves excellent performance in the node classification task, it could be further improved in several aspects. First, it adopts logistic regression classifiers to learn the mapping between the smoothened node attributes and the labels in the initialization stage, and to regularize the aggregated neighborhood label embeddings together with the node's own embedding in the label update step. Logistic regression is a linear classifier that may degrade the performance for nonlinear mappings. We may use a kernel in logistic regression or adopt nonlinear classifiers. Second, GraphHop is applied to transductive semi-supervised learning. It is desirable to extend it to inductive learning. Third, the two main operations in GraphHop (i.e., aggregation and classifier training) attempt to smoothen the attribute and label signals. This is feasible on graphs with smooth signals. However, the smoothness assumption may not hold for networks of low homophily or heterophily [187]. For example, people of the opposite gender are more likely to connect in dating networks, where high-frequency signals (i.e., the differences between nodes) could be more relevant.

4.8 Conclusion

A novel iterative LP-like method, called GraphHop, was proposed for transductive semi-supervised node classification on graph-structured data. The main ingredients contributing to the success of GraphHop are: 1) jointly modeling the smoothened node attribute and label signals on graphs; 2) introducing regression classifiers in each iteration to regularize the label embeddings in a smoothening way; 3) treating multi-hop neighbors independently and then aggregating them during the iteration process. They collaboratively lead to higher classification accuracy than state-of-the-art GCN algorithms. GraphHop scales well to large graphs and to extremely small label rates. Theoretical derivation and extensive experimental results were provided to demonstrate GraphHop's efficiency and effectiveness. In the future, we will extend GraphHop to more challenging tasks (e.g., inductive learning) and derive some deeper understanding from different angles (e.g., the regularization framework).

Chapter 5
New Insights into GraphHop and Its Enhancement

In this chapter, we describe our final contribution to the dissertation: a deeper mathematical understanding of the GraphHop method proposed in the previous chapter. We show that the iteration stage in GraphHop provides an alternate optimization of a specific regularization problem defined on graphs. In particular, the label update step can be analogized to one optimization subproblem of classifier training, and the label aggregation step is associated with the alternate subproblem of label embedding optimization. Based on this interpretation, we propose two enhancements to GraphHop from these two optimization subproblems, respectively, resulting in the GraphHop++ method. Specifically, the first improvement is selecting reliable unlabeled samples for classifier training instead of the whole unlabeled set.
The second enhancement introduces an iteration process to better approximate the optimal label embedding solution. Experimental results show that GraphHop++ outperforms both GraphHop and other benchmarking methods by a large margin at extremely low label rates. In addition, we apply our method to an object recognition task, where state-of-the-art results are also achieved.

5.1 Introduction

Semi-supervised learning that exploits both labeled and unlabeled data in learning tasks is highly in demand in real-world applications due to the expensive cost of data labeling and the availability of a large amount of unlabeled samples. For graph problems where very few labels are available, the geometric or manifold structure of unlabeled data can be leveraged to improve the performance of classification, regression, and clustering algorithms. Graph convolutional networks (GCNs) have been accepted by many as the de facto tool in addressing graph semi-supervised learning problems [40, 66, 130]. Simply speaking, based on the input feature space, each layer of a GCN applies transformation, propagation through the graph, and nonlinear activation to node embeddings. The GCN parameters are learned via label supervision through backpropagation [120]. The joint attribute encoding and propagation, acting as a smoothening regularization, enable GCNs to produce prominent performance on various real-world networks.

There are, however, remaining challenges in GCN-based semi-supervised learning. First, GCNs require a sufficient number of labeled samples for training. Instead of deriving the label embeddings from the graph regularization as traditional methods do [184, 190], GCNs need to learn a series of projections from the input feature space to the label space. These embedding transformations largely depend on ample labeled samples for supervision. Besides, the nonlinear activation prevents GCNs from having closed-form solutions. A lack of sufficient labeled samples may make training unstable (or even divergent). To improve this, one may integrate other semi-supervised learning techniques (e.g., self- and co-training [78]) with GCN training or enhance the filter power to strengthen the regularization effect [79]. They are nevertheless restricted by internal deficiencies of GCNs. Second, GCNs usually consist of two convolutional layers and, as a result, only the information from a small neighborhood of each node is exploited [1, 40, 66, 137]. The learning ability is handicapped by ignoring correlations from nodes of longer distance. Although increasing the number of layers could be a remedy, this often leads to an oversmoothing problem and results in inseparable node representations [78, 167]. Furthermore, a deeper model has more parameters to train, which requires even more labeled samples.
Inspired by the traditional PageRank [110] and label propagation (LP) algorithms [184, 190], some work discards feature transformations at every layer but maintains a few with additional embedding propagation [56, 68, 148, 186]. This augments the regularization strength by capturing longer-distance relationships and reduces the number of training parameters for label efficiency. Nonetheless, labels still serve as guidance for model parameter training rather than as direct supervision in the label embedding space. As a result, these methods still suffer from unstable training and a lack of supervision at very low label rates. Along this research idea, the C&S method [56] applied the optimization procedure to the label space directly with the same propagation as given in [184]. Yet, it was originally designed for supervised learning, and its performance deteriorates significantly if the number of labeled samples decreases. Also, a simple propagation strategy in Laplacian learning may suffer from the degeneracy issue; namely, the solution is localized as spikes near the labeled samples and almost constant far from them [13, 92, 128]. This is caused by a naive propagation procedure where the information cannot be carried over longer distances.

An enhancement of traditional label propagation, called GraphHop, was recently proposed in [151]. GraphHop achieves state-of-the-art performance as compared with GCN-based methods [66, 78, 137] and other classical propagation-based algorithms [102, 142, 184]. It performs particularly well at extremely low label rates. Unlike GCNs, which integrate node attributes and smoothening regularization into one end-to-end training system, GraphHop adopts a two-stage learning mechanism. Its initialization stage extracts smoothened node attributes and exploits them to predict the label distribution of each node through logistic regression (LR) classifiers. The predicted label embeddings can be viewed as signals on graphs, and additional smoothening steps are applied in the subsequent iteration stage. The iteration stage consists of two steps: 1) label aggregation and 2) label update. In the first step, label embeddings are propagated and aggregated, which corresponds to low-pass filtering in the label space. To address ineffective propagation at extremely low label rates, LR classifiers are introduced in the second step. They are trained based on labels and pseudo-labels of a local neighborhood and used to infer label embeddings in the following iterations. The classifier training enables parameter sharing among the label embeddings of all nodes. It enhances the information passing over longer distances effectively. Although the superior performance of GraphHop was explained from the signal processing viewpoint (e.g., low-pass filtering on both attribute and label signals) in [151], its mathematical treatment was not rigorous.

In this work, we attempt to analyze the superior performance of GraphHop from a regularized optimization viewpoint, which is more rigorous and transparent mathematically. The iteration stage of GraphHop will be interpreted as an alternately optimized solution to a variational problem under the regularization framework. The label aggregation and label update steps lead to an alternate optimization process, each of which solves one of two convex transformed subproblems under probability constraints. Based on the variational interpretation, we get two ideas to improve GraphHop and propose its enhanced version called GraphHop++. We conduct extensive experiments on GraphHop, GraphHop++, and many benchmarking methods and observe that GraphHop++ outperforms all other methods (including GraphHop) consistently on all test datasets at all low label rates. The contributions of this work are summarized below.
1. We analyze GraphHop theoretically from a variational viewpoint and show that it corresponds to an alternate optimization process that provides a solution to a constrained optimization problem.
2. Based on the theoretical analysis, we propose two enhancements of GraphHop, which lead to an even more powerful semi-supervised solution called GraphHop++.
3.
We demonstrate the effectiveness and efficiency of GraphHop++ with extensive experiments on five commonly used datasets as well as an object recognition task at extremely low label rates (i.e., 1, 2, 4, 8, 16, and 20 labeled samples per class).

The rest of this chapter is organized as follows. The problem definition and some background information are stated in Sec. 5.2. A constrained optimization framework is set up and its connection with GraphHop is built in Sec. 5.3. The GraphHop++ model is proposed in Sec. 5.4. Extensive experiments are conducted for performance benchmarking in Sec. 5.5. Comments on related work are provided in Sec. 5.6. Finally, concluding remarks are given in Sec. 5.7.

5.2 Preliminaries

5.2.1 Notations and Problem Statement

To begin with, we define some notations used throughout this chapter and give the definition of the investigated problem. An undirected graph is given by $G = (V, E)$, where $V$ is the node set with $|V| = n$ and $E$ is the edge set. The graph structure can be described by the adjacency matrix $A \in \mathbb{R}^{n\times n}$, where $A_{ij} = 1$ if nodes $i$ and $j$ are linked by an edge in $G$; otherwise, $A_{ij} = 0$. For $A$, its diagonal degree matrix is $D_{ii} = \sum_{j=1}^{n} A_{ij}$ and its graph Laplacian matrix is $L = D - A$. In an attributed graph, each node is associated with a $d$-dimensional feature vector $x_i \in \mathbb{R}^d$, and the feature matrix of all nodes is $X \in \mathbb{R}^{n\times d}$. In a transductive semi-supervised classification problem, the nodes in $V$ can be divided into a set of labeled nodes $\mathcal{L}$ with $l$ samples and a set of unlabeled nodes $\mathcal{U}$ with $u$ samples, where $n = l + u$. Let $Y = (y_1^T, \cdots, y_l^T, y_{l+1}^T, \cdots, y_n^T)^T \in \mathbb{R}^{n\times c}$ denote the initial labels of all samples, where $y_i \in \mathbb{R}^c$ is a row vector and $c$ is the number of classes. If labeled node $i$ belongs to class $j$, $y_{ij} = 1$; otherwise, $y_{ij} = 0$. To record the predicted label distribution of each node, the label embedding matrix is defined as $F = (f_1^T, \cdots, f_l^T, f_{l+1}^T, \cdots, f_n^T)^T \in \mathbb{R}^{n\times c}$ with probability entries, where the $i$th row vector satisfies $\sum_{j=1}^{c} f_{ij} = 1$ and $f_{ij}$ is the probability of node $i$ belonging to class $j$. The classes of unlabeled nodes can be assigned as $y_i = \arg\max_{1\le j\le c} f_{ij}$. Given graph $G$, node attributes $X$, and labeled samples $Y$, the objective is to infer the labels of unlabeled nodes under the condition $l \ll u$.

5.2.2 Transductive Label Propagation

We introduce the general regularization framework for transductive LP algorithms below. Formally, it can be defined as a problem that minimizes the objective function

$$ \min_{F} \ \mathrm{tr}(F^T \tilde{L} F) + \mathrm{tr}\big((F - Y)^T U (F - Y)\big), \quad (5.1) $$

where $\mathrm{tr}(\cdot)$ is the trace operator, $\tilde{L}$ is the random-walk (or symmetric) normalized graph Laplacian matrix, and $U = \mathrm{diag}(u_1, \cdots, u_n)$ is a positive weighting matrix that balances the manifold smoothness and label fitness. For generality, we relax the probability constraints on the label embeddings $F$. The closed-form solution of Eq. (5.1) can be derived as

$$ F^{*} = (U + \tilde{L})^{-1} U Y. \quad (5.2) $$

This result can also be obtained from an iteration process named LP. At each iteration, the label embedding information of each node is partially propagated from its neighbors and partially obtained from its initial label. Formally, at iteration $t$, the propagation can be described as

$$ F^{(t)} = \alpha \tilde{A} F^{(t-1)} + (1 - \alpha) Y, \quad (5.3) $$

where $\alpha \in (0, 1)$ is a weighting hyper-parameter and $\tilde{A}$ is the random-walk (or symmetric) normalized adjacency matrix. The converged label embedding matrix can be obtained by taking the limit of iterations as

$$ F^{*} = (1 - \alpha)(I - \alpha \tilde{A})^{-1} Y, \quad (5.4) $$

which is the same as Eq. (5.2) by setting $U = \mathrm{diag}\big(\tfrac{1-\alpha}{\alpha}, \cdots, \tfrac{1-\alpha}{\alpha}\big)$.
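For illustration, the following minimal sketch implements the LP iteration of Eq. (5.3) with NumPy/SciPy under the assumption of a sparse adjacency matrix A and a one-hot label matrix Y; the function name label_propagation and the convergence tolerance are illustrative choices rather than part of the original formulation. The loop converges to the closed-form solution in Eq. (5.4).

import numpy as np
import scipy.sparse as sp

def label_propagation(A, Y, alpha=0.9, num_iter=100, tol=1e-6):
    """Iterate Eq. (5.3): F <- alpha * A_tilde F + (1 - alpha) * Y."""
    # Row-normalize the adjacency matrix (random-walk normalization).
    deg = np.asarray(A.sum(axis=1)).ravel()
    D_inv = sp.diags(1.0 / np.maximum(deg, 1e-12))
    A_tilde = D_inv @ A
    F = Y.astype(float).copy()
    for _ in range(num_iter):
        F_new = alpha * (A_tilde @ F) + (1.0 - alpha) * Y
        if np.abs(F_new - F).max() < tol:   # close to the fixed point of Eq. (5.4)
            F = F_new
            break
        F = F_new
    return F

In practice, the iterative form is preferred over the matrix inverse in Eq. (5.4) since it only requires sparse matrix-vector products.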
5.2.3 GraphHop

The recently proposed GraphHop model improves traditional LP methods and offers state-of-the-art performance on several datasets. There are two training stages in GraphHop (i.e., the initialization stage and the iteration stage). They are used to capture the smooth node attributes and label embeddings, respectively, as explained below.

In the initialization stage, smoothened attribute signals are extracted by aggregating the neighborhood of each node. Formally, this can be expressed as

$$ X_M = \big\Vert_{0 \le m \le M} \tilde{A}^m X, \quad (5.5) $$

where $\Vert$ denotes column-wise concatenation, $\tilde{A}^m$ is the random-walk normalized $m$-hop adjacency matrix ($m = 0$ indicates the attribute of each node itself), and $X_M \in \mathbb{R}^{n\times d(M+1)}$ is the smoothened attribute matrix. After that, a logistic regression (LR) classifier is adopted for training with labeled samples as supervision. The minimization of the cross-entropy loss can be written as

$$ \min_{W} \ -\frac{1}{l} \sum_{i=1}^{l} y_i \log\big(\sigma(W x_{M,i}^T)\big), \quad (5.6) $$

where $x_{M,i}$ is the $i$th row of $X_M$, $W \in \mathbb{R}^{c\times d(M+1)}$ is the parameter matrix, and $\sigma(z)_i = e^{z_i} / \sum_{j=1}^{c} e^{z_j}$ is the softmax function. The solution to Eq. (5.6) can be derived by any optimization algorithm (e.g., stochastic gradient descent). Once the classifier converges, the label embeddings of all nodes can be predicted. They serve as the initial embeddings for the subsequent iteration stage. Formally, this can be written as

$$ F^{(0)} = F_{\mathrm{init}} = \sigma(X_M W^T), \quad (5.7) $$

where the softmax function, $\sigma(z)$, is applied in a row-wise fashion. Note that, in the original GraphHop model, distinct classifiers are trained independently w.r.t. different hops of aggregation and the final result is the average of all predictions. For simplicity, we only consider one classifier in the variational derivation. The same applies to the iteration stage.

In the iteration stage, there are two steps, called label aggregation and label update, used for label embedding propagation. In the label aggregation step, the same aggregation as given in Eq. (5.5) is conducted on label embeddings. Formally, at iteration $t$, we have

$$ F_M^{(t-1)} = \big\Vert_{0 \le m \le M} \tilde{A}^m F^{(t-1)}, \quad (5.8) $$

where $F_M^{(t-1)} \in \mathbb{R}^{n\times c(M+1)}$ is the aggregated neighborhood embedding matrix. In the aggregation of Eq. (5.8), the embedding parameters are not shared between nodes, i.e., each label embedding is independently derived from its neighborhood. This leads to deficient information passing and deteriorates the performance, especially at low label rates. In the label update step, GraphHop trains an LR classifier on the aggregated embedding space and uses the inferred label embeddings in the next iteration to overcome the shortcoming mentioned above. The minimization problem of the LR classifier can be stated as

$$ \min_{W} \ -\sum_{i=1}^{l} y_i \log\big(\sigma(W f_{M,i}^{(t-1)T})\big) - \alpha \sum_{i=l+1}^{n} \mathrm{Sharpen}(f_i^{(t-1)}) \log\big(\sigma(W f_{M,i}^{(t-1)T})\big), \quad (5.9) $$

where $f_{M,i}^{(t-1)}$ (resp. $f_i^{(t-1)}$) is the $i$th row of $F_M^{(t-1)}$ (resp. $F^{(t-1)}$), $W \in \mathbb{R}^{c\times c(M+1)}$ is the parameter matrix, $\alpha$ is a weighting hyper-parameter, and

$$ \mathrm{Sharpen}(z)_i = z_i^{1/T} \Big/ \sum_{j=1}^{c} z_j^{1/T} $$

is the sharpening function that controls the confidence of the current label embedding distributions. The first term in the objective function is contributed by labeled samples, while the second term is from unlabeled samples, whose supervision comes from pseudo-labels generated by the classifier.
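To make the two building blocks of the iteration stage concrete, the sketch below implements the multi-hop aggregation of Eqs. (5.5) and (5.8) and the sharpening function used in Eq. (5.9) with NumPy. The helper names (hop_aggregate, sharpen) and the default temperature are illustrative assumptions and are not taken from the released GraphHop code.

import numpy as np

def hop_aggregate(A_tilde, S, M=2):
    """Column-wise concatenation of A_tilde^m S for m = 0..M.
    S is the attribute matrix X in Eq. (5.5) or the label embedding matrix F in Eq. (5.8)."""
    blocks, cur = [S], S
    for _ in range(M):
        cur = A_tilde @ cur          # one more hop of random-walk averaging
        blocks.append(cur)
    return np.concatenate(blocks, axis=1)

def sharpen(F, T=0.5):
    """Temperature sharpening of row-wise label distributions, as in Eq. (5.9)."""
    P = np.power(np.clip(F, 1e-12, None), 1.0 / T)
    return P / P.sum(axis=1, keepdims=True)

A smaller temperature T pushes each row of F closer to a one-hot vector, i.e., it raises the confidence of the pseudo-labels used as supervision.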
Note that we can view the classifier training as a smoothening operation on label signals through sharing the same classifier parameters. We can also regard the cross-entropy minimization in Eq. (5.9) as minimizing the KL-divergence between two distributions. Simply speaking, GraphHop uses the LR classifier to map the neighborhood label distribution to each node itself so as to make the inference locally similar. When the iterative optimization process converges, the inferred label embeddings at iteration $t$ are

$$ F^{(t)} = \sigma(F_M^{(t-1)} W^T). \quad (5.10) $$

They serve as the input to the next iteration through Eq. (5.8). In the entire process of GraphHop, each label embedding is a probability vector.

In summary, the smoothened attributes on graphs are extracted and used for label distribution prediction in the initialization stage. These label embeddings are further smoothened by aggregation and classifier training in the iteration stage and used for the final inference.

5.3 Understanding GraphHop via A Regularization Framework

The superior performance of GraphHop was explained by joint smoothening of node attributes and label embeddings through propagation and classifier training in the former section. Here, we interpret the smoothening process in GraphHop using a regularization framework and show that it corresponds to an alternate optimization process of a certain objective function with probability constraints.

5.3.1 Constrained Optimization Framework

The initialization and iteration stages of GraphHop actually share a similar procedure. The main iterative process arises in the iteration stage. In this subsection, we derive a variational interpretation for such an iterative process. To begin with, we set up a regularized optimization problem:

$$ \begin{aligned} \min_{F, W} \ & \mathrm{tr}(F^T \tilde{L} F) + \mathrm{tr}\big((F - F_{\mathrm{init}})^T U (F - F_{\mathrm{init}})\big) + \mathrm{tr}\big((F - \sigma(F_M W^T))^T U_\alpha (F - \sigma(F_M W^T))\big) \\ \text{s.t.} \ & F_M = \big\Vert_{0 \le m \le M} \tilde{A}^m F, \quad F \mathbf{1}_c = \mathbf{1}_n, \quad F \ge 0, \end{aligned} \quad (5.11) $$

where $\tilde{L} = D^{-1} L$ is the random-walk normalized graph Laplacian matrix, $F_{\mathrm{init}}$ is the initialized label embedding matrix from Eq. (5.7), $\sigma(z)$ is the softmax function applied row-wisely, $U \in \mathbb{R}^{n\times n}$ and $U_\alpha \in \mathbb{R}^{n\times n}$ are diagonal hyper-parameter matrices with nonnegative entries, $\mathbf{1}_c \in \mathbb{R}^c$ and $\mathbf{1}_n \in \mathbb{R}^n$ are column vectors of ones, and $W \in \mathbb{R}^{c\times c(M+1)}$ is the parameter matrix. The last two constraints make each label embedding a probability vector. The cost function in Eq. (5.11) consists of three terms. The first two give the objective function of the general regularization framework in Eq. (5.1). The third term is the Frobenius norm of $F - \sigma(F_M W^T)$ if $U_\alpha$ is the identity matrix. It can be viewed as further regularization. The functionality of this extra regularization term can be seen as forcing the label embeddings to be close to the classifier predictions based on the neighborhood representations. Note that we have replaced matrix $Y$ with the initialized label embeddings, $F_{\mathrm{init}}$, derived from Eq. (5.7) in the cost function.

5.3.2 Alternate Optimization

We present the optimization procedure for the solution of Eq. (5.11) and discuss its relationship with GraphHop. Due to the introduction of the variable $W$ in the extra regularization term, the optimization of the two variables, $F$ and $W$, depends on each other. The problem cannot be solved directly but alternately. We propose an alternate optimization strategy that updates one variable at a time. This strategy has two subproblems, each of which is a nonconvex optimization problem. With approximation and convex transformation, we show that they correspond to the label aggregation and the label update steps in the iteration stage of GraphHop, respectively.
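As a numerical sanity check of the formulation, the following sketch evaluates the three terms of the cost in Eq. (5.11) for given F and W. It assumes dense NumPy label embeddings and (possibly sparse) normalized adjacency and Laplacian matrices; the function and argument names are hypothetical and chosen only for illustration.

import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def objective_5_11(F, W, A_tilde, L_tilde, F_init, U, U_alpha, M=2):
    """Evaluate the three regularization terms of Eq. (5.11) for given F and W."""
    # F_M = [F, A_tilde F, ..., A_tilde^M F] (column-wise concatenation).
    blocks, cur = [F], F
    for _ in range(M):
        cur = A_tilde @ cur
        blocks.append(cur)
    F_M = np.concatenate(blocks, axis=1)
    pred = softmax_rows(F_M @ W.T)                          # classifier predictions
    smooth = np.trace(F.T @ (L_tilde @ F))                  # manifold smoothness
    fit = np.trace((F - F_init).T @ (U @ (F - F_init)))     # fit to the initial embeddings
    reg = np.trace((F - pred).T @ (U_alpha @ (F - pred)))   # closeness to the predictions
    return smooth + fit + reg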
Update W with Fixed F

The classifier parameters, $W$, only exist in the extra regularization term of Eq. (5.11). Thus, the optimization problem is

$$ \min_{W} \ \mathrm{tr}\big((F - \sigma(F_M W^T))^T U_\alpha (F - \sigma(F_M W^T))\big). \quad (5.12) $$

By setting $U_\alpha = \mathrm{diag}(u_{\alpha,1}, \cdots, u_{\alpha,n})$, Eq. (5.12) can be written as

$$ \min_{W} \ \sum_{i=1}^{n} u_{\alpha,i} \, \big\Vert f_i^T - \sigma(W f_{M,i}^T) \big\Vert_2^2, \quad (5.13) $$

which is a weighted sum of the $\ell_2$-norms of the differences between each label embedding and the corresponding classifier prediction. It is a nonconvex problem due to the softmax function. To address it, we consider the following transformation. Since $f_i$ and $\sigma(W f_{M,i}^T)$ are both probability vectors, it is more appropriate to use the KL-divergence to measure their distance. Then, by converting the $\ell_2$-norm to the KL-divergence and only considering the term w.r.t. parameter $W$, we are led to the cross-entropy loss minimization

$$ \min_{W} \ -\sum_{i=1}^{n} u_{\alpha,i} \, f_i \log\big(\sigma(W f_{M,i}^T)\big). \quad (5.14) $$

Note that the optimized solutions to Eq. (5.13) and Eq. (5.14) are identical. Besides, we have the following theorem.

Theorem 4. The optimization problem in Eq. (5.14) is a convex optimization problem.

The proof is given in the Appendix.

We can rewrite the cost function as contributions from labeled and unlabeled samples in the form of

$$ -\sum_{i=1}^{l} u_{\alpha,i} \, f_i \log\big(\sigma(W f_{M,i}^T)\big) - \sum_{i=l+1}^{n} u_{\alpha,i} \, f_i \log\big(\sigma(W f_{M,i}^T)\big). \quad (5.15) $$

Instead of adopting the current label embedding (i.e., $f_i$) as supervision for the distribution mapping directly, two improvements can be made to the labeled and unlabeled terms, respectively. For the labeled term, we can replace the label embeddings with ground-truth labels (i.e., $y_i$) for supervision. For the unlabeled term, since there is no ground-truth label and the current supervision is adopted from the former iteration (i.e., pseudo-labels), a sharpening function can be used to control the confidence of the present pseudo-labels for supervision. Also, by simplifying the weighting hyper-parameter $u_{\alpha,i}$ to the same value (i.e., 1 for labeled data and $\alpha$ for unlabeled data), Eq. (5.15) becomes

$$ \min_{W} \ -\sum_{i=1}^{l} y_i \log\big(\sigma(W f_{M,i}^T)\big) - \alpha \sum_{i=l+1}^{n} \mathrm{Sharpen}(f_i) \log\big(\sigma(W f_{M,i}^T)\big), \quad (5.16) $$

which is exactly the objective function in the label update step of GraphHop as given by Eq. (5.9). Eq. (5.16) is a convex function, and its optimum can be obtained by a standard optimization procedure.

Update F with Fixed W

With fixed classifier parameters $W$, the optimization problem in Eq. (5.11) becomes

$$ \begin{aligned} \min_{F} \ & \mathrm{tr}(F^T \tilde{L} F) + \mathrm{tr}\big((F - F_{\mathrm{init}})^T U (F - F_{\mathrm{init}})\big) + \mathrm{tr}\big((F - \sigma(F_M W^T))^T U_\alpha (F - \sigma(F_M W^T))\big) \\ \text{s.t.} \ & F_M = \big\Vert_{0 \le m \le M} \tilde{A}^m F, \quad F \mathbf{1}_c = \mathbf{1}_n, \quad F \ge 0. \end{aligned} \quad (5.17) $$

This is a nonconvex optimization problem due to the last regularization term, which involves both $F$ and $F_M$. To solve this, we argue that $F$ and $F_M$ in the last term should not be optimized simultaneously. The reason is that, in deriving the former subproblem in Eq. (5.16), the supervision of unlabeled samples comes from the pseudo-labels of previous iterations. The pseudo-labels should serve as the input (rather than parameters) in the current optimization. For simplicity, we fix the parameter inside $F_M$ (note that $F$ and $\sigma(F_M W^T)$ are initially the same due to the optimization in Eq. (5.14)). Then, we obtain the following optimization problem:

$$ \begin{aligned} \min_{F} \ & \mathrm{tr}(F^T \tilde{L} F) + \mathrm{tr}\big((F - F_{\mathrm{init}})^T U (F - F_{\mathrm{init}})\big) + \mathrm{tr}\big((F - \sigma(F_M W^T))^T U_\alpha (F - \sigma(F_M W^T))\big) \\ \text{s.t.} \ & F_M = \big\Vert_{0 \le m \le M} \tilde{A}^m \hat{F}, \quad F \mathbf{1}_c = \mathbf{1}_n, \quad F \ge 0, \end{aligned} \quad (5.18) $$

where $F_M$ is a constant derived from the pseudo-label embeddings $\hat{F}$, which can be viewed as the initial label embeddings of this optimization subproblem. We have the following theorem for this problem.

Theorem 5. The problem in Eq. (5.18) is a convex optimization problem.
The proof can be found in the Appendix. A straightforward way to solve this problem is to apply the KKT conditions [11] to the Lagrangian function. Instead, we show a different derivation that is intuitively connected to GraphHop and leads to the same optimum. First, we take the derivative of the cost function and set it to zero. It gives

$$ F^{*} = (U + U_\alpha + \tilde{L})^{-1} \big(U F_{\mathrm{init}} + U_\alpha \sigma(F_M W^T)\big). \quad (5.19) $$

Next, we introduce two constants $U'$ and $Y'$:

$$ U' = U + U_\alpha, \qquad U' Y' = U F_{\mathrm{init}} + U_\alpha \sigma(F_M W^T). \quad (5.20) $$

By substituting these two terms in Eq. (5.19), we obtain

$$ F^{*} = (U' + \tilde{L})^{-1} U' Y'. \quad (5.21) $$

It has the same form as Eq. (5.2). Thus, the same result can be derived from a label propagation process as given in Eq. (5.3). It can be expressed as

$$ F^{(t)} = U_\beta \tilde{A} F^{(t-1)} + (I - U_\beta) Y' = U_\beta \tilde{A} F^{(t-1)} + (I - U_\beta)\big((U + U_\alpha)^{-1} U F_{\mathrm{init}} + (U + U_\alpha)^{-1} U_\alpha \sigma(F_M W^T)\big), \quad (5.22) $$

where $U_\beta = (I + U')^{-1} = (I + U + U_\alpha)^{-1}$. This is summarized in the following proposition.

Proposition 1. The iteration process in Eq. (5.22) converges to Eq. (5.21) and further to Eq. (5.19).

The proof is given in the Appendix. Also, we have the following theorem stating the relationship between the convergence result, i.e., Eq. (5.19), and the optimal solution to the optimization problem in Eq. (5.18).

Theorem 6. If the variable, $F$, is initialized in a probabilistic way, i.e., $F^{(0)} \mathbf{1}_c = \mathbf{1}_n$ and $F^{(0)} \ge 0$, then the convergence result of the iteration process in Eq. (5.22) is the optimal solution to the optimization problem in Eq. (5.18).

The proof is shown in the Appendix. Note that the probabilistic initialization can be easily achieved. That is, we can directly adopt the result from the former round of optimization and initialize it as $F_{\mathrm{init}}$. Theorem 6 provides an iterative solution to the optimization problem in Eq. (5.18). By comparing the iteration process in Eq. (5.22) with the label aggregation in Eq. (5.8) and the label update in Eq. (5.10) of GraphHop, we see that GraphHop only uses the last term of Eq. (5.22) with one iteration, i.e.,

$$ F^{*} \approx F = \sigma(F_M W^T). \quad (5.23) $$

This is a rough approximation to the iteration process in Eq. (5.22). On the one hand, it does not yield the optimal solution to Eq. (5.18) in general. On the other hand, it still meets the probability constraints. Most importantly, it lowers the cost of optimizing Eq. (5.18), which makes GraphHop scalable to large-scale networks.

We summarize the main results of this section as follows. GraphHop offers an alternate optimization solution to the variational problem in Eq. (5.11). It solves two approximate subproblems alternately and iteratively. Its classifier training in the label update step is a convex transformation of the optimization problem in Eq. (5.12). Its label aggregation and update steps as given in Eqs. (5.8) and (5.10) yield an approximate solution to the optimization problem in Eq. (5.18), which itself is an approximated convex transformation of the problem in Eq. (5.17). Based on this understanding, we propose an enhancement, named GraphHop++, to address the limitations of GraphHop.

5.4 GraphHop++

The discussion in the last section leads naturally to two ideas to improve the solutions to the two alternate optimization subproblems. They are elaborated in Secs. 5.4.1 and 5.4.2, respectively. An overview of the GraphHop++ model is shown in Fig. 5.1.
In the left subfigure, its upper part shows labeled nodes from two different classes (in blue and red) and unlabeled nodes (in gray), while its lower part shows the label embeddings predicted by the LR classifier. They serve as the initialization for the subsequent alternate optimization process. The right subfigure depicts the alternate optimization process. It consists of two alternate optimization steps: 1) iterations for label embedding update and 2) classifier training for classifier parameter update. In the right subfigure, nodes in the curriculum set are colored in green. Only the labeled nodes and nodes in the curriculum set are employed in the classifier training step.

Figure 5.1: An overview of the GraphHop++ method, where the left subfigure shows the initialization for the subsequent process while the right subfigure depicts the alternate optimization process. The latter consists of: 1) iterations for label embeddings update and 2) classifier training for classifier parameters update.

Figure 5.2: Illustration of growing curriculum sets in the alternate optimization process: (a) the original graph, (b) and (c) selected reliable nodes in the curriculum set in the first and second rounds of the optimization process. Two labeled nodes of two classes are colored in blue and red; unlabeled nodes and nodes in the curriculum set are colored in gray and green, respectively. The dotted ellipses show the corresponding curriculum sets of the two labeled nodes.

5.4.1 Enhancement in Classifier Training

In the subproblem of classifier parameter updating with fixed label embeddings as described in Sec. 5.3.2, the substituted objective function in GraphHop is given in Eq. (5.16). Unlabeled samples are supervised by label embeddings from the previous iteration, which can be regarded as a self-training process with pseudo-labels. However, the pseudo-labels may not be reliable, which may have a negative impact on classifier training. Besides, this self-training may suffer from label error feedback [75], which makes the classifier biased in the wrong direction. Thus, we should not include all unlabeled samples but only the most trustworthy ones in classifier training. Intuitively, the reliability of pseudo-labels of unlabeled samples can be measured by their distances to labeled ones in the network; namely, the closer to labeled samples, the more reliable the pseudo-labels. The rationale is that ground-truth labels should have a stronger influence on their neighborhoods than on nodes farther away through propagation. Formally, we define the collection of reliable nodes as follows.

Definition 1. The set of reliable nodes is defined by the curriculum set $S$, which is a subset of unlabeled samples, $S \subseteq \mathcal{U}$. At the $r$th round (starting from $r = 1$) of the alternate optimization process, nodes in this set satisfy $S^{(r)} = \{u \,|\, \min_{v \in \mathcal{L}} \mathrm{dist}(u, v) \le r, \ u \in \mathcal{U}\}$, where $\mathrm{dist}(u, v)$ is the shortest-path distance between the two nodes.

In words, at the $r$th round of the optimization process, unlabeled nodes that are within $r$ hops of the labeled nodes are accepted as reliable nodes and included in the curriculum set. This is illustrated in Fig. 5.2. Then, we can modify the objective function of classifier training in Eq. (5.16) as

$$ \min_{W} \ -\sum_{i=1}^{l} y_i \log\big(\sigma(W f_{M,i}^T)\big) - \alpha \sum_{i \in S} \mathrm{Sharpen}(f_i) \log\big(\sigma(W f_{M,i}^T)\big). \quad (5.24) $$

As a result, only reliable nodes in the curriculum set contribute to classifier training.
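A possible implementation of Definition 1 is a multi-source breadth-first search started from all labeled nodes, which yields the hop distance to the nearest labeled node. The sketch below assumes a SciPy CSR adjacency matrix and integer node indices; the helper name curriculum_set is illustrative rather than taken from the released code.

import collections
import scipy.sparse as sp

def curriculum_set(A, labeled_idx, r):
    """Unlabeled nodes within r hops of any labeled node (Definition 1)."""
    A = sp.csr_matrix(A)
    dist = {i: 0 for i in labeled_idx}          # multi-source BFS from all labeled nodes
    queue = collections.deque(labeled_idx)
    while queue:
        u = queue.popleft()
        if dist[u] >= r:                        # no need to expand beyond r hops
            continue
        for v in A.indices[A.indptr[u]:A.indptr[u + 1]]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    labeled = set(labeled_idx)
    return [u for u in dist if u not in labeled]

Since S^(r) grows monotonically with the round index r, the set only needs to be recomputed (or extended) once per alternate round.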
This enhancement also reduces the training time since there are fewer noisy samples.

5.4.2 Enhancement in Label Embeddings Update

For the subproblem of label embeddings update with fixed classifier parameters as presented in Sec. 5.3.2, we showed that, with a probabilistic initialization of the label embeddings, the optimal solution to Eq. (5.18) can be derived from an iteration process as given in Eq. (5.22). For GraphHop, it is further simplified to one iteration with only the pseudo-labels generated from the classifier as given in Eq. (5.23). It is, however, a rough approximation. GraphHop can be enhanced by considering a more accurate solution to Eq. (5.19). The inverse of the Laplacian matrix in Eq. (5.19) has a high time complexity of $O(n^3)$, which is not practical. Instead, we adopt the iteration in Eq. (5.22) to approximate the optimal solution with fewer iterations and avoid matrix inversion. We propose the following iteration:

$$ F^{(t)} = U_\beta \tilde{A} F^{(t-1)} + (I - U_\beta) F_{\mathrm{init}} = \beta \tilde{A} F^{(t-1)} + (1 - \beta) F_{\mathrm{init}}, \quad (5.25) $$

where the hyper-parameters in $U_\beta$ are simplified to be the same for all nodes. The first iteration at each round is given by

$$ F^{(0)} = \sigma(F_M W^T), \quad \text{where} \ F_M = \big\Vert_{0 \le m \le M} \tilde{A}^m \hat{F}. \quad (5.26) $$

Note that we drop the last classifier inference term $\sigma(F_M W^T)$ in Eq. (5.25) for simplicity since it has already been included in Eq. (5.26). The balance between $F_{\mathrm{init}}$ and $F^{(t-1)}$ can also be achieved by adjusting the weight $\beta$. If there is no iteration in Eq. (5.25) (i.e., $t = 0$), it degenerates to GraphHop. Another advantage of the iteration is that it can alleviate the oversmoothing problem introduced by the continuous label embedding update in Eq. (5.23). Since the initial label embeddings are added at each iteration as shown in Eq. (5.25), the distinct embedding information from the initialization stage is consistently enforced at each node during the optimization process.

With the above two enhancements, the pseudo-code of GraphHop++ is given in Algorithm 3. Note that we adopt the same label initialization as GraphHop as given in Eq. (5.6). The LR classifier is a linear classifier by nature. Yet, nonlinearity can be introduced with nonlinear kernels. This has not yet been tried in this work. Experimental results show that GraphHop++ usually converges in less than 20 alternate rounds, while the number of iterations of Eq. (5.25) is less than 10.

Algorithm 3 GraphHop++
1: Input: graph A, attributes X, label vectors Y
2: Initialization:
3: X_M ← compute Eq. (5.5)
4: while not converged do
5:   for each minibatch of labeled nodes do
6:     Compute g ← ∇L(X_M, Y; W) in Eq. (5.6)
7:     Conduct an Adam update using gradient estimator g
8:   end for
9: end while
10: F^(0) ← compute Eq. (5.7)
11: Alternate optimization of Eq. (5.11):
12: for r ∈ [1, ..., max_round] do
13:   for t ∈ [1, ..., max_iter] do
14:     Compute Eq. (5.25)
15:   end for
16:   Update the curriculum set S defined in Definition 1
17:   while not converged do
18:     for each minibatch of all nodes do
19:       Compute g ← ∇L(F, F_M, Y; W) in Eq. (5.24)
20:       Conduct an Adam update using gradient estimator g
21:     end for
22:   end while
23:   Compute Eq. (5.26)
24: end for
25: Output: labels of unlabeled nodes y_i = argmax_{1≤j≤c} f_ij
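The enhanced label-embedding update of Eqs. (5.25)-(5.26) can be sketched as follows, assuming dense NumPy label embeddings, a sparse random-walk normalized adjacency matrix A_tilde, and a fixed classifier weight matrix W. The function name and defaults (beta, max_iter, M) are illustrative and correspond to the hyper-parameters searched in Sec. 5.5.2, not to the exact released implementation.

import numpy as np

def propagate_labels(A_tilde, F_hat, F_init, W, beta=0.5, max_iter=10, M=2):
    """One round of the enhanced label-embedding update, Eqs. (5.25)-(5.26)."""
    # Eq. (5.26): start from the classifier inference on aggregated pseudo-labels.
    blocks, cur = [F_hat], F_hat
    for _ in range(M):
        cur = A_tilde @ cur
        blocks.append(cur)
    F_M = np.concatenate(blocks, axis=1)
    logits = F_M @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    F = np.exp(logits)
    F /= F.sum(axis=1, keepdims=True)
    # Eq. (5.25): blend neighborhood propagation with the initial embeddings.
    for _ in range(max_iter):
        F = beta * (A_tilde @ F) + (1.0 - beta) * F_init
    return F

Setting max_iter = 0 skips the propagation loop and reduces the routine to the one-step GraphHop update of Eq. (5.23).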
Table 5.1: Benchmark dataset properties and statistics.

Dataset         Nodes     Edges      Classes   Feature Dims
Cora            2,708     5,429      7         1,433
CiteSeer        3,327     4,732      6         3,703
PubMed          19,717    44,338     3         500
Amazon Photo    7,487     119,043    8         745
Coauthor CS     18,333    81,894     15        6,805

5.5 Experiments

We evaluate GraphHop++ on the transductive semi-supervised node classification task with multiple real-world graph datasets in Secs. 5.5.1-5.5.6. Furthermore, we apply it to an object recognition problem in Sec. 5.5.7.

5.5.1 Datasets

For performance benchmarking, we consider five widely used graph datasets whose statistics are shown in Table 5.1, including the numbers of nodes, edges, node labels (i.e., classes), and node feature dimensions. Among them, Cora, CiteSeer, and PubMed [66, 124, 157] are three citation networks. Their nodes represent documents, links indicate citations between documents, labels denote document categories, and node features are bag-of-words vectors. Amazon Photo [94, 126] is a co-purchase network. Its nodes represent goods items, and links suggest that two goods items are frequently purchased together. Its labels are product categories, and node features are bag-of-words encoded product reviews. Coauthor CS [126] is a co-authorship graph. Its nodes are authors, which are connected by an edge if they have co-authored a paper. Node features are keywords of the author's papers, while a node label indicates the author's most active field of study.

For data splitting, the standard evaluation of semi-supervised node classification is to sample 20 labeled nodes per class as the training set [66, 126]. To demonstrate the label efficiency of GraphHop++, we examine extremely low label rates in the training set. That is, we randomly sample 1, 2, 4, 8, 16, and 20 labeled nodes per class to form the training set. After that, we follow the standard evaluation procedure by using 500 of the remaining samples as the validation set and the rest as the test set. We apply the same dataset split rule to all benchmarking methods for a fair comparison.

5.5.2 Experimental Settings

Benchmarking Methods, Performance Metrics and Experimental Environment

We compare GraphHop++ with representative state-of-the-art methods of the following three categories.
1. GCN-based Methods: GCN [66] and GAT [137].
2. Propagation-based Methods: LP [184], APPNP [68], C&S [56], and GraphHop [151].
3. Label-efficient GCN-based Methods: Co-training GCN [78], self-training GCN [78], IGCN [79], GLP [79], and CGPN [139].
We implement GCN and GAT using the PyG library [29] and LP based on its description. For the other benchmarking methods, we adopt their released code. All experimental settings are the same as those specified in their original papers. We run each method on the five datasets as described in Sec. 5.5.1.

All experiments for GraphHop++ and the benchmarking methods are conducted ten times on each dataset. The mean test accuracy and the standard deviation are used as the evaluation metric. The experimental environment is a machine with an NVIDIA Tesla V100 GPU (32-GB memory), a ten-core Intel Xeon CPU, and 50 GB of RAM.

Implementation Details of GraphHop++

We implement GraphHop++ in PyTorch by following GraphHop. That is, we train two independent LR classifiers with one- and two-hop neighborhood aggregation. Thus, two LR classifiers are trained in the label embedding initialization stage and in the alternate optimization process, respectively. For classifier training, we adopt the Adam optimizer with a learning rate of 0.01 and a weight decay of $5\times 10^{-5}$ for regularization.
Minibatch training is adopted for larger label rates, while full-batch training is used for smaller label rates since there may not be enough labeled samples in a minibatch for the latter. The number of training epochs is set to 1000 with an early stopping criterion to avoid overfitting. It is observed in experiments that the above-mentioned hyper-parameters have little impact on classifier training; namely, the classifier converges efficiently for a wide range of hyper-parameters.

We set the maximum number of alternate rounds in the alternate optimization process to 100 for all datasets since this number is large enough for GraphHop++ to converge, as observed in the experiments. We perform a grid search based on validation results in the hyper-parameter space of T, α, β, and max_iter. The temperature, T, in the sharpening function is searched over the grid {0.1, 0.5, 1, 10, 100}, the weighting parameter α in the classifier objective function is searched over the grid {0.1, 1, 10, 100}, the weighting parameter β in the label embedding iteration is searched over the grid {0.1, 0.5, 0.9}, and the number of iterations max_iter is searched over the grid {1, 5, 10}. These hyper-parameters are tuned for different label rates and datasets.

Table 5.2: Test accuracy for the Cora dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third ones are underlined.

Cora (# of labels per class)   1             2             4             8             16            20
GCN                      40.48±3.62    49.70±1.56    67.23±1.34    73.88±0.75    79.66±0.45    81.76±0.25
GAT                      40.12±5.75    50.69±2.08    68.82±2.00    75.08±0.81    79.45±0.59    81.59±0.44
LP                       51.34±0.00    54.19±0.00    60.64±0.00    67.57±0.00    69.47±0.00    71.03±0.00
APPNP                    60.70±1.26    68.49±1.31    75.12±0.93    79.15±0.53    81.17±0.40    82.46±0.61
C&S                      37.46±14.90   33.93±8.61    48.93±9.94    64.07±2.62    73.63±1.14    74.90±1.06
GraphHop                 59.12±3.18    59.21±2.66    73.22±0.86    75.48±0.90    79.29±0.46    81.05±0.39
Co-training GCN          56.58±0.80    66.92±0.73    71.95±0.65    75.56±0.93    77.36±0.76    80.17±0.97
Self-training GCN        39.71±3.42    52.32±6.72    65.82±4.85    75.16±1.90    78.14±0.92    80.57±0.59
IGCN                     61.23±1.94    63.75±2.59    71.43±0.66    78.46±0.57    80.04±0.50    82.51±0.41
GLP                      55.67±6.79    57.71±3.99    70.26±2.68    76.78±1.17    80.38±0.60    82.17±0.71
CGPN                     70.56±0.00    66.41±0.00    72.71±0.00    76.30±0.00    75.91±0.00    78.14±0.00
GraphHop++               72.47±0.50    73.86±0.67    79.15±0.52    80.07±0.22    81.27±0.56    82.65±0.30

Table 5.3: Test accuracy for the CiteSeer dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third ones are underlined.
CiteSeer (# of labels per class)   1            2            4            8            16           20
GCN                          26.04±1.65   34.69±3.25   52.02±0.75   61.51±0.87   68.03±0.36   68.67±0.26
GAT                          26.83±2.32   41.95±3.99   52.43±0.97   62.86±0.91   68.23±0.46   68.47±0.51
LP                           20.10±0.00   32.72±0.00   33.25±0.00   42.21±0.00   46.58±0.00   47.51±0.00
APPNP                        34.18±1.53   47.04±2.64   54.55±0.60   65.40±0.47   69.46±0.48   70.32±0.72
C&S                          25.34±4.88   25.47±3.39   37.55±3.19   46.24±1.41   55.68±1.47   57.98±1.22
GraphHop                     48.40±3.08   53.27±5.16   54.34±1.66   60.11±1.60   64.99±1.26   67.47±0.66
Co-training GCN              28.24±0.27   36.55±0.99   33.77±3.62   58.20±1.06   64.10±1.69   67.79±0.76
Self-training GCN            30.45±5.76   36.34±8.54   43.59±5.91   62.50±3.23   68.41±0.84   69.63±0.31
IGCN                         29.63±0.45   45.19±1.18   51.48±1.88   64.49±1.03   68.97±0.43   69.80±0.28
GLP                          24.10±5.67   40.00±5.45   49.83±2.36   63.54±1.16   68.19±0.89   69.10±0.37
CGPN                         51.93±0.00   62.31±0.00   50.55±0.00   59.63±0.00   63.05±0.00   62.91±0.00
GraphHop++                   53.20±0.85   59.66±0.44   61.04±1.24   66.39±0.44   69.62±0.32   70.77±0.47

Table 5.4: Test accuracy for the PubMed dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third ones are underlined.

PubMed (# of labels per class)   1            2            4            8            16           20
GCN                        48.11±9.76   65.01±2.05   70.99±0.35   70.57±0.79   76.84±0.26   77.38±0.20
GAT                        57.64±5.52   70.04±1.14   69.79±0.39   71.66±0.30   75.62±0.35   76.46±0.20
LP                         63.70±0.00   67.16±0.00   66.37±0.00   65.89±0.00   68.63±0.00   70.55±0.00
APPNP                      71.05±0.37   71.37±0.35   71.05±0.19   72.73±1.22   79.17±0.39   79.22±0.41
C&S                        46.58±6.26   57.41±8.33   67.32±2.00   71.08±0.99   74.25±0.92   74.51±0.85
GraphHop                   67.13±2.51   68.82±0.80   69.62±0.31   71.21±0.58   74.98±0.31   76.05±0.32
Co-training GCN            62.41±0.35   68.70±0.31   67.27±0.43   69.66±0.17   76.21±0.23   76.93±0.20
Self-training GCN          54.06±7.89   70.50±2.83   67.43±1.39   69.40±1.03   73.44±1.92   76.96±0.85
IGCN                       70.17±0.11   71.62±0.11   70.93±0.08   73.45±0.41   78.96±0.28   79.53±0.15
GLP                        70.06±0.32   71.30±0.41   70.66±0.68   73.39±0.33   77.92±0.32   79.13±0.31
CGPN                       69.22±0.00   69.48±0.00   68.38±0.00   68.97±0.00   68.87±0.00   69.93±0.00
GraphHop++                 72.67±1.79   72.00±0.36   71.11±0.19   72.59±0.07   77.20±1.45   79.71±0.33

Table 5.5: Test accuracy for the Amazon Photo dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third ones are underlined.
Amazon Photo (# of labels per class)   1             2             4             8             16            20
GCN                              35.61±3.37    39.52±10.35   69.55±1.37    72.53±0.90    75.76±1.06    78.21±0.55
GAT                              46.24±0.00    67.66±2.52    78.78±4.24    83.30±1.86    83.54±1.47    84.78±1.11
LP                               61.09±0.00    72.96±0.00    67.79±0.00    76.69±0.00    81.32±0.00    82.62±0.00
APPNP                            36.72±29.73   33.15±32.40   50.90±35.43   86.63±0.56    88.19±0.57    87.50±0.60
C&S                              33.01±18.02   36.47±5.19    64.45±4.27    78.69±2.44    84.67±1.18    85.39±1.37
GraphHop                         58.76±4.12    70.86±7.51    78.30±2.78    83.67±1.24    87.16±1.55    88.88±0.97
Co-training GCN                  45.97±6.00    56.84±5.29    70.82±3.44    75.94±2.92    79.49±1.11    81.45±1.05
Self-training GCN                20.21±11.28   28.65±8.98    65.71±4.80    72.69±4.61    80.56±2.14    83.09±0.87
IGCN                             29.57±7.03    28.70±4.06    56.51±1.41    67.12±3.24    72.77±4.10    79.61±1.14
GLP                              9.12±0.00     9.25±0.00     26.59±4.20    35.58±0.26    40.49±5.74    55.30±6.56
CGPN                             62.22±0.00    64.76±0.00    78.35±0.00    80.03±0.00    78.14±0.00    82.16±0.00
GraphHop++                       67.41±2.00    83.10±1.58    86.10±0.75    89.26±0.36    91.83±0.33    92.06±0.29

Table 5.6: Test accuracy for the Coauthor CS dataset with extremely low label rates measured by "mean accuracy (%) ± standard deviation". The highest mean accuracy is in bold while the second and third ones are underlined.

Coauthor CS (# of labels per class)   1             2             4             8             16            20
GCN                             64.42±2.14    74.07±1.09    82.62±0.74    87.88±0.55    89.86±0.15    90.19±0.17
GAT                             72.81±2.01    80.76±1.25    85.37±0.94    88.06±0.61    89.83±0.16    89.82±0.17
LP                              52.63±0.00    59.34±0.00    61.77±0.00    68.60±0.00    73.50±0.00    74.65±0.00
APPNP                           71.06±17.48   86.18±0.56    86.57±0.58    89.66±0.30    90.65±0.21    90.55±0.17
C&S                             35.63±13.20   67.11±3.50    78.57±1.22    83.92±1.15    87.09±0.95    87.97±0.56
GraphHop                        65.03±0.01    77.59±3.17    83.79±1.13    86.69±0.62    89.47±0.38    89.84±0.00
Co-training GCN                 75.78±1.00    86.94±0.70    87.40±0.78    88.81±0.32    89.22±0.42    89.01±0.55
Self-training GCN               69.69±3.27    82.79±4.10    87.62±1.49    88.82±1.00    89.53±0.54    89.07±0.70
IGCN                            62.16±2.81    59.56±2.91    65.82±4.86    86.57±0.78    87.76±0.74    88.10±0.52
GLP                             43.56±7.06    50.74±7.55    46.61±9.85    76.61±3.39    81.75±2.81    82.43±3.31
CGPN                            67.66±0.00    64.49±0.00    71.00±0.00    77.09±0.00    78.75±0.00    79.71±0.00
GraphHop++                      82.46±1.28    86.37±0.37    88.45±0.35    89.87±0.48    90.69±0.13    90.87±0.06

5.5.3 Performance Evaluation

The performance results on the five graph datasets are summarized in Tables 5.2, 5.3, 5.4, 5.5, and 5.6, respectively. Each column shows the classification accuracy (%) of GraphHop++ and the benchmarking methods on test data under a specific label rate. Overall, GraphHop++ has the top performance among all comparators in most cases. In particular, for cases with extremely limited labels, GraphHop++ outperforms other benchmarking methods by a large margin. This is because graph convolutional networks are difficult to train with a small number of labels. The lack of supervision prevents them from learning the transformation from the input feature space to the label embedding space with nonlinear activation at each layer. Instead, GraphHop++ applies regularization to the label embedding space. It is more effective since it relieves the burden of learning the transformation. A small number of labeled nodes also restricts the efficacy of message passing in GCN-based methods. In general, only two convolutional layers are adopted by GCN-based methods, which means that only messages in the two-hop neighborhood of each labeled node can be supervised. However, the two-hop neighborhoods of a limited number of labeled nodes cannot cover the whole network effectively.
As a result, a large number of nodes do not receive supervised training from labels, which results in the inferior performance of all GCN-based methods. Some label-efficient GCN-based methods try to alleviate this problem by exploiting pseudo-labels as supervision (e.g., self-training GCN) or by improving the message passing capability (e.g., IGCN and CGPN). However, they are still handicapped by deficient message passing in the graph convolutional layers.

Some propagation-based methods (e.g., GraphHop++, APPNP, and GraphHop) achieve better performance on most datasets with low label rates. This is because the messages from labeled nodes can pass a longer distance on graphs through an iterative process. Yet, although LP and C&S are also propagation-based, their performance is poorer since LP fails to encode the rich node attribute information in model learning, while C&S was originally designed for supervised learning and the lack of labeled samples degrades its performance significantly.

5.5.4 Convergence Analysis

Since GraphHop++ can be interpreted as an alternate optimization process of an objective function, we show the convergence of the label embeddings F and the parameters of the LR classifiers W in Figs. 5.3 and 5.4, respectively, to demonstrate its convergence behavior.

Figure 5.3: Convergence results of the label embeddings for the five benchmarking datasets: (a) Cora, (b) CiteSeer, (c) PubMed, (d) Amazon Photo, and (e) Coauthor CS. The x-axis is the number of alternate optimization rounds and the y-axis is the test accuracy (%). Different curves show the mean accuracy values under different label rates and the shaded areas represent the standard deviation.

Figure 5.4: Convergence results of the LR classifiers for the benchmarking datasets: (a) Cora, (b) CiteSeer, (c) PubMed, (d) Amazon Photo, and (e) Coauthor CS. The x-axis is the number of training epochs and the y-axis is the training loss. The label rate is 20 labels per class.

Fig. 5.3 shows the accuracy change along with the number of alternate rounds of the optimization process. We see that the label embeddings converge on all five benchmarking datasets under different label rates. A smaller label rate requires a larger number of alternate rounds. Since classifiers are trained on reliable nodes as defined in Eq. (5.24), more alternate rounds are needed to cover the entire node set and propagate label embeddings over the whole graph. Furthermore, the convergence speed is relatively fast. The number of alternate rounds is usually less than 20 in most investigated cases, leading to significant training efficiency as discussed later. Fig. 5.4 depicts the cross-entropy loss curves as a function of the training epoch for the two LR classifiers at a label rate of 20 labels per class. The number of epochs is counted throughout the whole optimization process. As shown in the figure, the curves converge rapidly within several epochs.

5.5.5 Complexity and Memory Requirements

We compare the running time of GraphHop++ and several benchmarking methods. Fig. 5.5 shows the results at a label rate of 20 labeled samples per class with respect to the five datasets. For GraphHop++, we report results with 100 rounds and 20 rounds, respectively, since it can converge within 20 rounds as indicated in Fig. 5.3. We see from Fig. 5.5 that GraphHop++(20) achieves an average running time among the compared methods. The most time-consuming part of GraphHop++ is the training of the LR classifiers, which demands multiple loops over all data samples.
However, we find that incorporating the classifiers is essential to the superior performance of GraphHop++ under limited label rates. In other words, GraphHop++ trades some time for effectiveness in the case of extremely low label rates. Among all datasets, C&S achieves the lowest running time due to its simple LP-like design.

Next, we compare the GPU memory usage of GraphHop++ and several benchmarking methods, measured by torch.cuda.max_memory_allocated() for PyTorch and tf.contrib.memory_stats.MaxBytesInUse() for TensorFlow. The results are shown in Fig. 5.6, where the values of GraphHop++ are taken from the same experiment as the running time comparison given in Fig. 5.5. Generally, GraphHop and GraphHop++ achieve the lowest GPU memory usage among all benchmarking methods across the five datasets. The reason is that GraphHop and GraphHop++ allow minibatch training. The only parameters to be stored in the GPU are the classifier parameters and one minibatch of data. Instead, the GCN-based methods cannot simply conduct minibatch training. Since embeddings from different layers need to be stored for backpropagation, their GPU memory consumption increases significantly.

Figure 5.5: Comparison of the computational efficiency of different methods measured in log(seconds) for the Cora, CiteSeer, PubMed, Amazon Photo, and Coauthor CS datasets, where the label rate is 20 labeled samples per class and GraphHop++(20) is the result of GraphHop++ with 20 alternate optimization rounds.

Figure 5.6: Comparison of the GPU memory usage of different methods measured in log(Mega Bytes) for the Cora, CiteSeer, PubMed, Amazon Photo, and Coauthor CS datasets, where the label rate is 20 labeled samples per class.

Parameter Sensitivity

The effects of three hyper-parameters, i.e., T, α, and β, of GraphHop++ on the performance are analyzed here. Due to page limitations, we focus on a representative case in our discussion below, namely, the CiteSeer dataset with a label rate of one labeled sample per class. To study the effect of the three hyper-parameters, we fix one and change the other two using a grid search. For example, we fix T and adjust α and β. For each setting, results are averaged over 10 runs. They are depicted in Fig. 5.7.

Figure 5.7: Performance of GraphHop++ with various hyper-parameter settings for the CiteSeer dataset, where the z-axis is the accuracy result and the label rate is one label per class: (a) performance of T and β with fixed α, (b) performance of T and α with fixed β, (c) performance of α and β with fixed T.

Figure 5.8: Performance of GraphHop++ with an increasing number of iterations for the five datasets: (a) Cora, (b) CiteSeer, (c) PubMed, (d) Amazon Photo, and (e) Coauthor CS, where the x-axis is the number of alternate optimization rounds and the y-axis is the test accuracy (%). The averaged accuracy under one specific number of iterations is represented as dots and the standard deviation as vertical bars. Different lines denote results under different label rates.

Figure 5.9: Performance comparison of GraphHop++ and GraphHop-V under different label rates: (a) Cora, (b) CiteSeer, (c) PubMed, (d) Amazon Photo, and (e) Coauthor CS.

We have the following observations. First, hyper-parameter T has the largest influence on the performance, and a small value will degrade the accuracy. However, once T > 1, the performance stays the same.
Note that T = 1 means that there is no sharpening operation on the label distribution in Eq. (5.16). In practice, we can eliminate the hyper-parameter T by removing the sharpening operation. Second, α and β have less influence on the performance. By increasing α and β slightly, the performance improves, as shown in Fig. 5.7. We need to tune these two hyper-parameters so as to achieve the best performance.

5.5.6 Ablation Study

Enhancement via Label Embeddings Update

As compared to GraphHop, GraphHop++ conducts several iterations of the label embedding update in Eq. (5.25), which results in a better approximation to the optimum of Eq. (5.18). This is experimentally verified below. We measure the accuracy under the same experimental settings but vary the number of iterations, max_iter, from zero to 20. The results on the five datasets are shown in Fig. 5.8. It is worthwhile to point out that GraphHop++ degenerates to GraphHop with zero iterations, i.e., only performing Eq. (5.26). We see from the figure that increasing the number of iterations improves the performance overall. This is especially true when the label rate is very low. Fewer labeled nodes demand a larger number of iterations for a better approximation. Furthermore, we observe that the performance improvement saturates within 10 iterations. Since several iterations can achieve a sufficient approximation with high accuracy, there is no need to reach optimality in the label embedding optimization. This saves a considerable amount of running time.

Removing LR Classifier Training

GraphHop++ is an alternate approximate optimization of the variational problem given in Eq. (5.11). If we set hyper-parameter U_α = 0, the alternate optimization process reduces to label embedding optimization without the LR classifier training. Note that this is still different from the regularization framework of the classical LP algorithm given in Eq. (5.1), where the ground-truth label matrix, Y, is replaced with the initial label embeddings F_init as supervision. This model without LR classifier training is named "GraphHop-V" (GraphHop-Variant). We conduct experiments with GraphHop-V on the same five datasets and show the results in Fig. 5.9. We see that GraphHop++ outperforms GraphHop-V on all datasets under various label rates. The performance gap is especially obvious when the number of labeled samples is very low. This can be explained as follows. Recall that the LR classifiers are trained on labeled and reliable nodes and used to infer label embeddings for all nodes in the next round as given in Eq. (5.26). The inference improves the propagation of label embeddings from labeled nodes to the entire graph. This is especially important when label rates are low since the performance of the original propagation as given in Eq. (5.3) is poor under this situation [13]. Another interesting observation is that the simple variant can achieve performance comparable with that of GCN under the standard setting of 20 labels per class.

5.5.7 Application to Object Recognition

To show the effectiveness and other potential applications of GraphHop++, we apply it to an object recognition problem in this subsection. COIL20 [100] is a popular dataset for object recognition. It contains 1,440 object images belonging to 20 classes, where each object has 72 images shot from different angles. Some exemplary images are given in Fig. 5.10. The resolution of each image is 32 × 32 and each pixel has 256 gray levels. As a result, each image can be represented as a 1024-D vector.
Each image corresponds to a node. A k-nearest neighbor (kNN) graph with k = 7 is built based on the Euclidean distance between two images; namely, edge weights are the Euclidean distances between connected nodes (see the sketch after Fig. 5.10). We conduct experiments on this constructed graph for GraphHop++ and the other benchmarking methods. We tune the hyper-parameters of all benchmarks and report the best accuracy performance. As to GraphHop++, we set the number of iterations to 50 and α = 0.99 for lower label rates. The results under three different label rates are shown in Table 5.7. GraphHop++ achieves the best performance at all three label rates. The high performance of the LP method indicates that the node attribute information (i.e., 1024 image pixels) does not contribute much to label prediction, which is different from the other benchmarking datasets. Instead, the manifold regularization of label embeddings is the factor most relevant to high performance. This explains the fact that propagation-based methods achieve better accuracy than GCN-based methods at very low label rates. However, once there are enough labeled samples (e.g., 20 labeled samples per class), GCN-based or other propagation-based methods can achieve comparable performance or outperform the LP method. In this case, there are sufficient labeled samples for neural network training, which further exploits the node attribute information in model learning.

Table 5.7: Test accuracy with three label rates measured by "mean accuracy (%) ± standard deviation", where the highest mean accuracy is marked in bold and the second and third are underlined.

COIL20 (# of labels per class)   1            2            20
GCN                        61.05±1.85   68.19±3.75   88.72±1.64
GAT                        57.98±5.27   65.32±3.36   86.66±2.82
LP                         83.48±0.00   85.89±0.00   88.89±0.00
APPNP                      78.11±0.23   79.94±0.60   90.67±0.45
C&S                        67.85±4.66   75.67±2.79   92.50±0.68
GraphHop                   78.42±2.72   78.99±2.95   92.70±0.96
Co-training GCN            60.07±3.04   71.10±2.99   91.22±1.12
Self-training GCN          59.96±1.95   70.32±3.44   91.15±0.75
IGCN                       73.77±4.06   75.70±2.46   82.24±1.73
GLP                        45.53±7.44   39.79±4.04   27.63±3.51
GraphHop++                 86.75±1.29   87.61±0.27   93.07±1.04

Figure 5.10: Illustration of several exemplary images of the COIL20 dataset.
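For completeness, a minimal sketch of the kNN graph construction described above is given here, assuming the 1,440 images are stacked as rows of a NumPy array. Symmetrizing the directed kNN relation by taking the maximum edge weight is an assumption made for illustration and is not necessarily the exact rule used in the experiments.

import numpy as np
import scipy.sparse as sp

def build_knn_graph(X, k=7):
    """Symmetric kNN graph from feature vectors X (n x d) with Euclidean edge weights."""
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                    # exclude self-loops
    n = X.shape[0]
    rows = np.repeat(np.arange(n), k)
    cols = np.argsort(d2, axis=1)[:, :k].ravel()    # indices of the k nearest neighbors
    weights = np.sqrt(d2[rows, cols])
    A = sp.coo_matrix((weights, (rows, cols)), shape=(n, n))
    return A.maximum(A.T).tocsr()                   # symmetrize by taking the max weight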
To improve the label efficiency of GCNs, researchers introduce other semi-supervised learning techniques [78, 131] or improve the layer-wise aggregation strength [79, 139]. However, they still inherit the deficiencies of graph convolutional layers.

Propagation-based Methods. Propagation-based methods for node classification on graphs can be traced back to the label propagation algorithms in [184, 190]. Recently, there has been a renaissance in combining propagation schemes with advanced neural networks [21, 56, 68, 69, 186]. These methods attempt to encode higher-order correlations while preserving nodes' local information. For example, APPNP [68] propagates neural predictions via label propagation, with small contributions of the predictions added at each iteration [184]. It gives superior performance as compared to GCN-based methods. Yet, these methods do not target very low label rates. Besides, the joint learning of the feature transformation still requires a considerable amount of labels for training.

5.7 Conclusion

New insights into the underlying mechanism of GraphHop were obtained using a regularization framework in this work. Simply speaking, GraphHop can be viewed as an alternate optimization process that optimizes a regularized function defined on graphs under probability constraints. Motivated by this understanding, an enhanced version of GraphHop, called GraphHop++, was proposed. GraphHop++ solves two convex transformed subproblems in each round with two salient features. First, it adopts iterations to offer an improved approximation to the optimal label embedding. Second, it determines reliable nodes adaptively for classifier training. The performance of GraphHop++ was tested and compared with a large number of existing methods on five commonly used datasets as well as an object recognition problem. Its superior performance, especially at extremely low label rates, demonstrates its effectiveness and efficiency. It would be interesting to apply other regularization schemes to graph learning problems based on the framework of GraphHop and GraphHop++ as a future extension.

Chapter 6
Conclusion and Future Work

6.1 Summary of the Research

Graph learning has received much attention in the machine learning and data mining communities in recent years due to the abundance of graph-structured data in the real world and the invention of powerful learning methods, graph neural networks (GNNs). Although GNNs have been considered by many as the de facto solution, they are still inferior in several aspects, such as interpretability, scalability, and label efficiency. In this dissertation, we focus on developing efficient solutions to graph learning problems from the perspectives of graph convolutional networks, graph signal processing, and regularization frameworks. Through innovations in algorithms, theories, and applications, we extend GNNs to a unique graph structure and pave a new way of solving graph learning problems with both effectiveness and efficiency.

From the GNN perspective, we extend the general graph convolutional network to the node representation learning problem in bipartite graphs. The unique connections between nodes from two sets hinder the GNN from direct application. In particular, the first challenge is domain inconsistency: nodes in the two sets follow different but correlated feature distributions, which raises difficulties for the standard message passing in GNNs.
Second, unsupervised node representation learning and scalability to large-scale networks should be considered simultaneously due to the enormous graph sizes in the real world. Accordingly, we propose a layerwise trained bipartite graph neural network (L-BGNN) to address these challenges. Specifically, L-BGNN adopts a unique message passing with adversarial training between the embedding spaces of the two node sets, which effectively encodes the graph topology and the correlations between feature domains. The adversarial training also enables unsupervised learning of the node representations. In addition, L-BGNN adopts a layerwise training mechanism with low memory cost and without performance deterioration for efficiency on large-scale graphs. We conduct a variety of downstream tasks to verify the capacity of the learned representations, which consistently outperform the state-of-the-art approaches by a large margin.

From the graph signal perspective, we propose a novel two-stage training algorithm named GraphHop for the semi-supervised node classification task. We design the model following our assumption that both attributes and labels are smooth signals distributed on graphs, where correct labels can be predicted from the smoothened attributes. The validity of this assumption is empirically verified on all benchmarking networks. To encode smooth attribute information, we design a neighborhood averaging operation that serves as low-pass filtering of the attribute signal and exploit logistic regression for label initialization. An iterative propagation is then applied to the initialized label predictions to further extract smooth label signals. This two-stage training framework makes GraphHop scalable to large-scale graphs. In addition, logistic regression is introduced to further smoothen the label signal in the local neighborhood of each node, which enhances the capability at extremely small label rates. We conduct experiments on networks of various scales and on different node classification tasks, showing state-of-the-art performance.

From the regularization framework perspective, we develop a variational interpretation and theoretical understanding of the proposed GraphHop method. We show that the iteration process in the GraphHop method can be explained as an alternate optimization of a certain regularization problem defined on graphs under probability constraints. Based on this interpretation, we propose an enhanced version of GraphHop, named GraphHop++. In particular, GraphHop++ introduces two salient designs for better solutions to the two convex transformed subproblems: 1) additional iterations for a better approximation to the optimal label embedding; and 2) adaptive determination of the reliable nodes for better classifier training. Experiments show that, equipped with these two improvements, GraphHop++ achieves superior performance on five benchmarking datasets as well as an object recognition problem. Its exceptional accuracy at extremely small label rates demonstrates its effectiveness and efficiency.
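To make the flavor of such lightweight pipelines concrete, the sketch below illustrates the general "smooth the attributes, fit a simple classifier, then propagate the labels" recipe summarized above. It is a simplified illustration rather than the actual GraphHop or GraphHop++ implementation; the aggregation depth, propagation weights, and label-clamping rule are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import diags
from sklearn.linear_model import LogisticRegression

def row_normalize(A):
    """Random-walk normalization D^{-1}A used for neighborhood averaging."""
    deg = np.asarray(A.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0
    return diags(1.0 / deg) @ A

def smooth_then_classify(A, X, y, train_idx, hops=2, prop_iters=10, alpha=0.9):
    # A: sparse adjacency matrix; X: dense (n x d) attribute matrix;
    # y: integer class labels 0..C-1 for the labeled nodes in train_idx.
    # Stage 1: low-pass filter the attribute signal by repeated neighborhood
    # averaging, then fit logistic regression on the few labeled nodes to
    # obtain initial label embeddings for all nodes.
    A_rw = row_normalize(A)
    X_smooth = X.copy()
    for _ in range(hops):
        X_smooth = A_rw @ X_smooth
    clf = LogisticRegression(max_iter=1000).fit(X_smooth[train_idx], y[train_idx])
    F = clf.predict_proba(X_smooth)          # initial label embeddings

    # Stage 2: iteratively propagate (smooth) the label embeddings while
    # keeping a contribution of the initial predictions at every step.
    F_init = F.copy()
    one_hot = np.eye(F.shape[1])[y[train_idx]]
    for _ in range(prop_iters):
        F = alpha * (A_rw @ F) + (1 - alpha) * F_init
        F[train_idx] = one_hot               # clamp the known labels
    return F.argmax(axis=1)
```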
6.2 Future Research Directions

The success of deep learning and neural networks has attracted great attention in recent years and is considered by many as the ultimate path to real artificial intelligence. However, the superior performance of deep learning is often restricted to the training datasets and relies heavily on label supervision, which is largely due to the lack of interpretability of neural networks. Graph neural networks (GNNs), which adopt the same deep learning paradigm for graph learning problems, inherit those deficiencies. On the other hand, graphs are such a flexible data structure that they can naturally model distinct research fields, such as images, texts, robotics, physical systems, etc., which provides great potential for integrating learning across different domains. In the future, we would like to consider two challenges in current graph learning research and bring up the following research problems.

• Efficiency. Efficiency does not simply mean that a machine learning algorithm is fast, has a low memory cost, and needs less label supervision. More importantly, efficiency should and can be derived from a profound understanding of the graph learning problem as well as the designed methods.

• Cross domains. As mentioned earlier, graphs are a ubiquitous data structure that exists in a diversity of domain applications. Utilizing this cross-domain nature to design machine learning algorithms that can learn relations among, and benefit from, different domain knowledge will be the next generation of artificial intelligence.

6.2.1 Efficient Graph Learning Methods

Newly designed graph neural networks are becoming more complicated and data-hungry, but with less mathematical interpretation. The question that arises is whether a more sophisticated method really learns more than a simpler one. Recently, several works have tried to explain the reasoning behind GNNs [78, 79, 107, 148] and conclude that the graph convolutional network (GCN) is no more than a low-pass filter on the input attribute signals. In particular, [148] achieved the same accuracy as GCN, but with higher efficiency, by dropping all the nonlinear terms. In Chapters 4 and 5, we propose two efficient graph learning algorithms for semi-supervised node classification, which also outperform GCN-based methods by a large margin. The reason lies in the understanding of the networks and the rigorous mathematical analysis, where no resource is wasted on irrelevant designs, such as nonlinear activations and multiple embedding transformations.

Physicists often say how beautiful a theory is, where the beauty originates from how simply it explains complex nature. In the future, we should also design solutions to graph learning problems in an elegant and efficient way that arises from a deeper understanding of the problems and the models.
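As a concrete instance of this observation, a simplified graph convolution in the spirit of [148] can be written in a few lines: the nonlinearities between layers are dropped, so the K-hop feature smoothing is precomputed once and a single linear classifier is trained on top. The sketch below is our own illustration of this idea, not the reference implementation of [148]; the normalization and classifier settings are assumptions.

```python
import numpy as np
from scipy.sparse import identity, diags
from sklearn.linear_model import LogisticRegression

def sym_normalize_with_self_loops(A):
    """Compute D^{-1/2}(A + I)D^{-1/2}, the usual GCN-style operator."""
    A_hat = A + identity(A.shape[0], format="csr")
    deg = np.asarray(A_hat.sum(axis=1)).ravel()
    d_inv_sqrt = diags(1.0 / np.sqrt(deg))
    return d_inv_sqrt @ A_hat @ d_inv_sqrt

def simplified_gcn(A, X, y, train_idx, K=2):
    # Precompute S^K X once: this is the only graph computation, i.e.,
    # K rounds of low-pass filtering of the attribute signal.
    S = sym_normalize_with_self_loops(A)
    X_filtered = X
    for _ in range(K):
        X_filtered = S @ X_filtered
    # A single linear classifier replaces the stacked nonlinear GCN layers.
    clf = LogisticRegression(max_iter=1000).fit(X_filtered[train_idx], y[train_idx])
    return clf.predict(X_filtered)
```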
6.2.2 Learning across Domains

Graphs not only can represent data structures in multiple domains but can also serve as bridges to represent cross-domain relationships. For example, the left of Fig. 6.1 gives a parse tree with the corresponding sentence "The brown dog jumped." In addition, an image with the same semantic meaning is shown on the right. The relationship between these two data modes can also be expressed as a graph. Specifically, the red curves shown in the figure denote the edges of the one-to-one entity mapping between the sentence and the image.

Figure 6.1: Data from distinct domains can have exactly the same representations. Left: the text and its parse tree (the figure is from [3]). Right: the image, i.e., a grid-like graph. Although the data come from two different domains, they can be integrated into one single graph, with the only difference being the edge types, i.e., word-to-word, pixel-to-pixel, and word-to-pixel.

Since graphs have such a flexible ability to represent data in and across different domains, they are potential instruments for enabling machine learning algorithms to learn and reason across domains, which is a critical path toward real artificial intelligence, just like human beings. In particular, current deep meta-learning shares a similar idea by enabling neural networks to learn to learn from multiple tasks. However, the integration of meta-learning and graphs is still immature and worth exploring in the future [33, 83, 193].

Bibliography

1. Abu-El-Haija, S., Perozzi, B., Kapoor, A., Alipourfard, N., Lerman, K., Harutyunyan, H., et al. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing in international conference on machine learning (2019), 21–29.
2. Atwood, J. & Towsley, D. Diffusion-convolutional neural networks in Advances in neural information processing systems (2016), 1993–2001.
3. Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261 (2018).
4. Belkin, M. & Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in neural information processing systems 14, 585–591 (2001).
5. Belkin, M. & Niyogi, P. Semi-supervised learning on Riemannian manifolds. Machine learning 56, 209–239 (2004).
6. Belkin, M., Niyogi, P. & Sindhwani, V. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of machine learning research 7, 2399–2434 (2006).
7. Bengio, Y., Courville, A. & Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 1798–1828 (2013).
8. Bennett, K. P. & Demiriz, A. Semi-supervised support vector machines in Advances in Neural Information processing systems (1999), 368–374.
9. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A. & Raffel, C. A. Mixmatch: A holistic approach to semi-supervised learning in Advances in Neural Information Processing Systems (2019), 5049–5059.
10. Bojchevski, A. & Günnemann, S. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. International Conference on Learning Representations (2018).
11. Boyd, S., Boyd, S. P. & Vandenberghe, L. Convex optimization (Cambridge university press, 2004).
12. Bruna, J., Zaremba, W., Szlam, A. & LeCun, Y. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013).
13. Calder, J., Cook, B., Thorpe, M. & Slepcev, D. Poisson learning: Graph based semi-supervised learning at very low label rates in International Conference on Machine Learning (2020), 1306–1316.
14. Cao, S., Lu, W. & Xu, Q. Grarep: Learning graph representations with global structural information in Proceedings of the 24th ACM international on conference on information and knowledge management (2015), 891–900.
15. Chapelle, O., Scholkopf, B. & Zien, A. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks 20, 542–542 (2009).
16. Chapelle, O., Schölkopf, B. & Zien, A. Label Propagation and Quadratic Criterion (2006).
17. Chapelle, O., Weston, J. & Schölkopf, B. Cluster kernels for semi-supervised learning. Advances in neural information processing systems 15, 601–608 (2002).
18. Chen, J., Zhu, J. & Song, L. Stochastic training of graph convolutional networks with variance reduction.
arXiv preprint arXiv:1710.10568 (2017). 19. Chen, J., Ma, T. & Xiao, C. Fastgcn: fast learning with graph convolutional net- worksviaimportancesampling.InternationalConferenceonLearningRepresentations (2018). 20. Chen, K., Yao, L., Zhang, D., Wang, X., Chang, X. & Nie, F. A semisupervised recur- rent convolutional attention model for human activity recognition. IEEE transactions on neural networks and learning systems 31, 1747–1756 (2019). 21. Chen, M., Wei, Z., Ding, B., Li, Y., Yuan, Y., Du, X., et al. Scalable graph neural networks via bidirectional propagation. Advances in neural information processing systems 33, 14556–14566 (2020). 22. Chen,Z.,Chen,F.,Zhang,L.,Ji,T.,Fu,K.,Zhao,L.,et al.BridgingtheGapbetween Spatial and Spectral Domains: A Survey on Graph Neural Networks. arXiv preprint arXiv:2002.11867 (2020). 23. Chiang, W.-L., Liu, X., Si, S., Li, Y., Bengio, S. & Hsieh, C.-J. Cluster-GCN: An efficient algorithm for training deep and large graph convolutional networks in Pro- ceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019), 257–266. 24. Conneau, A., Lample, G., Ranzato, M., Denoyer, L. & J´ egou, H. Word translation without parallel data. arXiv preprint arXiv:1710.04087 (2017). 25. De Cao, N. & Kipf, T. MolGAN: An implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973 (2018). 26. Defferrard, M., Bresson, X. & Vandergheynst, P. Convolutional neural networks on graphswithfastlocalizedspectralfiltering inAdvancesinneuralinformationprocessing systems (2016), 3844–3852. 27. Dong, Y., Chawla, N. V. & Swami, A. metapath2vec: Scalable representation learning for heterogeneous networks in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (2017), 135–144. 28. Du,B.,Xinyao,T.,Wang,Z.,Zhang,L.&Tao,D.Robustgraph-basedsemisupervised learning for noisy labeled data via maximum correntropy criterion. IEEE transactions on cybernetics 49, 1440–1453 (2018). 29. Fey,M.&Lenssen,J.E.Fast Graph Representation Learning with PyTorch Geometric in ICLR Workshop on Representation Learning on Graphs and Manifolds (2019). 30. Fu, T.-y., Lee, W.-C. & Lei, Z. Hin2vec: Explore meta-paths in heterogeneous in- formation networks for representation learning in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (2017), 1797–1806. 31. Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., et al. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17, 2096–2030 (2016). 130 32. Gao, M., Chen, L., He, X. & Zhou, A. Bine: Bipartite network embedding in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (2018), 715–724. 33. Garcia, V. & Bruna, J. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043 (2017). 34. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural mes- sage passing for quantum chemistry in International Conference on Machine Learning (2017), 1263–1272. 35. Gong, C., Liu, T., Tao, D., Fu, K., Tu, E. & Yang, J. Deformed graph Laplacian for semisupervised learning. IEEE transactions on neural networks and learning systems 26, 2261–2274 (2015). 36. Goodfellow, I., Bengio, Y., Courville, A. & Bengio, Y. Deep learning 2 (MIT press Cambridge, 2016). 37. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. 
Generative adversarial nets in Advances in neural information processing systems (2014), 2672–2680. 38. Grandvalet, Y. & Bengio, Y. Semi-supervised learning by entropy minimization in Advances in neural information processing systems (2005), 529–536. 39. Grover, A. & Leskovec, J. node2vec: Scalable feature learning for networks in Proceed- ings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (2016), 855–864. 40. Hamilton, W., Ying, Z. & Leskovec, J. Inductive representation learning on large graphs in Advances in neural information processing systems (2017), 1024–1034. 41. Hamilton,W.L.,Ying,R.&Leskovec,J.Representationlearningongraphs:Methods and applications. arXiv preprint arXiv:1709.05584 (2017). 42. Hastie,T.,Tibshirani,R.,Friedman,J.H.&Friedman,J.H.Theelementsofstatistical learning: data mining, inference, and prediction (Springer, 2009). 43. He, C., Xie, T., Rong, Y., Huang, W., Li, Y., Huang, J., et al. Bipartite graph neural networks for efficient node representation learning. arXiv e-prints, arXiv–1906 (2019). 44. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition in Proceedings of the IEEE conference on computer vision and pattern recognition (2016), 770–778. 45. He,X.,Liao,L.,Zhang,H.,Nie,L.,Hu,X.&Chua,T.-S.Neural collaborative filtering inProceedings of the 26th international conference on world wide web (2017),173–182. 46. Henaff, M., Bruna, J. & LeCun, Y. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163 (2015). 47. Hinton,G.,Vinyals,O.&Dean,J.Distilling the Knowledge in a Neural Network 2015. 48. Hinton, G. E. Deep belief networks. Scholarpedia 4, 5947 (2009). 49. Hong, W., Wang, Z., Yang, M. & Yuan, J. Conditional generative adversarial network for structured domain adaptation inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), 1335–1344. 50. Hornik, K., Stinchcombe, M., White, H., et al. Multilayer feedforward networks are universal approximators. Neural networks 2, 359–366 (1989). 51. Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of educational psychology 24, 417 (1933). 131 52. Hu,B.,Fang,Y.&Shi,C.Adversariallearningonheterogeneousinformationnetworks in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019), 120–129. 53. Hu, B., Shi, C., Zhao, W. X. & Yang, T. Local and global information fusion for top- n recommendation in heterogeneous information network in Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018), 1683–1686. 54. Hu, B., Shi, C., Zhao, W. X. & Yu, P. S. Leveraging meta-path based context for top- n recommendation with a neural co-attention model in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), 1531–1540. 55. Hu, W., Liu, B., Gomes, J., Zitnik, M., Liang, P., Pande, V., et al. Pre-training graph neural networks. arXiv preprint arXiv:1905.12265 (2019). 56. Huang, Q., He, H., Singh, A., Lim, S.-N. & Benson, A. R. Combining label prop- agation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993 (2020). 57. Huang, W., Zhang, T., Rong, Y. & Huang, J. Adaptive sampling towards fast graph representation learning in Advances in neural information processing systems (2018), 4558–4567. 58. Huang, W., Li, Y., Fang, Y., Fan, J. & Yang, H. 
BiANE: Bipartite Attributed Network Embedding inProceedingsofthe43rdinternationalACMSIGIRconferenceonresearch and development in information retrieval (2020), 149–158. 59. Huang, X., Song, Q., Yang, F. & Hu, X. Large-scale heterogeneous feature embedding in Proceedings of the AAAI conference on artificial intelligence 33 (2019), 3878–3885. 60. Iscen, A., Tolias, G., Avrithis, Y. & Chum, O. Label propagation for deep semi- supervised learning in Proceedings of the IEEE conference on computer vision and pattern recognition (2019), 5070–5079. 61. Jia,L.,Zhang,Z.,Wang,L.,Jiang,W.&Zhao,M. Adaptive neighborhood propagation by joint L2, 1-norm regularized sparse coding for representation and classification in 2016 IEEE 16th International Conference on Data Mining (ICDM) (2016), 201–210. 62. Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molec- ular graph generation. arXiv preprint arXiv:1802.04364 (2018). 63. Kingma, D. P. & Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013). 64. Kingma, D. P., Mohamed, S., Jimenez Rezende, D. & Welling, M. Semi-supervised learning with deep generative models. Advances in neural information processing sys- tems 27, 3581–3589 (2014). 65. Kipf,T.,Fetaya,E.,Wang,K.-C.,Welling,M.&Zemel,R.Neuralrelationalinference for interacting systems. arXiv preprint arXiv:1802.04687 (2018). 66. Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. International Conference on Learning Representations (2016). 67. Kipf, T. N. & Welling, M. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016). 132 68. Klicpera, J., Bojchevski, A. & G¨ unnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. International Conference on Learning Repre- sentations (2018). 69. Klicpera, J., Weißenberger, S. & G¨ unnemann, S. Diffusion improves graph learning. Advances in Neural Information Processing Systems 32, 13354–13366 (2019). 70. Koren, Y., Bell, R. & Volinsky, C. Matrix factorization techniques for recommender systems. Computer 42, 30–37 (2009). 71. Krizhevsky,A.,Hinton,G.,et al.Learningmultiplelayersoffeaturesfromtinyimages (2009). 72. Lample, G., Conneau, A., Denoyer, L. & Ranzato, M. Unsupervised machine transla- tion using monolingual corpora only. arXiv preprint arXiv:1711.00043 (2017). 73. LeCun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361, 1995 (1995). 74. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. nature 521, 436–444 (2015). 75. Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks in Workshop on challenges in representation learning, ICML 3 (2013). 76. Levy, O. & Goldberg, Y. Neural word embedding as implicit matrix factorization. Advances in neural information processing systems 27, 2177–2185 (2014). 77. Li,C.,Jia,K.,Shen,D.,Shi,C.-J.R.&Yang,H.Hierarchical Representation Learning for Bipartite Graphs. in IJCAI (2019), 2873–2879. 78. Li, Q., Han, Z. & Wu, X.-M. Deeper insights into graph convolutional networks for semi-supervised learning in Thirty-Second AAAI conference on artificial intelligence (2018). 79. Li, Q., Wu, X.-M., Liu, H., Zhang, X. & Guan, Z. Label efficient semi-supervised learningviagraphfiltering inProceedingsoftheIEEEConferenceonComputerVision and Pattern Recognition (2019), 9582–9591. 80. Li, Y. & Ye, J. 
Learning adversarial networks for semi-supervised text classification via policy gradient in Proceedings of the 24th acm sigkdd international conference on knowledge discovery & data mining (2018), 1715–1723. 81. Lin, Y., Liu, Z., Sun, M., Liu, Y. & Zhu, X. Learning entity and relation embeddings for knowledge graph completion in Proceedings of the AAAI Conference on Artificial Intelligence 29 (2015). 82. Linden, G., Smith, B. & York, J. Amazon. com recommendations: Item-to-item col- laborative filtering. IEEE Internet computing 7, 76–80 (2003). 83. Liu, L., Zhou, T., Long, G., Jiang, J. & Zhang, C. Learning to propagate for graph meta-learning. arXiv preprint arXiv:1909.05024 (2019). 84. Liu, S., Demirel, M. F. & Liang, Y. N-gram graph: Simple unsupervised represen- tation for graphs, with applications to molecules in Advances in Neural Information Processing Systems (2019), 8464–8476. 85. Long, M., Cao, Z., Wang, J. & Jordan, M. I. Conditional adversarial domain adapta- tion in Advances in neural information processing systems (2018), 1640–1650. 86. Luo, D., Ding, C. H., Nie, F. & Huang, H. Cauchy graph embedding in ICML (2011). 133 87. Luo, M., Chang, X., Nie, L., Yang, Y., Hauptmann, A. G. & Zheng, Q. An adaptive semisupervised feature analysis for video semantic recognition. IEEE transactions on cybernetics 48, 648–660 (2017). 88. Ma, J., Tang, W., Zhu, J. & Mei, Q. A Flexible Generative Framework for Graph- basedSemi-supervisedLearning inAdvancesinNeuralInformationProcessingSystems (2019), 3281–3290. 89. Ma, X., Zhang, T. & Xu, C. Gcan: Graph convolutional adversarial network for un- supervised domain adaptation in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), 8266–8276. 90. Maaten, L. v. d. & Hinton, G. Visualizing data using t-SNE. Journal of machine learning research 9, 2579–2605 (2008). 91. Mai, X. & Couillet, R. Random matrix-inspired improved semi-supervised learning on graphs in International Conference on Machine Learning (2018). 92. Mai, X. & Couillet, R. Consistent semi-supervised graph regularization for high di- mensional data. Journal of Machine Learning Research 22, 1–48 (2021). 93. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I. & Frey, B. Adversarial autoen- coders. arXiv preprint arXiv:1511.05644 (2015). 94. McAuley, J., Targett, C., Shi, Q. & Van Den Hengel, A. Image-based recommenda- tions on styles and substitutes in Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval (2015), 43–52. 95. Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient Estimation of Word Repre- sentations in Vector Space 2013. 96. Mikolov,T.,Sutskever,I.,Chen,K.,Corrado,G.S.&Dean,J.Distributedrepresenta- tionsofwordsandphrasesandtheircompositionality. Advances in neural information processing systems 26 (2013). 97. Min, Y., Wenkel, F. & Wolf, G. Scattering GCN: Overcoming Oversmoothness in Graph Convolutional Networks. arXiv preprint arXiv:2003.08414 (2020). 98. Miyato, T., Maeda, S., Koyama, M. & Ishii, S. Virtual Adversarial Training: A Regu- larization Method for Supervised and Semi-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1979–1993. doi:10.1109/TPAMI. 2018.2858821 (2019). 99. Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J. & Bronstein, M. M. Geo- metric deep learning on graphs and manifolds using mixture model cnns inProceedings of the IEEE conference on computer vision and pattern recognition (2017),5115–5124. 100. Nene, S. 
A., Nayar, S. K., Murase, H., et al. Columbia object image library (coil-100). Technical Report CUCS-005-96 (1996). 101. Nie, F., Wang, X., Deng, C. & Huang, H. Learning a structured optimal bipartite graph for co-clustering in Proceedings of the 31st International Conference on Neural Information Processing Systems (2017), 4132–4141. 102. Nie,F.,Xiang,S.,Liu,Y.&Zhang,C.Ageneralgraph-basedsemi-supervisedlearning with novel class discovery. Neural Computing and Applications 19, 549–555 (2010). 103. Nie, F., Zhu, W. & Li, X. Unsupervised large graph embedding in Thirty-first AAAI conference on artificial intelligence (2017). 134 104. Nie, L., Li, Y., Feng, F., Song, X., Wang, M. & Wang, Y. Large-scale question tag- ging via joint question-topic embedding learning. ACM Transactions on Information Systems (TOIS) 38, 1–23 (2020). 105. Nie, L., Liu, M. & Song, X. Multimodal learning toward micro-video understanding (Morgan & Claypool Publishers, 2019). 106. Niepert, M., Ahmed, M. & Kutzkov, K. Learning convolutional neural networks for graphs in International conference on machine learning (2016), 2014–2023. 107. NT, H. & Maehara, T. Revisiting graph neural networks: All we have is low-pass filters. arXiv preprint arXiv:1905.09550 (2019). 108. Ortega, A., Frossard, P., Kovaˇ cevi´ c, J., Moura, J. M. & Vandergheynst, P. Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE 106, 808–828 (2018). 109. Ou, M., Cui, P., Pei, J., Zhang, Z. & Zhu, W. Asymmetric transitivity preserving graph embedding in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (2016), 1105–1114. 110. Page, L., Brin, S., Motwani, R. & Winograd, T. The PageRank citation ranking: Bringing order to the web. tech. rep. (Stanford InfoLab, 1999). 111. Pan, S., Hu, R., Long, G., Jiang, J., Yao, L. & Zhang, C. Adversarially regularized graph autoencoder for graph embedding. arXiv preprint arXiv:1802.04407 (2018). 112. Paszke,A.,Gross,S.,Chintala,S.,Chanan,G.,Yang,E.,DeVito,Z., et al.Automatic differentiation in pytorch (2017). 113. Perozzi, B., Al-Rfou, R. & Skiena, S. Deepwalk: Online learning of social representa- tions inProceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (2014), 701–710. 114. Poursaeed, O., Katsman, I., Gao, B. & Belongie, S. Generative adversarial perturba- tions in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog- nition (2018), 4422–4431. 115. Qiu, J., Tang, J., Ma, H., Dong, Y., Wang, K. & Tang, J. Deepinf: Social influence prediction with deep learning in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2018), 2110–2119. 116. Qu, M., Bengio, Y. & Tang, J. Gmnn: Graph markov neural networks in International conference on machine learning (2019), 5241–5250. 117. Quionero-Candela, J., Sugiyama, M., Schwaighofer, A. & Lawrence, N. D. Dataset shift in machine learning (The MIT Press, 2009). 118. Radford, A., Metz, L. & Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015). 119. Ru,L.,Du,B.&Wu,C.Multi-temporalsceneclassificationandscenechangedetection with correlation based fusion. IEEE Transactions on Image Processing 30, 1382–1394 (2020). 120. Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back- propagating errors. nature 323, 533–536 (1986). 121. Salakhutdinov, R. & Hinton, G. 
Deep boltzmann machines in Artificial intelligence and statistics (2009), 448–455. 135 122. Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M. & Monfardini, G. The graph neural network model. IEEE transactions on neural networks 20, 61–80 (2008). 123. Scott, J. Social network analysis. Sociology 22, 109–127 (1988). 124. Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B. & Eliassi-Rad, T. Collective classification in network data. AI magazine 29, 93–93 (2008). 125. Shang, J., Qu, M., Liu, J., Kaplan, L. M., Han, J. & Peng, J. Meta-path guided embedding for similarity search in large-scale heterogeneous information networks. arXiv preprint arXiv:1610.09769 (2016). 126. Shchur, O., Mumme, M., Bojchevski, A. & G¨ unnemann, S. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018). 127. Shi, C., Hu, B., Zhao, W. X. & Philip, S. Y. Heterogeneous information network em- beddingforrecommendation.IEEE Transactions on Knowledge and Data Engineering 31, 357–370 (2018). 128. Shi, Z., Osher, S. & Zhu, W. Weighted nonlocal laplacian on interpolation from sparse data. Journal of Scientific Computing 73, 1164–1177 (2017). 129. Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A. & Vandergheynst, P. The emerging field of signal processing on graphs: Extending high-dimensional data anal- ysis to networks and other irregular domains. IEEE signal processing magazine 30, 83–98 (2013). 130. Song,Z.,Yang,X.,Xu,Z.&King,I.Graph-basedsemi-supervisedlearning:Acompre- hensive review. IEEE Transactions on Neural Networks and Learning Systems (2022). 131. Sun, K., Lin, Z. & Zhu, Z. Multi-stage self-supervised learning for graph convolutional networks on graphs with few labeled nodes in Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020), 5892–5899. 132. Tang, J., Qu, M. & Mei, Q. Pte: Predictive text embedding through large-scale hetero- geneous text networks in Proceedings of the 21th ACM SIGKDD International Con- ference on Knowledge Discovery and Data Mining (2015), 1165–1174. 133. Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J. & Mei, Q. Line: Large-scale informa- tion network embedding in Proceedings of the 24th international conference on world wide web (2015), 1067–1077. 134. Tarvainen, A. & Valpola, H. Mean teachers are better role models: Weight-averaged consistencytargetsimprovesemi-superviseddeeplearningresults inAdvancesinneural information processing systems (2017), 1195–1204. 135. Van Engelen, J. E. & Hoos, H. H. A survey on semi-supervised learning. Machine Learning 109, 373–440 (2020). 136. Velickovic, P., Fedus, W., Hamilton, W. L., Li` o, P., Bengio, Y. & Hjelm, R. D. Deep Graph Infomax. International Conference on Learning Representations 2, 4 (2019). 137. Veliˇ ckovi´ c, P., Cucurull, G., Casanova, A., Romero, A., Li` o, P. & Bengio, Y. Graph Attention Networks. International Conference on Learning Representations (2018). 138. Wan, H., Luo, Y., Peng, B. & Zheng, W.-S. Representation Learning for Scene Graph Completion via Jointly Structural and Visual Embedding. in IJCAI (2018), 949–956. 139. Wan, S., Zhan, Y., Liu, L., Yu, B., Pan, S. & Gong, C. Contrastive Graph Poisson Networks: Semi-Supervised Learning with Extremely Limited Labels. Advances in Neural Information Processing Systems 34 (2021). 136 140. Wang, D., Cui, P. & Zhu, W. Structural deep network embedding in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining (2016), 1225–1234. 141. Wang, D., Du, B., Zhang, L. & Xu, Y. 
Adaptive Spectral–Spatial Multiscale Contex- tualFeatureExtractionforHyperspectralImageClassification. IEEE Transactions on Geoscience and Remote Sensing 59, 2461–2477 (2020). 142. Wang, F. & Zhang, C. Label propagation through linear neighborhoods. IEEE Trans- actions on Knowledge and Data Engineering 20, 55–67 (2007). 143. Wang, H. & Leskovec, J. Combining Graph Convolutional Neural Networks and Label Propagation. ACM Transactions on Information Systems (TOIS) 40, 1–27 (2021). 144. Wang, H., Wang, J., Wang, J., Zhao, M., Zhang, W., Zhang, F., et al. Graphgan: Graph representation learning with generative adversarial nets in Proceedings of the AAAI conference on artificial intelligence 32 (2018). 145. Wang, M., Fu, W., Hao, S., Tao, D. & Wu, X. Scalable semi-supervised learning by efficient anchor graph regularization. IEEE Transactions on Knowledge and Data Engineering 28, 1864–1877 (2016). 146. Wang,X.,Ji,H.,Shi,C.,Wang,B.,Ye,Y.,Cui,P.,etal.Heterogeneousgraphattention network in The World Wide Web Conference (2019), 2022–2032. 147. Weston, J., Ratle, F., Mobahi, H. & Collobert, R. in Neural networks: Tricks of the trade 639–655 (Springer, 2012). 148. Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T. & Weinberger, K. Simplifying graph convolutional networks in International conference on machine learning (2019), 6861– 6871. 149. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C. & Philip, S. Y. A comprehensive sur- vey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems (2020). 150. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C. & Yu, P. S. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596 (2019). 151. Xie, T., Wang, B. & Kuo, C.-C. J. GraphHop: an enhanced label propagation method for node classification. IEEE Transactions on Neural Networks and Learning Systems (2022). 152. Xie, Y., Xu, Z., Zhang, J., Wang, Z. & Ji, S. Self-supervised learning of graph neural networks: A unified review. arXiv preprint arXiv:2102.10757 (2021). 153. Xu, D., Zhu, Y., Choy, C. B. & Fei-Fei, L. Scene graph generation by iterative mes- sage passing in Proceedings of the IEEE conference on computer vision and pattern recognition (2017), 5410–5419. 154. Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? International Conference on Learning Representations (2018). 155. Yang, C., Zhuang, P., Shi, W., Luu, A. & Li, P. Conditional structure generation through graph variational generative adversarial nets in Advances in Neural Informa- tion Processing Systems (2019), 1340–1351. 156. Yang,J.,Lu,J.,Lee,S.,Batra,D.&Parikh,D.Graphr-cnnforscenegraphgeneration in Proceedings of the European conference on computer vision (ECCV) (2018), 670– 685. 137 157. Yang, Z., Cohen, W. & Salakhudinov, R. Revisiting semi-supervised learning with graph embeddings in International conference on machine learning (2016), 40–48. 158. Ye, Q., Yang, J., Yin, T. & Zhang, Z. Can the virtual labels obtained by traditional LP approaches be well encoded in WLR? IEEE transactions on neural networks and learning systems 27, 1591–1598 (2015). 159. Yosinski, J., Clune, J., Bengio, Y. & Lipson, H. How transferable are features in deep neural networks? in Advances in neural information processing systems (2014), 3320– 3328. 160. You, J., Liu, B., Ying, Z., Pande, V. & Leskovec, J. Graph convolutional policy net- work for goal-directed molecular graph generation in Advances in neural information processing systems (2018), 6410–6421. 161. You, J., Ma, X., Ding, D. 
Y., Kochenderfer, M. & Leskovec, J. Handling missing data with graph representation learning. arXiv preprint arXiv:2010.16418 (2020). 162. You, J., Ying, R., Ren, X., Hamilton, W. & Leskovec, J. Graphrnn: Generating real- istic graphs with deep auto-regressive models in International conference on machine learning (2018), 5708–5717. 163. You, Y., Chen, T., Wang, Z. & Shen, Y. L2-GCN: Layer-Wise and Learned Efficient Training of Graph Convolutional Networks in Proceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (2020), 2127–2135. 164. You, Y., Chen, T., Wang, Z. & Shen, Y. When does self-supervision help graph convo- lutional networks? in International Conference on Machine Learning (2020), 10871– 10880. 165. Yu, L., Zhang, W., Wang, J. & Yu, Y. Seqgan: Sequence generative adversarial nets with policy gradient in Proceedings of the AAAI conference on artificial intelligence 31 (2017). 166. Yuan,X.,He,P.,Zhu,Q.&Li,X.Adversarialexamples:Attacksanddefensesfordeep learning. IEEE transactions on neural networks and learning systems 30, 2805–2824 (2019). 167. Zeng, H., Zhang, M., Xia, Y., Srivastava, A., Malevich, A., Kannan, R., et al. Decou- plingtheDepthandScopeofGraphNeuralNetworks.AdvancesinNeuralInformation Processing Systems 34 (2021). 168. Zeng, H., Zhou, H., Srivastava, A., Kannan, R. & Prasanna, V. Graphsaint: Graph samplingbasedinductivelearningmethod.International Conference on Learning Rep- resentations (2020). 169. Zhang, C., Song, D., Huang, C., Swami, A. & Chawla, N. V. Heterogeneous graph neural network in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2019), 793–803. 170. Zhang, C., Swami, A. & Chawla, N. V. Shne: Representation learning for semantic- associated heterogeneous networks in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (2019), 690–698. 171. Zhang, H., Zhang, Z., Zhao, M., Ye, Q., Zhang, M. & Wang, M. Robust triple-matrix- recovery-based auto-weighted label propagation for classification. IEEE transactions on neural networks and learning systems 31, 4538–4552 (2020). 138 172. Zhang, K., Lan, L., Kwok, J. T., Vucetic, S. & Parvin, B. Scaling up graph-based semisupervised learning via prototype vector machines. IEEE transactions on neural networks and learning systems 26, 444–457 (2014). 173. Zhang,M.,Liu,Y.,Luan,H.&Sun,M.Adversarial training for unsupervised bilingual lexicon induction in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017), 1959–1970. 174. Zhang, M. & Chen, Y. Weisfeiler-lehman neural machine for link prediction in Pro- ceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017), 575–583. 175. Zhang, M. & Chen, Y. Link prediction based on graph neural networks in Advances in Neural Information Processing Systems (2018), 5165–5175. 176. Zhang, M. & Chen, Y. Inductive matrix completion based on graph neural networks. arXiv preprint arXiv:1904.12058 (2019). 177. Zhang, M., Cui, Z., Neumann, M. & Chen, Y. An end-to-end deep learning archi- tecture for graph classification in Proceedings of the AAAI Conference on Artificial Intelligence 32 (2018). 178. Zhang, Y., Pal, S., Coates, M. & Ustebay, D. Bayesian graph convolutional neural networks for semi-supervised classification in Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 5829–5836. 179. Zhang, Y., Song, G., Du, L., Yang, S. & Jin, Y. 
Dane: Domain adaptive network embedding. arXiv preprint arXiv:1906.00684 (2019).
180. Zhang, Z., Li, F., Jia, L., Qin, J., Zhang, L. & Yan, S. Robust adaptive embedded label propagation with weight learning for inductive classification. IEEE transactions on neural networks and learning systems 29, 3388–3403 (2017).
181. Zhang, Z., Zhang, Y., Li, F., Zhao, M., Zhang, L. & Yan, S. Discriminative sparse flexible manifold embedding with novel graph for robust visual representation and label propagation. Pattern Recognition 61, 492–510 (2017).
182. Zhang, Z., Yang, H., Bu, J., Zhou, S., Yu, P., Zhang, J., et al. ANRL: Attributed Network Representation Learning via Deep Neural Networks. in IJCAI 18 (2018), 3155–3161.
183. Zhang, Z., Cui, P. & Zhu, W. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering (2020).
184. Zhou, D., Bousquet, O., Lal, T., Weston, J. & Schölkopf, B. Learning with local and global consistency. Advances in neural information processing systems 16, 321–328 (2003).
185. Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., et al. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434 (2018).
186. Zhu, H. & Koniusz, P. Simple spectral graph convolution in International Conference on Learning Representations (2020).
187. Zhu, J., Yan, Y., Zhao, L., Heimann, M., Akoglu, L. & Koutra, D. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in Neural Information Processing Systems 33, 7793–7804 (2020).
188. Zhu, J.-Y., Park, T., Isola, P. & Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks in Proceedings of the IEEE international conference on computer vision (2017), 2223–2232.
189. Zhu, X. & Ghahramani, Z. Towards semi-supervised classification with Markov random fields (2002).
190. Zhu, X., Ghahramani, Z. & Lafferty, J. D. Semi-supervised learning using gaussian fields and harmonic functions in Proceedings of the 20th International conference on Machine learning (ICML-03) (2003), 912–919.
191. Zhu, X. & Goldberg, A. B. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning 3, 1–130 (2009).
192. Zhu, X. J. Semi-supervised learning literature survey (2005).
193. Zügner, D. & Günnemann, S. Adversarial attacks on graph neural networks via meta learning. arXiv preprint arXiv:1902.08412 (2019).

Appendices

C L-BGNN: Proof and Experimental Details

C.1 Proof of Theorem 2

Proof. Since $D(P_1, P_2)$ measures the distance between two distributions, we have
$$D(P_1, P_3) \le D(P_1, P_2) + D(P_2, P_3). \tag{C.1}$$
Then, this leads to
$$\begin{aligned}
L_{M,u} - L_{M,v\to u}
&= \sum_{h} p_u(h)\cdot D\big(\hat{P}_u(y|h;\theta), P(y|h)\big) - \sum_{h} p_{v\to u}(h)\cdot D\big(\hat{P}_{v\to u}(y|h;\theta), P(y|h)\big) \\
&\le \sum_{h} \big(p_u(h) - p_{v\to u}(h)\big)\cdot D\big(\hat{P}_{v\to u}(y|h;\theta), P(y|h)\big) + \sum_{h} p_u(h)\cdot D\big(\hat{P}_u(y|h;\theta), p_{v\to u}(y|h;\theta)\big) \\
&= \epsilon\, L_{M,v\to u} + d.
\end{aligned} \tag{C.2}$$

C.2 Hyperparameter Tuning

We use grid search to tune our model on every dataset to find the best hyperparameters. A parameter sweep is performed over the following choices:
• learning rate: 0.01, 0.001, 0.0001;
• epochs: 10, 50, 100, 500;
• minibatch size: 256, 512, 1024;
• weight decay: $1\times 10^{-3}$, $5\times 10^{-4}$, $1\times 10^{-4}$.
In the experiments, we observe that the L-BGNN model can converge in a relatively small number of epochs. This explains its short training time. The L-BGNN-MLP model contains two dense layers with a rectified activation layer and a dropout layer in between.
The output of the MLP is aligned to the range [−1, 1] with the hyperbolic tangent activation, which matches the distribution of the input features. As for the L-BGNN-Adv model, the discriminator also contains two dense layers but with the leaky ReLU activation, which can avoid the sparse gradient problem.

D GraphHop: Implementation of LR Classifiers

We explain the implementation details of the LR classifier in the label update step here. For labeled samples, the supervised loss term can be written as
$$L_l = \frac{1}{|\mathcal{L}|} \sum_{y\in\mathcal{L},\; h_M^{(t-1)}\in H_{M,l}^{(t-1)}} H\!\left(y,\; p_{\text{model}}(y\,|\,h_M^{(t-1)};\theta)\right), \tag{D.1}$$
where $H(p, q)$ is the entropy loss for classification and $\theta$ denotes the parameters of the classifier. The missing labels of unlabeled samples prevent direct supervision. However, we can leverage the label embeddings generated from the last iteration as pseudo-labels for supervision, since they also encode the current confidence in the node label distributions. The direct results of the classifier training are probability predictions that are consistent between the neighborhoods and the node itself, i.e., a smoothening regularization. Similar ideas are employed and viewed as consistency regularization [9, 98, 134], which enforces model predictions to be consistent under input transformations. Then, the loss term for unlabeled samples can be expressed as
$$L_u = \frac{1}{|\mathcal{U}||C|} \sum_{h^{(t-1)}\in H_u^{(t-1)},\; h_M^{(t-1)}\in H_{M,u}^{(t-1)}} H\!\left(\mathrm{Sharpen}(h^{(t-1)}),\; p_{\text{model}}(y\,|\,h_M^{(t-1)};\theta)\right), \tag{D.2}$$
where $\mathrm{Sharpen}(\cdot)$ is a function that adjusts the entropy of the label distribution within each embedding. In particular, a temperature is introduced to alter the categorical distribution, which is defined as
$$\mathrm{Sharpen}(p, T)_i = p_i^{1/T} \Big/ \sum_{j=1}^{C} p_j^{1/T}, \tag{D.3}$$
where $p$ is the categorical input distribution (i.e., a label embedding $h$ in GraphHop) and $T$ is the temperature hyperparameter. Using a higher value of $T$ produces a softer probability distribution over classes and vice versa. By adjusting the temperature, the model can decide how much confidence to place in the current label embeddings during the iteration. A similar idea is adopted in entropy minimization [38] and knowledge distillation [47]. The sharpening operation may yield nearly uniform label distributions when a large temperature is set. To enforce that the classifier outputs low-entropy predictions on unlabeled data, we minimize the entropy [9, 38] of the model prediction $p_{\text{model}}(y|h;\theta)$ (abbr. $p_{\text{model}}$) with an additional loss term
$$L'_u = \frac{1}{|\mathcal{U}||C|} \sum_{h_M^{(t-1)}\in H_{M,u}^{(t-1)}} H\!\left(p_{\text{model}},\; p_{\text{model}}\right). \tag{D.4}$$
Combining the above loss terms for labeled and unlabeled samples in Eqs. (D.1) and (D.2), respectively, the final loss function can be derived as
$$L = L_l + \alpha L_u + \beta L'_u, \tag{D.5}$$
where $\alpha$ and $\beta$ are two hyperparameters that adjust the scales of the unlabeled-sample term and the entropy term, respectively. The LR classifier is trained for multiple epochs until convergence. Finally, new label embeddings for the next iteration are predicted by Eq. (4.10).
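As a concrete illustration of the sharpening operation in Eq. (D.3) and the combined objective in Eq. (D.5), the snippet below sketches them with NumPy. It is a simplified illustration rather than the actual GraphHop training code; the variable names and the cross-entropy form used here are assumptions.

```python
import numpy as np

def sharpen(p, T):
    """Temperature sharpening of categorical distributions, Eq. (D.3).
    p: array of shape (n, C) with rows summing to one; smaller T -> sharper."""
    p_T = np.power(p, 1.0 / T)
    return p_T / p_T.sum(axis=1, keepdims=True)

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) averaged over rows; equals the entropy when target == pred."""
    return -np.mean(np.sum(target * np.log(pred + eps), axis=1))

def total_loss(y_onehot, pred_labeled, h_prev_unlabeled, pred_unlabeled,
               alpha=1.0, beta=0.1, T=0.5):
    # Supervised term on labeled nodes, Eq. (D.1).
    L_l = cross_entropy(y_onehot, pred_labeled)
    # Consistency term on unlabeled nodes with sharpened pseudo-labels, Eq. (D.2).
    L_u = cross_entropy(sharpen(h_prev_unlabeled, T), pred_unlabeled)
    # Entropy-minimization term, Eq. (D.4).
    L_ent = cross_entropy(pred_unlabeled, pred_unlabeled)
    # Weighted combination, Eq. (D.5).
    return L_l + alpha * L_u + beta * L_ent
```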
E GraphHop++: Theory and Proposition Proofs

Proof of Theorem 4

Proof. The cost function in Eq. (5.14) can be expressed as
$$-\sum_{i=1}^{n} u_{\alpha,i} \sum_{j=1}^{c} f_{ij} \log\big(\sigma(z)_j\big), \tag{E.1}$$
where $z = W f_{M,i}^{T}$ and $\sigma(z)_j = e^{z_j} / \sum_{k=1}^{c} e^{z_k}$ is the $j$th entry of the vector after applying the softmax function. Then, it can be rewritten as
$$\sum_{i=1}^{n} u_{\alpha,i} \sum_{j=1}^{c} f_{ij} \left( \log \sum_{k=1}^{c} e^{z_k} - z_j \right). \tag{E.2}$$
It is proved in [11] that the log-sum-exp function is a convex function. A nonnegative linear combination of convex functions is still a convex function. Thus, the cost function in Eq. (5.14) is a convex function, which leads to a convex optimization problem without any constraint.

Proof of Theorem 5

Proof. We first prove that the cost function in Eq. (5.17) is a convex function. This is done by showing that the three terms are all convex functions w.r.t. parameter $F$. In the first term, $\tilde{L}$ is the random-walk normalized Laplacian matrix. It has the eigendecomposition $\tilde{L} = Q\Lambda Q^{T}$, where $Q$ is the eigenvector matrix and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ is the diagonal eigenvalue matrix. We can rewrite the decomposition as $\tilde{L} = MM^{T}$, where $M = Q\Lambda^{\frac{1}{2}}$. As a result, we have
$$\mathrm{tr}(F^{T}\tilde{L}F) = \mathrm{tr}(F^{T}MM^{T}F) = \|M^{T}F\|_{F}^{2}, \tag{E.3}$$
where $\|\cdot\|_F$ is the Frobenius norm. The result is convex w.r.t. $F$ since the composition of an affine function and the Frobenius norm is still convex. Similar to the above derivation, the second term can be expressed as
$$\mathrm{tr}\big((F - F_{\text{init}})^{T} U (F - F_{\text{init}})\big) = \mathrm{tr}\big((F - F_{\text{init}})^{T} N N^{T} (F - F_{\text{init}})\big) = \|N^{T}(F - F_{\text{init}})\|_{F}^{2}, \tag{E.4}$$
where $N = U^{\frac{1}{2}}$ since $U$ is a diagonal matrix. So it is also a convex function w.r.t. $F$. Similarly, for the last term, we have
$$\mathrm{tr}\big((F - \sigma(F_M W^{T}))^{T} U_{\alpha} (F - \sigma(F_M W^{T}))\big) = \mathrm{tr}\big((F - \sigma(F_M W^{T}))^{T} S S^{T} (F - \sigma(F_M W^{T}))\big) = \|S^{T}(F - \sigma(F_M W^{T}))\|_{F}^{2}, \tag{E.5}$$
where $S = U_{\alpha}^{\frac{1}{2}}$ since $U_{\alpha}$ is a diagonal matrix. Thus, it is also a convex function w.r.t. $F$. Finally, since addition preserves convexity, the cost function is a convex one. Furthermore, the equality constraint is a linear function and the inequality constraint is convex. Therefore, Eq. (5.18) is a convex optimization problem.

Proof of Proposition 1

Proof. Without loss of generality, we set $F^{(0)} = F_{\text{init}}$. By Eq. (5.22), we have
$$F^{(t)} = (U_{\beta}\tilde{A})^{t-1} F_{\text{init}} + \sum_{i=0}^{t-1} (U_{\beta}\tilde{A})^{i} (I - U_{\beta}) Y'. \tag{E.6}$$
Since each diagonal entry of $U_{\beta}$ is in $(0,1)$ and the eigenvalues of $\tilde{A}$ lie in $[-1,1]$, we have
$$\lim_{t\to\infty} (U_{\beta}\tilde{A})^{t-1} = 0 \quad \text{and} \quad \lim_{t\to\infty} \sum_{i=0}^{t-1} (U_{\beta}\tilde{A})^{i} = (I - U_{\beta}\tilde{A})^{-1}. \tag{E.7}$$
Hence,
$$\begin{aligned}
F^{*} &= (I - U_{\beta}\tilde{A})^{-1} (I - U_{\beta}) Y' \\
&= \big(I - (I + U')^{-1}\tilde{A}\big)^{-1} \big(I - (I + U')^{-1}\big) Y' \\
&= (U' + I - \tilde{A})^{-1} (I + U')(I + U')^{-1} U' Y' \\
&= (U' + \tilde{L})^{-1} U' Y'.
\end{aligned} \tag{E.8}$$

Proof of Theorem 6

Proof. We first show by induction that the iteration in Eq. (5.22) satisfies the probability constraints in Eq. (5.18). Based on the assumption, the initial $F^{(0)}$ satisfies the constraints. Note that
$$Y'\mathbf{1}_c = (U+U_{\alpha})^{-1} U F_{\text{init}} \mathbf{1}_c + (U+U_{\alpha})^{-1} U_{\alpha}\, \sigma(F_M W^{T}) \mathbf{1}_c = (U+U_{\alpha})^{-1} U \mathbf{1}_n + (U+U_{\alpha})^{-1} U_{\alpha} \mathbf{1}_n = \mathbf{1}_n. \tag{E.9}$$
Now, we assume that variable $F^{(t-1)}$ at iteration $(t-1)$ meets the probability constraints. At the next iteration $t$, we have
$$F^{(t)}\mathbf{1}_c = U_{\beta}\tilde{A} F^{(t-1)} \mathbf{1}_c + (I - U_{\beta}) Y' \mathbf{1}_c = U_{\beta}\tilde{A}\mathbf{1}_n + (I - U_{\beta})\mathbf{1}_n = U_{\beta}\mathbf{1}_n + (I - U_{\beta})\mathbf{1}_n = \mathbf{1}_n. \tag{E.10}$$
Since all matrices have nonnegative entries and only sum and multiplication operations are involved, the constraint
$$F^{(t)} \ge 0 \tag{E.11}$$
holds. By induction, the convergence result in Eq. (5.19) satisfies the probability constraints. Based on the fact that the optimization problem in Eq. (5.18) is a convex optimization problem and on Proposition 1, the optimum that meets the constraints of the cost function is the solution to Eq. (5.18).
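As a sanity check on the closed-form limit in Eq. (E.8) and the constraint preservation shown in Eqs. (E.9)–(E.11), the short sketch below iterates Eq. (5.22) on a small random example and compares the result with $(U' + \tilde{L})^{-1} U' Y'$. The graph, $U'$, and $Y'$ here are random placeholders rather than quantities from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 3

# Random symmetric adjacency with self-loops and its row-normalized version A_tilde.
A = rng.random((n, n)); A = (A + A.T) / 2 + np.eye(n)
A_tilde = A / A.sum(axis=1, keepdims=True)
L_tilde = np.eye(n) - A_tilde              # random-walk normalized Laplacian

# U' is a positive diagonal matrix; Y' is a row-stochastic target matrix.
U_prime = np.diag(rng.uniform(0.5, 2.0, size=n))
Y_prime = rng.random((n, c)); Y_prime /= Y_prime.sum(axis=1, keepdims=True)

# Iteration of Eq. (5.22): F <- U_beta A_tilde F + (I - U_beta) Y',
# with U_beta = (I + U')^{-1}.
U_beta = np.linalg.inv(np.eye(n) + U_prime)
F = Y_prime.copy()
for _ in range(2000):
    F = U_beta @ A_tilde @ F + (np.eye(n) - U_beta) @ Y_prime

# Closed-form limit of Eq. (E.8).
F_star = np.linalg.solve(U_prime + L_tilde, U_prime @ Y_prime)

print(np.max(np.abs(F - F_star)))          # should be numerically tiny
print(F.sum(axis=1))                        # rows remain probability vectors (all ones)
```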
Abstract
Graphs are generic data representation forms that effectively describe the geometric structures of data domains in various applications, e.g., social, sensor, transportation, communication, and citation networks. Graph learning, which learns knowledge from this graph-structured data, is an important machine learning application on graphs. Examples range from collective classification, such as document classification in academic graphs, to relational learning, such as item recommendation in e-commerce networks. The current state-of-the-art graph learning algorithms, graph neural networks (GNNs), have shown superior performance to traditional methods in tasks such as semi-supervised node classification, link prediction, and representation learning. Although significant progress has been achieved in numerous graph learning applications, there is still a wide range of problems where either GNNs' applicability has not been explored or it is restricted by internal deficiencies of the GNN framework, such as interpretability, scalability, and label efficiency. In this dissertation, we investigate and propose new graph learning techniques from the aspects of graph convolutional networks, graph signal processing, and regularization frameworks, which generalize the GNN application and identify a new path for solving graph learning problems with an efficiency and effectiveness co-design.
We first extend GNN to a specific bipartite graph structure for representation learning. The main challenges are that 1) bipartite graphs have two different but correlated feature domains, and 2) unsupervised network embedding and computational efficiency should be considered simultaneously. Accordingly, we propose an efficient layerwise trained bipartite graph neural network (L-BGNN), which employs a customized message passing on bipartite networks followed by an adversarial domain message alignment. The adversarial training enables L-BGNN to learn the node representations in an unsupervised manner without label supervision. In addition, L-BGNN adopts a layerwise training mechanism that can be efficiently generalized to large-scale bipartite graphs without performance deterioration. Extensive experiments on networks of various scales and numerous downstream tasks verify the superior performance of the proposed L-BGNN method.
Then, we propose a node classification algorithm named GraphHop, which is label-efficient, with a dominant performance at extremely small label rates, and can be directly generalized to large-scale networks with low memory cost and fast running time. We regard node attributes and labels as signals on graphs, where designed low-pass filters are respectively applied for signal smoothening. The two signal types are connected through classifier predictions. This separation of signal smoothening and feature space transformation reduces the memory storage of parameters, enabling generalization to large graphs. In addition, classifiers are introduced that are trained on local neighborhoods and make predictions for all nodes, serving as further smoothening of the label signals. The effective low-pass filtering of the graph signals enhances the capability at extremely limited label rates. Finally, we derive a different insight into the GraphHop method from a regularization framework. We show that the GraphHop model can be approximately cast as an alternate optimization of a particular regularization function on graphs. Then, based on this variational interpretation, we propose two approaches to address the approximation in the optimization of the GraphHop method. Experiments show that, equipped with these two improvements, our model, called GraphHop++, achieves significantly better performance than the former GraphHop model and the state-of-the-art methods on various benchmarking datasets, as well as in an application to object recognition with extremely limited labels.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Scaling up deep graph learning: efficient algorithms, expressive models and fast acceleration
Word, sentence and knowledge graph embedding techniques: theory and performance evaluation
Deep learning techniques for supervised pedestrian detection and critically-supervised object detection
Sampling theory for graph signals with applications to semi-supervised learning
Neighborhood and graph constructions using non-negative kernel regression (NNK)
Scaling up temporal graph learning: powerful models, efficient algorithms, and optimized systems
Labeling cost reduction techniques for deep learning: methodologies and applications
Human motion data analysis and compression using graph based techniques
Fast and label-efficient graph representation learning
Graph machine learning for hardware security and security of graph machine learning: attacks and defenses
Efficient machine learning techniques for low- and high-dimensional data sources
Critically sampled wavelet filterbanks on graphs
Object classification based on neural-network-inspired image transforms
Novel algorithms for large scale supervised and one class learning
Data-efficient image and vision-and-language synthesis and classification
Hardware-software codesign for accelerating graph neural networks on FPGA
Green image generation and label transfer techniques
Graph embedding algorithms for attributed and temporal graphs
Advanced techniques for object classification: methodologies and performance evaluation
Compression of signal on graphs with the application to image and video coding
Asset Metadata
Creator
Xie, Tian
(author)
Core Title
Efficient graph learning: theory and performance evaluation
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2022-08
Publication Date
07/08/2022
Defense Date
05/04/2022
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
graph learning,graph neural networks,label propagation,large-scale graphs,OAI-PMH Harvest,semi-supervised learning
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Kuo, C.-C. Jay (committee chair), Nakano, Aiichiro (committee member), Ortega, Antonio (committee member), Kannan, Rajgopal (committee member)
Creator Email
xiet@usc.edu,xietianfudan@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111369236
Unique identifier
UC111369236
Legacy Identifier
etd-XieTian-10818
Document Type
Dissertation
Rights
Xie, Tian
Type
texts
Source
20220708-usctheses-batch-951
(batch),
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu