EFFECTIVE GRAPH REPRESENTATION AND VERTEX CLASSIFICATION
WITH MACHINE LEARNING TECHNIQUES
by
Fenxiao (Jessica) Chen
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
May 2020
Copyright 2020 Fenxiao (Jessica) Chen
Contents
List of Tables
List of Figures
Abstract
1 Introduction
1.1 Significance of the Research
1.2 Contributions of the Research
1.2.1 Survey on Graph Representation Methods and Performance Benchmarking
1.2.2 Deep-Tree Recursive Neural Network (DTRNN)
1.2.3 DeepWalk-Assisted Graph PCA (DGPCA)
1.2.4 GraphHop: A Successive Subspace Learning (SSL) Method for Graph Vertex Classification
1.3 Organization of the Dissertation
2 Survey on Graph Representation Methods and Performance Benchmarking
2.1 Introduction
2.1.1 Challenges and Opportunities
2.1.2 Our Contribution
2.2 Definition and Preliminaries
2.2.1 Graph Notations
2.2.2 Optimization
2.2.3 Graph Input
2.2.4 Graph Output
2.3 Graph Representation Learning Techniques
2.3.1 Classical Methods
2.3.2 Matrix Factorization based Methods
2.3.3 Random Walk Based Methods
2.3.4 Machine Learning Methods
2.3.5 Large Graph Embedding Method
2.3.6 Other Emerging Method
2.4 Evaluation Methods
2.4.1 Vertex Classification
2.4.2 Link Prediction
2.5 Experiments of Small and Large Data Sets
2.5.1 Small Data set
2.5.2 Large Dataset
2.5.3 Experimental Results
2.6 Applications
2.6.1 Community Detection
2.6.2 Recommendation System
2.6.3 Graph Compression
2.7 Discussion
3 Graph Representation via Deep Tree Recurrent Neural Networks (DTRNN)
3.1 Introduction
3.2 Review of Related Work
3.3 Proposed Methodology
3.3.1 Deep-Tree Recursive Neural Network (DTRNN) Algorithm
3.3.2 Deep-Tree Generation (DTG) Algorithm
3.4 Impact of Attention Model
3.5 Experiments
3.5.1 Datasets
3.5.2 Experimental Settings
3.5.3 Baselines
3.5.4 Results and Analysis
3.5.5 Complexity Analysis
3.6 Conclusion
4 Graph Representation via DeepWalk-assisted Graph Principal Component Analysis (DGPCA)
4.1 Introduction
4.2 Graph Network Representation Learning
4.2.1 DeepWalk-based Vertex Representation
4.2.2 Graph Principal Component Analysis (GPCA)
4.2.3 DeepWalk-Assisted Graph PCA (DGPCA)
4.3 Experiments
4.3.1 Datasets
4.3.2 Benchmarking Methods
4.3.3 Experimental Setup
4.3.4 Experimental Results
4.3.5 Run-time Analysis
4.3.6 Comparison with Other Machine Learning Methods
4.4 Conclusion
5 GraphHop: A Successive Subspace Learning (SSL) Method for Graph Vertex Classification
5.1 Introduction
5.2 Related Work
5.2.1 Graph Convolutional Network (GCN)
5.2.2 Successive Subspace Learning (SSL)
5.3 Proposed GraphHop Method
5.3.1 Pre-process of Vertex Feature
5.3.2 Local-to-Global Attribute Building
5.3.3 1D Saab Transformation
5.3.4 A Sequence of GraphHop units in cascade
5.3.5 Classification and Ensembles
5.4 Experiments
5.4.1 Datasets
5.4.2 Experimental Settings
5.4.3 Baselines
5.4.4 Hyper Parameter Tuning
5.5 Results and Analysis
5.5.1 Classification Performance
5.5.2 Run-time Analysis
5.5.3 Properties of Saab Filters
5.5.4 BP and FF Design Comparison
5.6 Conclusion and Future Work
6 Conclusion and Future Work
6.1 Summary of the Research
6.2 Future Research Directions
6.2.1 Scalable Graph Embedding for Large Graphs
6.2.2 Interpretable Machine Learning Methods For Graph Embedding
Bibliography
List of Tables
2.1 Summary of Data Set Used
2.2 Youtube Micro and Macro F1 score, samples and weighted result
2.3 Flickr Micro and Macro F1 score, samples and weighted result
2.4 Wiki Micro and Macro F1 score, samples and weighted result
2.5 BlogCatalog Micro and Macro F1 score
2.6 Cora Accuracy using different methods
2.7 Facebook Link Prediction Average Precision
2.8 Wiki Link Prediction True Positive, True Negative, False Positive, False Negative, Area Under Curve, Average Precision Result
2.9 Time Complexity Comparison for Youtube, Flickr, and Wiki Data
4.1 Comparison of run-time of DGPCA and GRNN on three data sets
4.2 Comparison of the run-time for DGPCA and other GCN based methods on three data sets
5.1 Hyper Parameter Tuning Result
5.2 Hyper Parameter Tuning Results
5.3 Results for node classification accuracies on Cora, Citeseer, and PubMed datasets
5.4 Results for node classification run-time on Cora, Citeseer, and PubMed datasets in seconds
6.1 The SRCC performance comparison (×100) for the SGNS alone, the SGNS+PPA and the SGNS+PVN against word similarity datasets, where the last row is the average performance weighted by the pair number of each dataset.
6.2 The SRCC performance comparison (×100) for the SGNS alone, the SGNS+PPA and the SGNS+PVN against word analogy datasets.
6.3 Word similarity datasets used in our experiments, where pairs indicate the number of word pairs in each dataset.
6.4 The SRCC performance comparison (×100) for the GloVe alone, the GloVe+PPA and the GloVe+PVN against word similarity and analogy datasets.
6.5 The SRCC performance comparison (×100) for the SGNS alone and the SGNS+PDE against word similarity datasets, where the last row is the average performance weighted by the pair number of each dataset.
6.6 The SRCC performance comparison (×100) for the SGNS alone and the SGNS+PDE against word analogy datasets.
6.7 The SRCC performance comparison (×100) for the SGNS alone and the SGNS+PVN/PDE model against word similarity datasets, where the last row is the average performance weighted by the pair number of each dataset.
6.8 The SRCC performance comparison (×100) of the SGNS alone and the SGNS+PVN/PDE combined model against word analogy datasets.
List of Figures
1.1 Graph Embedding Life Cycle
1.2 Node Embedding in Network
1.3 Example of The Input Adjacency and Feature Matrix
2.1 Embedding Dimension vs. Node Classification Accuracy
3.1 The workflow of the DTRNN algorithm.
3.2 (a) The graph to be converted into a tree, (b) the tree converted to a subgraph using breadth-first search, (c) the tree converted using the deep-tree generation method (our method), and (d) the DTRNN constructed from the tree with LSTM units.
3.3 Comparison of four methods on three data sets (from left to right): Citeseer, Cora and WebKB, where the x-axis is the percentage of training data and the y-axis is the average Macro-F1 score.
3.4 Comparison of runtime for three data sets (from left to right): Citeseer, Cora and WebKB, where the x-axis is the percentage of training data and the y-axis is the runtime in seconds.
3.5 Performance comparison of DTRNN with and without the soft attention layer (from left to right: Citeseer, Cora and WebKB), where the x-axis is the percentage of training data and the y-axis is the average Macro-F1 score.
4.1 The system diagram of the proposed DGPCA method.
4.2 Comparison of four methods on three data sets (from left to right: Citeseer, Cora and WebKB), where the x-axis is the percentage of training data and the y-axis is the average Macro-F1 score.
5.1 The workflow of the GraphHop architecture.
5.2 Vertex Representation: Max Pooling
5.3 Dimension reduction using the Saab transform: adapted design for 2D data.
5.4 Classification accuracy versus dimension for different numbers of k.
5.5 Classification accuracy for Cora
5.6 Classification accuracy for Citeseer
5.7 The log energy plot and preservation of energy as a function of AC filters
6.1 Using SEM and NHEM for graph coarsening [54]
6.2 Adjacency matrix and matching matrix [54]
6.3 Comparison of variances of the top 30 principal components for SGNS and GloVe baseline embeddings with and without PVN.
6.4 A post-processing system for word embedding methods using integrated PVN and PDE.
Abstract
Graph representation learning is an important task because, in many applications, real-world data naturally come in the form of graphs. Graph
data often come in high-dimensional irregular form which makes them more difficult to
analyze than the traditional low-dimensional data. Graph embedding has been widely
used to convert graph data into a lower dimensional space while preserving the intrinsic
properties of the original data.
In this dissertation, we study two graph embedding problems: 1) developing effective graph embedding techniques that give researchers a deeper understanding of the collected data more efficiently; and 2) using the embedded information in applications such as node classification and link prediction.
To learn and encode graphs into a low-dimensional embedding efficiently, we first present a novel DeepWalk-assisted Graph PCA (DGPCA) method for processing language network data represented by graphs. This method can generate a precise text representation for nodes (or vertices) in language networks.
Unlike other existing work, our learned low-dimensional vector representations add flexibility in exploring the vertex neighborhood information while reducing the noise contained in
the original data. To demonstrate the effectiveness, we use DGPCA to classify vertices
that contain text information in three language networks. Experimentally, DGPCA is
shown to perform well on the language datasets in comparison to several state-of-the-art
benchmarking methods.
To solve the node prediction problem, we first propose a novel graph-to-tree conversion mechanism called the deep-tree generation (DTG) algorithm to predict text data represented by graphs. The DTG method can generate a richer and more
accurate representation for nodes (or vertices) in graphs. It adds flexibility in explor-
ing the vertex neighborhood information to better reflect the second order proximity
and homophily equivalence in a graph. Then, a Deep-Tree Recursive Neural Network
(DTRNN) method is presented and used to classify vertices that contain text data in
graphs. To demonstrate the effectiveness of the DTRNN method, we apply it to three
real-world graph datasets and show that the DTRNN method outperforms several state-
of-the-art benchmarking methods.
Next, we review a wide range of graph embedding techniques and present a structured analysis of their performance. We evaluate several state-of-the-art approaches on both small and large datasets and compare their performance. Then, we provide a discussion of how to choose the best graph embedding technique based on the size of the input data and the purpose of the application. Some potential
applications and future directions are discussed with concluding remarks at the end of
the paper. Finally, an open-source Python library named GRLL (Graph Representation Learning Library, available at https://github.com/jessicachen626/GRLL) is provided for all the embedding and evaluation methods mentioned in the paper.
Finally, we present an effective and explainable graph vertex classification method,
called GraphHop. GraphHop generates an effective representation for each vertex in
graphs without backpropagation. GraphHop determines the local-to-global attributes
of each vertex through successive one-hop information exchange. To control the rapid increase in the dimension of vertex attributes, the 1D Saab transform is adopted for dimension reduction. Afterwards, our GraphHop architecture ensembles vertex attributes for the classification task, where multiple GraphHop units are cascaded to obtain high-order proximity information. We apply GraphHop to three real-world graph datasets and show that it offers state-of-the-art classification performance at much lower training complexity.
Chapter 1
Introduction
1.1 Significance of the Research
Research on graph representation learning has gained increasing attention among
researchers because many speech/text data such as social networks, linguistic (word co-
occurrence) networks, biological networks and many other multi-media domain specific
data can be well represented by graphs. Graph representation allows relational knowl-
edge about interacting entities to be stored and accessed efficiently. Analyzing these
graph data can provide significant insights into community detection, behavior analysis
and many other useful applications for node classification, link prediction and cluster-
ing. To analyze the graph data, the first step is to find an accurate and efficient graph
representation. The steps of graph embedding are shown in Figure 1.1. The input is a graph represented by an adjacency matrix, sometimes concatenated with additional features as shown in Figure 1.3. Graph representation learning aims to embed this matrix into a latent space that captures the intrinsic characteristics of the original graph. For each node u in the network, we embed it into a d-dimensional space that represents the features of that node, as shown in Figure 1.2.
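As a toy illustration of this setup (with randomly generated placeholder values rather than a real network), the following minimal sketch shows the shapes involved: an adjacency matrix, optional node features concatenated to it, and a d-dimensional vector per node as the desired output.

```python
import numpy as np

# Toy illustration of the setup described above: the input is an adjacency
# matrix A (optionally concatenated with a feature matrix F), and the goal is
# to produce a d-dimensional vector per node. All values here are placeholders.
n, d = 5, 3                                  # 5 nodes, embedding dimension d = 3
A = np.random.randint(0, 2, size=(n, n))     # adjacency matrix
F = np.random.randn(n, 4)                    # additional node features
inputs = np.hstack([A, F])                   # concatenated input, shape (5, 9)

embedding = np.random.randn(n, d)            # placeholder for the learned embedding
print(inputs.shape, embedding.shape)
```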
Obtaining an accurate representation for the graph is challenging because of several
factors. Finding the optimal dimension of the representation is not an easy task. Rep-
resentation with higher number of dimensions might preserve more information of the
original graph at the cost of more space and time. Representation with lower number
of dimensions might be time and space efficient, and also might reduce the noise in
the original graph, while risking the loss of some information from the original graph.
The choice of dimension can also be domain-specific and depends on the type of input
graph. Choosing which property of the graph to embed is also challenging given the
plethora of properties graphs have.
In this dissertation, we first focus on the node prediction task with deep learning models. Specifically, we explore node classification using tree-structured recursive neural networks. Then we turn to improving the accuracy and efficiency of the DeepWalk-based matrix factorization method.
Figure 1.3: Example of The Input Adjacency and Feature Matrix
1.2 Contributions of the Research
1.2.1 Survey on Graph Representation Methods and Performance Benchmarking
A wide range of graph embedding techniques and a structured analysis of the performance of different methods are presented. We evaluate several state-of-the-art approaches on both small and large data sets and compare their performance. A discussion of how to choose the best graph embedding technique based on the input data and the purpose of the application is also provided. We claim two major contributions in this
review:
We analyze the performance of each method on both small and large data sets. The domain-specific performance of different methods on different data sets is analyzed in detail. We believe this is the first review that systematically analyzes the domain- and application-specific performance of different embedding methods.
We present Graph Representation Learning Library (GRLL), an open-source
Python library that provides a unified interface for all graph embedding meth-
ods discussed in this paper. To the best of our knowledge, it covers more graph embedding techniques than any other library to date.
1.2.2 Deep-Tree Recursive Neural Network (DTRNN)
DTRNN can be used for node classification or node prediction based on the known labels of a node's neighbours. We propose to enhance the deep-tree representation of the original graph in terms of structure preservation, so that nodes that should be embedded together in the vector space do not lose that information in the graph-to-tree conversion step.
We propose a graph-to-tree conversion mechanism called the DTG algorithm. The DTG algorithm captures the structure of the original graph well, especially its second-order proximity. The second-order proximity between vertices is not
only determined by observed direct connections but also shared neighborhood
structures of vertices [88]. To put it another way, nodes with shared neighbors are
likely to be similar.
We present the DTRNN method that brings the merits of the Long Short-Term
Memory (LSTM) network [41] and the deep tree representation together. The pro-
posed DTRNN method not only preserves link features better but also captures the impact of nodes with more outgoing and incoming edges.
We extend the tree-structured RNN and model long-distance vertex relations on more representative subgraphs to offer state-of-the-art performance, as demonstrated in our experiments. An in-depth analysis of the impact of the attention mechanism and the runtime complexity of our method is also pro-
vided.
1.2.3 DeepWalk-Assisted Graph PCA (DGPCA)
Graph PCA [80] has been used for processing image data but not so much for language data due to its low embedding accuracy. To solve this problem, we combine a DeepWalk-assisted matrix factorization method with graph PCA to facilitate faster and more accu-
rate embedding for language networks.
We develop a framework that combines a text-assisted DeepWalk method with
graph PCA to generate a more accurate vector representation for language net-
works.
The dimension of the learned vector representation is reduced compared to the
original dimension to allow fast processing. We call it the DeepWalk-assisted
Graph PCA (DGPCA) method. To the best of our knowledge, this is the first
work that applies a noise term to the robust graph PCA method so as to reduce
errors and increase node prediction accuracy in language networks.
We evaluate the proposed DGPCA method on three language network datasets.
DGPCA offers state-of-the-art performance in the conducted experiments.
1.2.4 GraphHop: A Successive Subspace Learning (SSL) Method for Graph Vertex Classification
Unlike the graph convolutional network (GCN), GraphHop generates an effective repre-
sentation for each vertex in graphs without backpropagation. It determines the local-to-
global attributes of each vertex through successive one-hop information exchange. The 1D Saab transform is used for dimension reduction. Then our GraphHop method ensembles vertex attributes for the classification task, where multiple GraphHop units are cascaded to obtain high-order proximity information.
We propose a local-to-global attribute building method through iterative one-hop
information exchange.
Our method uses a 1D Saab dimension reduction technique adapted from the 2D Saab transform, which is used for classifying and ensembling node attributes.
We evaluate the proposed GraphHop method on three language network datasets to show that it offers close to state-of-the-art performance with greatly reduced run time, as demonstrated in our conducted experiments. To the best of our knowledge, this is the first time a feed-forward MLP is used for graph-structured text data.
1.3 Organization of the Dissertation
The rest of the dissertation is organized as follows. In Chapter 2, we review the research
background, including different graph embedding techniques, along with the experi-
mental results we obtained by running different embedding methods. In Chapter 3, we propose a training pipeline to address the node classification problem. In Chapter 4, we propose a graph embedding method based on DeepWalk that minimizes the error in the original data. The GraphHop architecture is presented in Chapter 5. Finally, concluding
remarks and future research directions are given in Chapter 6.
Chapter 2
Survey on Graph Representation Methods and Performance Benchmarking
2.1 Introduction
Research on graph representation learning has gained increasing attention among
researchers because many speech/text data such as social networks, linguistic (word co-
occurrence) networks, biological [90] networks and many other multi-media domain
specific data can be well represented by graphs. Graph representation allows relational
knowledge about interacting entities to be stored and accessed efficiently [1]. Analyz-
ing these graph data can provide significant insights into community detection [28],
behavior analysis and many other useful applications for node classification [4], link
prediction [56] and clustering [34]. To analyze the graph data, the first step is to find an
accurate and efficient graph representation.
2.1.1 Challenges and Opportunities
Obtaining an accurate representation for the graph is challenging because of several
factors. Finding the optimal dimension of the representation is not an easy task. Rep-
resentation with higher number of dimensions might preserve more information of the
original graph at the cost of more space and time. Representation with lower number
of dimensions might be time and space efficient, and also might reduce the noise in
the original graph, while risking the loss of some information from the original graph.
The choice of dimension can also be domain-specific and depends on the type of input
graph. Choosing which property of the graph to embed is also challenging given the
plethora of properties graphs have.
2.1.2 Our Contribution
In this review, we first introduce the problem statement along with several graph embed-
ding definitions. Then we provide a description of various graph embedding methods.
The experiments and setup, which are implemented in the Python library GRLL, are discussed in Section 2.5. Finally, conclusions and future work are discussed. We claim two major
contributions in this review:
Different graph embedding techniques are discussed in this paper, including the
most recent graph representation learning models. We analyze the performance of each method on both small and large data sets. The domain-specific performance of different methods on different data sets is analyzed in detail. We believe this is the first review that systematically analyzes the domain- and application-specific performance of different embedding methods.
We present GRLL, an open-source Python library that provides a unified inter-
face for all graph embedding methods discussed in this paper. To the best of our
knowledge, it covers more graph embedding techniques than any other library to date.
2.2 Definition and Preliminaries
2.2.1 Graph Notations
Graph Definition A graph $G = (V, E)$ consists of vertices $V = \{v_1, v_2, \ldots, v_n\}$ and edges $E = \{e_{i,j}\}$, where edge $e_{i,j}$ connects vertex $v_i$ to vertex $v_j$. Graphs are usually represented by an adjacency matrix or a derived vector space representation [20]. The adjacency matrix $A$ of graph $G$ contains the non-negative weights associated with each edge, $a_{ij} \geq 0$. If $v_i$ and $v_j$ are not directly connected to one another, $a_{ij} = 0$. For undirected graphs, $a_{ij} = a_{ji}$ for all $1 \leq i \leq j \leq n$.
Graph Embedding Graph embedding aims to map each node to a low-dimensional vector while preserving the distance characteristics among the nodes. Embedding methods use the information in the graph $G = (V, E)$ to find a mapping $f : v_i \rightarrow x_i \in \mathbb{R}^d$, where $d \ll |V|$, and $X_i = \{x_1, x_2, \ldots, x_d\}$ is the feature vector that captures the structural properties of each vertex $v_i$.
First-Order Proximity The first-order proximity in a network is the pairwise proximity between vertices. For example, in weighted networks, the weights of the edges are the first-order proximity between vertices. If there is no edge observed between two vertices, the first-order proximity between them is 0. If two vertices are linked by an edge with high weight, they should be close in the embedding space. The objective function can be obtained by minimizing the distance between the joint probability distribution in the vector space and the empirical probability distribution, where the KL-divergence is used to calculate the distance. The objective function is shown in Eq. (2.1):
$O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j),$ (2.1)
where $p_1(v_i, v_j) = \frac{1}{1 + \exp(-u_i^{\top} u_j)}$, $u_i \in \mathbb{R}^d$ is the vector representation of $v_i$, and $w_{ij}$ is the edge weight.
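As a toy numerical check of Eq. (2.1), assuming NumPy and randomly initialized embeddings (all values below are placeholders), the objective can be evaluated edge by edge as follows.

```python
import numpy as np

# Toy evaluation of the first-order proximity objective in Eq. (2.1): each edge
# (i, j, w_ij) contributes -w_ij * log(sigmoid(u_i . u_j)).
U = np.random.randn(4, 8)                    # 4 vertices, d = 8 (random placeholder)
edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 0.5)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

O1 = -sum(w * np.log(sigmoid(U[i] @ U[j])) for i, j, w in edges)
print(O1)
```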
Second-Order Proximity The second-order proximity reflects the number of shared neighbors between two vertices. If two vertices have similar neighbors, the representation vectors of those two nodes should be close in the embedding space even though there may be no direct edge between them. The objective function can be obtained in a similar way to the first-order proximity and is shown in Eq. (2.2):
$O_2 = -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i).$ (2.2)
2.2.2 Optimization
Negative Sampling Since the optimization usually requires a summation over the entire set of vertices, it is computationally expensive to apply to large-scale networks. Therefore,
negative sampling [62] is used to address this problem. Negative sampling can help dis-
tinguish the neighbors from other nodes by sampling multiple negative edges according
to the noise distribution.
Edge Sampling A problem will occur during the training stage if the difference
between weights is significant. When the weight differences between edges are large, it
is hard to choose an appropriate learning rate during the optimization. A solution to this
problem is called edge sampling. Edge sampling unfolds weighted edges into several
binary edges. However, the memory consumption will accordingly increase. Therefore,
instead of unfolding the weighted edges, one can treat the weighted edges as binary
edges with the sampling probabilities proportional to the original weights. This treat-
ment would not modify the objective function.
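A minimal sketch of this sampling scheme, using Python's standard library and a hypothetical weighted edge list, is shown below: weighted edges are treated as binary edges and drawn with probability proportional to their weights.

```python
import random

# Minimal sketch of edge sampling: treat weighted edges as binary edges and
# sample them with probability proportional to their weights, so that a single
# learning rate works for all updates. Edge list and weights are hypothetical.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
weights = [5.0, 1.0, 0.5]

def sample_edges(num_samples):
    # random.choices draws with replacement, proportional to the given weights.
    return random.choices(edges, weights=weights, k=num_samples)

print(sample_edges(4))
```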
2.2.3 Graph Input
Graph embedding methods take a graph as the input. The graph can be homogeneous, heterogeneous, with or without auxiliary information, or a constructed graph [9]. A homogeneous graph is a graph in which both nodes and edges belong to a single type, so all nodes and edges are treated equally, and only the basic structural information of the input graph is provided. These graphs can be directed or undirected; most social network graphs are directed graphs [88]. Heterogeneous graphs mainly exist in community-based question answering (cQA) sites, multimedia networks, and knowledge graphs. These graphs contain different kinds of edges that represent different relations among different entities or categories. Graphs with auxiliary information are graphs that carry labels, attributes, node features, information propagation records, etc. A label refers to the category that a node falls into; nodes with different labels should be embedded further away from each other than nodes with the same label. An attribute is a discrete or continuous value that contains additional information about the graph beyond the structural information. Node features represent the text information attached to each node. An example of information propagation is post sharing or "retweeting", which indicates the dynamic interaction among nodes. Some popular knowledge bases are Wikipedia [104], DBpedia [6], Freebase [7], etc. Graphs constructed from non-relational data are assumed to lie on a low-dimensional manifold. The input feature matrix can be represented as $X \in \mathbb{R}^{|V| \times N}$, where each row $X_i$ is an $N$-dimensional feature vector for the $i$-th training instance. A similarity matrix $S$ is constructed by computing the similarity between $X_i$ and $X_j$.
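As a small illustration, assuming NumPy and cosine similarity as the similarity measure (the text does not fix a particular measure), the similarity matrix $S$ can be built from a placeholder feature matrix as follows.

```python
import numpy as np

# Sketch of building the similarity matrix S from a feature matrix X (one row
# per training instance), here with cosine similarity. X is a random placeholder.
X = np.random.randn(6, 10)                       # |V| = 6 instances, N = 10 features
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
S = X_norm @ X_norm.T                            # S[i, j] = cosine similarity of X_i, X_j
print(S.shape)                                   # (6, 6)
```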
2.2.4 Graph Output
Graph embedding outputs low-dimensional vectors representing the input graph. The output could be a node embedding, edge embedding, hybrid embedding or whole-graph embedding. Which kind of output is preferred is application-oriented and
task driven. Node embedding can represent each node as a vector which would be useful
for node clustering and classification. For node embedding, nodes that are close in the
graph are embedded closer together in the vector representations. Closeness can be first-
order proximity, second-order proximity or other similarity calculations. Edge embedding aims to map each edge into a low-dimensional vector. It is useful for predicting whether a link exists between two nodes in a graph and for knowledge graph entity/relation prediction.
Hybrid embedding is the combination of different types of graph components such
as node and edge embedding. Substructure embedding is studied for hybrid embedding.
Hybrid embedding is useful for semantic proximity search and subgraph learning, and
it can be used for graph classification based on the graph kernels. Substructure or com-
munity embedding can also be done by aggregating the individual node and edge embed-
ding inside it. Sometimes, better node embedding is learnt by incorporating the hybrid
embedding methods. Whole graph embedding is usually done for small graphs, such
as proteins and molecules. Such smaller graphs are represented as one vector and two
similar graphs are embedded closer to each other. Whole-graph embedding benefits the graph
classification task by providing a straightforward and efficient solution for calculating
graph similarities.
2.3 Graph Representation Learning Techniques
2.3.1 Classical Methods
Classical graph embedding methods aim to reduce high-dimensional graph data to a lower-dimensional representation while preserving the desired properties of the original data.
Principal Component Analysis (PCA) PCA computes the low-dimensional representation that maximizes the data variance. In mathematical terms, it first finds a linear transformation matrix $W^* \in \mathbb{R}^{D \times d}$ by solving
$W^* = \arg\max_W \operatorname{Tr}(W^{\top} \mathrm{Cov}(X) W),$ (2.3)
subject to $W^{\top} W = I$. PCA assumes that principal components with larger associated variances represent the most important structure information while those with lower variances represent noise. PCA also assumes that the principal components are orthogonal, which makes PCA solvable with eigendecomposition techniques.
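A minimal sketch of this procedure, assuming NumPy and a random placeholder data matrix, is given below; it computes the top-$d$ eigenvectors of the covariance matrix and projects the centered data onto them, matching Eq. (2.3).

```python
import numpy as np

# Minimal PCA sketch via eigendecomposition of the covariance matrix,
# following Eq. (2.3). X is a hypothetical data matrix with one sample per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 samples, D = 5 features
d = 2                                  # target dimension

Xc = X - X.mean(axis=0)                # center the data
cov = np.cov(Xc, rowvar=False)         # D x D covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov) # eigenvalues in ascending order

W = eigvecs[:, ::-1][:, :d]            # top-d principal directions (columns)
Z = Xc @ W                             # low-dimensional representation
print(Z.shape)                         # (100, 2)
```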
Linear Discriminant Analysis (LDA) LDA assumes that each class is Gaussian distributed, and the linear projection matrix $W \in \mathbb{R}^{D \times d}$ is obtained by maximizing the ratio between the inter-class scatter and the intra-class scatter. The maximization problem can be solved by eigendecomposition, and the reduced dimension $d$ can be obtained by detecting a prominent gap in the eigenvalue spectrum.
Multidimensional Scaling (MDS) MDS is a distance-preserving manifold learning method; some variants preserve spatial distances while others preserve graph distances. MDS takes a dissimilarity matrix $D$, where $D_{i,j}$ represents the dissimilarity between points $i$ and $j$, and produces a mapping onto a lower dimension that preserves the dissimilarities as closely as possible.
The three methods above are sometimes referred to as "subspace learning" [98] with a linearity assumption. However, the linear methods might fail if the data is structurally highly non-linear [76]. To solve this problem, non-linear dimensionality reduction (NLDR) is used for manifold learning that automatically learns the topology. Classical manifold learning methods include:
Isometric Feature Mapping (Isomap) Isomap finds a low-dimensional representation that most accurately preserves the pairwise geodesic distances between feature vectors at all scales, as measured along the submanifold from which they were sampled. Isomap first constructs a neighborhood graph on the manifold, then computes the shortest paths between pairwise points. Finally, it constructs a low-dimensional embedding by applying MDS.
Locally Linear Embedding (LLE) LLE preserves the local linear structure of nearby feature vectors. LLE first assigns neighbors to each data point. Then it computes the weights $W_{i,j}$ that best linearly reconstruct $X_i$ from its neighbors. Finally, it computes the low-dimensional embedding that is best reconstructed by $W_{i,j}$. Besides NLDR, kernel PCA is another dimension reduction technique that is comparable to Isomap and LLE.
Kernel Methods A kernel extension can be applied to algorithms that only need to compute the inner products of data pairs. After replacing the inner product with a kernel function, the data is mapped implicitly from the original input space to a higher-dimensional space, and a linear algorithm is then applied in the new feature space. The benefit of the kernel trick is that data that are not linearly separable in the original space could be separable in the new high-dimensional space. Kernel PCA is often used for NLDR with polynomial or Gaussian kernels.
2.3.2 Matrix Factorization based Methods
Matrix factorization based graph embedding is a structure-preserving dimensionality reduction process that factorizes the graph in matrix form to obtain the node embedding.
Graph Laplacian Eigenmaps Graph Laplacian Eigenmaps technique minimizes a
cost function to ensure that points close to each other on the manifold are mapped close
to each other in the low-dimensional space to preserve local distances.
Node Proximity Matrix Factorization This method approximates the node proximity in a low-dimensional space using matrix factorization by minimizing the following objective function:
$\min \, | W - Y Y_c^{\top} |,$ (2.4)
where $W$ is the node proximity matrix, $Y$ is the node embedding, and $Y_c$ is the embedding of the context nodes.
TADW TADW [100] is an improved DeepWalk method for text data. TADW incorporates the text features of vertices into network representation learning using the matrix factorization method. Unlike the matrix factorization based DeepWalk, TADW factorizes $M$ into three matrices, $W \in \mathbb{R}^{k \times |V|}$, $H \in \mathbb{R}^{k \times f_t}$, and the text feature matrix $T \in \mathbb{R}^{f_t \times |V|}$, as shown in Eq. (2.5). In TADW, $W$ and $HT$ are concatenated as the representation for vertices:
$M = W^{\top} H T.$ (2.5)
HSCA Homophily, Structure, and Content Augmented Network Representation Learning (HSCA) is an improvement upon the TADW model which uses Skip-Gram and hierarchical softmax to learn a distributed word representation. The objective function for HSCA can be written as:
$\min_{W,H} \, \| M - W^{\top} H T \|_F^2 + \frac{\lambda}{2} \left( \| W \|_F^2 + \| H \|_F^2 \right) + \mu \left( R_1(W) + R_2(H) \right).$ (2.6)
In the above, the first term aims to minimize the matrix factorization error of DeepWalk. The second term imposes the low-rank constraint on $W$ and $H$ and uses $\lambda$ to control the trade-off. The last regularization term enforces the structural homophily between connected nodes in the network. The regularization term $R(W,H)$ makes connected nodes close to each other in the learned network representation. $R(W,H)$ is defined as:
$R(W,H) = \frac{1}{4} \sum_{i=1, j=1}^{|V|} A_{i,j} \left\| \begin{bmatrix} w_i \\ H t_i \end{bmatrix} - \begin{bmatrix} w_j \\ H t_j \end{bmatrix} \right\|_2^2.$ (2.7)
In the equations above, $\| \cdot \|_2$ is the $\ell_2$ norm and $\| \cdot \|_F$ is the matrix Frobenius norm. The algorithm for finding $W$ and $H$ is given in Algorithm 1, where Conjugate Gradient (CG) [64] is used for updating $W$ and $H$.
2.3.3 Random Walk Based Methods
DeepWalk DeepWalk is one of the most widely used network representation learning methods for graph embedding. In DeepWalk, a target vertex $v_i$ is said to belong to a sequence $S = \{v_1, \ldots, v_{|S|}\}$ sampled from a random walk if $v_i$ can reach any vertex in $S$ within a certain number of steps. The set of vertices $V_s = \{v_{i-t}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+t}\}$ is the context of center vertex $v_i$ with a window size of $t$. DeepWalk aims to maximize the average logarithmic probability of all vertex-context pairs in a random walk sequence $S$ using the following equation:
$\frac{1}{|S|} \sum_{i=1}^{|S|} \sum_{-t \leq j \leq t,\, j \neq 0} \log p(v_{i+j} \mid v_i),$ (2.8)
where $p(v_j \mid v_i)$ is calculated using the softmax function. It is proven that DeepWalk is equivalent to factoring a matrix $M \in \mathbb{R}^{|V| \times |V|}$ [101] via
$M = W^{\top} H,$ (2.9)
where each entry $M_{ij}$ is the logarithm of the average probability that vertex $v_i$ can reach vertex $v_j$ in a fixed number of steps, $W \in \mathbb{R}^{k \times |V|}$ is the vertex representation for matrix factorization, and the information in $H \in \mathbb{R}^{k \times |V|}$ is rarely utilized in the classical DeepWalk model.
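As a minimal illustration of the sampling step described above (a sketch on a hypothetical toy graph, not the GRLL implementation), the code below generates truncated random walks and feeds them to a skip-gram model; it assumes gensim version 4.0 or later is available for the Word2Vec call.

```python
import random
from collections import defaultdict
from gensim.models import Word2Vec

# Sketch of DeepWalk's sampling step: truncated random walks are generated
# from each vertex and then fed to a skip-gram model (cf. Eq. 2.8).
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]          # toy edge list
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

def random_walk(start, length):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(adj[walk[-1]]))
    return [str(v) for v in walk]                  # gensim expects token strings

walks = [random_walk(v, length=10) for v in list(adj) for _ in range(5)]

model = Word2Vec(walks, vector_size=32, window=5, sg=1, min_count=0)
print(model.wv["0"].shape)                         # 32-dimensional embedding of vertex 0
```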
GenVector GenVector leverages large-scale unlabeled data to learn a large-scale social knowledge graph with a multi-modal Bayesian embedding model that uses weak supervision based on unsupervised techniques. It uses latent discrete topic variables to generate continuous word embeddings and network-based user embeddings, which combines the advantages of topic models and word embeddings.
2.3.4 Machine Learning Methods
Machine learning (ML) based graph embedding applies neural network models to graphs. The input is either paths sampled from a graph or the whole graph itself. Neural network models such as the convolutional neural network (CNN) and its variants have been widely adopted in graph embedding. Some machine learning methods directly use the original CNN model designed for Euclidean domains and reformat input graphs to fit it, while other methods generalize the deep neural model to non-Euclidean graphs. There are also other types of deep learning based graph embedding methods.
GCN The Graph Convolutional Network (GCN) [17] allows end-to-end learning of graphs with arbitrary size and shape. This model uses a convolution operator on the graph and iteratively aggregates the embeddings of neighbors for nodes. The approach is widely used for semi-supervised learning on graph-structured data and is based on an efficient variant of convolutional neural networks that operates directly on graphs. The model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and node features.
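To make the aggregation idea concrete, the following NumPy sketch applies one layer of the commonly used symmetric-normalized GCN propagation rule, $H' = \mathrm{ReLU}(D^{-1/2}(A+I)D^{-1/2} H W)$, to a hypothetical toy graph with random features and weights; it is an illustration, not the cited implementation.

```python
import numpy as np

# Minimal sketch of one GCN propagation layer on a toy graph.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # adjacency matrix
H = np.random.randn(3, 4)                # input node features (3 nodes, 4 dims)
W = np.random.randn(4, 2)                # trainable layer weights (placeholder)

A_hat = A + np.eye(3)                    # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt # symmetric normalization

H_next = np.maximum(0, A_norm @ H @ W)   # aggregate neighbors, transform, ReLU
print(H_next.shape)                      # (3, 2)
```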
VGAE The Variational Graph Auto-Encoder (VGAE) [45] uses a GCN as the encoder and an inner-product decoder to embed graphs. An autoencoder tries to minimize the reconstruction error between the input and output using an encoder and a decoder in which many non-linear transformations are used. The encoder maps the input data to a representation space, which is further mapped to a reconstruction space so as to preserve neighbourhood information.
2.3.5 Large Graph Embedding Method
GPNN Graph partition neural networks (GPNN) extend graph neural networks (GNNs) to embed extremely large graphs. GPNNs alternate between propagating information locally within subgraphs and globally between the subgraphs. A modified multi-seed flood fill is used in GPNN for fast partitioning of large-scale graphs.
LINE LINE [89] is suitable to deal with arbitrary types of information networks, such
as undirected, directed, and/or weighted networks. While most of the embedding meth-
ods tend to only preserve either the first-order proximity or the second-order proximity
in the original networks, LINE constructs vector spaces that can reflect the local and
global structures of the original networks. By utilizing negative sampling to reduce
the complexity of optimization, LINE becomes an efficient network embedding model
that is especially useful for embedding networks with millions of nodes and billions of
edges. Its main contributions lie in its scalability to large networks and the effectiveness and
efficiency of the inference. The model is firstly trained to preserve the first-order prox-
imity and second-order proximity separately. The first order proximity can reflect the
local structure of a network, while the second-order proximity reflects the global struc-
ture of a network. Merging two embeddings will generate a vector space that can better
represent the original network. A practical way to merge two embeddings together is to
simply concatenate the embedding vectors trained by two different objective functions
for each vertex.
2.3.6 Other Emerging Method
Graph GAN GraphGAN [91] uses Generative Adversarial Nets (GAN) [32] to model the connectivity probability among vertices in a graph. Similar to GAN, GraphGAN adopts a generator $G(v \mid v_c)$ to fit the distribution of vertices connected to a target vertex $v_c$ and to generate the vertices that are most likely connected to it. A discriminator $D(v, v_c)$ outputs the edge probability between $v$ and $v_c$ to differentiate the vertex pairs generated by the generator from the ground truth. The final vertex representation is found using a minimax equation.
2.4 Evaluation Methods
2.4.1 Vertex Classification
Vertex classification is one of the most important tasks in graph representation learning
and it is an effective evaluation method used for natural language processing (NLP).
This task aims to assign a class label to each node in a graph based on information
learned from other labelled nodes. Intuitively, similar nodes should have the same label.
For example, in citation networks, similar publications may be labeled with the same topic. In social networks, similar individuals might have the same interests or political beliefs. The
classification is conducted by applying a classifier, such as Support Vector Machine
(SVM) [31], logistic regression [92], k-nearest neighbor [49] etc. Graph embedding
methods embed each node into a low-dimensional vector. Given the vector, the trained
classifier can predict the label that a vertex belongs to. The vertex representations can be learned in either an unsupervised or a semi-supervised manner. Node clustering is an unsupervised method that aims to group similar nodes together; it is useful when labels are unavailable. Semi-supervised methods can be used when labeled data is available. For binary classification, the F1 score is usually used for evaluation. For multi-class classification, the Micro-F1 score is used instead. Various studies have demonstrated that
accurate vertex representations can contribute to high classification accuracy. Therefore,
vertex classification is commonly used to measure the performance of different network
embedding methods.
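As a sketch of this evaluation protocol (with random placeholder embeddings and labels rather than real graph data, and assuming scikit-learn is available), one can train a simple classifier on the learned vectors and report Micro- and Macro-F1 as follows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Sketch of the vertex classification protocol: learned embeddings are fed to a
# simple classifier and scored with Micro/Macro-F1. The embeddings and labels
# here are random placeholders standing in for a real graph embedding.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 64))          # 200 vertices, 64-dim embeddings
y = rng.integers(0, 5, size=200)        # 5 hypothetical classes

Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
pred = clf.predict(Z_te)

print("Micro-F1:", f1_score(y_te, pred, average="micro"))
print("Macro-F1:", f1_score(y_te, pred, average="macro"))
```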
2.4.2 Link Prediction
Link prediction [26] is another important application for graph representation learning.
This application aims to infer the existence of relationships or interactions among pairs
of vertices in the network. The learned representation for the graph should help infer the
structure of the graph, especially when there are missing links. For example, in social
networks, links might be missing between two users who should otherwise be con-
nected. In this scenario, link prediction can be used to recommend friendships. Learned
network representation should preserve the network proximity and structural similarity
among vertices. The information encoded in the vector representation for each vertex
can be used to predict missing links in incomplete networks. The performance for link
prediction is usually measured by the area under the ROC curve (AUC). Better network
representations should be able to capture the connections among network vertices bet-
ter, therefore enabling better application to link prediction. Numerous studies have demonstrated that, on these networks, links predicted using the learned representations
achieve better performance than traditional similarity-based link prediction approaches.
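A minimal sketch of this evaluation, assuming scikit-learn and placeholder embeddings, scores candidate vertex pairs by the inner product of their embeddings and reports AUC and average precision against held-out edges.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Sketch of link prediction evaluation: score candidate vertex pairs by the
# inner product of their embeddings. Embeddings, positive pairs (held-out true
# edges) and negative pairs (sampled non-edges) are hypothetical placeholders.
Z = np.random.randn(50, 16)                      # 50 vertices, 16-dim embeddings
pos = [(0, 1), (2, 3), (4, 5)]
neg = [(0, 9), (7, 8), (6, 2)]

pairs = pos + neg
labels = [1] * len(pos) + [0] * len(neg)
scores = [Z[i] @ Z[j] for i, j in pairs]

print("AUC:", roc_auc_score(labels, scores))
print("AP:", average_precision_score(labels, scores))
```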
2.5 Experiments of Small and Large Data Sets
In our research, we ran experiments for vertex classification and link prediction on both small and large data sets. An example of a large network is Wikipedia, a word co-occurrence network constructed from the entire set of English Wikipedia pages; there is no class label on this network. The smaller data sets are citation networks, i.e., directed information networks formed by author-author or paper-paper citation relationships. They are collected from different databases of academic papers, such as Cora and Citeseer. Cora, Citeseer and PubMed are paper citation networks with binary vertex text attributes that describe the content of the papers. The details of the data sets are described in Sections 2.5.1 and 2.5.2 and in Table 2.1.
2.5.1 Small Data set
Citeseer: It is a citation indexing dataset consisting of academic literature from six different categories. It contains 3,312 documents and 4,723 links. Each document is represented by a binary vector of 3,703 dimensions.
Cora: It consists of 2,708 scientific publications of seven different classes. The
network has 5,429 links that indicate citation relations among all documents. Each document is represented by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary consisting of 1,433
unique words. As a result, each document is represented by a binary vector of
1,433 dimensions.
WebKB: It contains seven classes of web pages collected from computer science
departments, including student, faculty, course, project, department, staff and oth-
ers. It has 877 web pages and 1,608 hyper-links between web pages. Each page is
represented by a binary vector of 1,703 dimensions.
KARATE: Zachary's karate network is a well-known social network of a university karate club. It has been
widely studied in social network analysis. The network has 34 nodes, 78 edges
and 2 communities.
2.5.2 Large Dataset
BLOGCATALOG: is a network of social relationships of the bloggers listed on
the BlogCatalog website. The labels represent blogger interests inferred through
the meta-data provided by the bloggers. The network has 10,312 nodes, 333,983
edges and 39 different labels.
YOUTUBE: is a social network of Youtube users. This is a large network contain-
ing 1,157,827 nodes and 4,945,382 edges. The labels represent groups of users
who enjoy common video genres.
Facebook: is a set of postings collected from the Facebook social network website.
This data contains 4039 nodes, 88243 edges and no labels. It is used for link
prediction tasks.
Flickr: is an online photo management and sharing application. This data contains
80513 nodes, 5899882 edges and 195 labels.
Wikipedia: is an online encyclopedia created and edited by volunteers around the
world. This data contains 2405 nodes, 17981 edges and 19 labels.
Dataset Nodes Edges Labels
Youtube 1138499 2990443 47
cora 2708 5429 7
blogcatalog 10312 333983 39
Facebook 4039 88243 N/A
Flickr 80513 5899882 195
Wiki 2405 17981 19
Table 2.1: Summary of Data Set Used
2.5.3 Experimental Results
To our knowledge, most empirical evaluations are carried out on different data sets under different settings. It is difficult to draw conclusions as to which algorithm is the best due to the lack of consistency among different benchmarking methods. We performed experiments to compare several representative graph embedding methods. For each data set, we ran the different embedding methods that are publicly available online and compared their performance. The results are shown in Tables 2.2 through 2.9.
micro macro samples weighted
DeepWalk 0.293 0.206 0.293 0.246
node2vec 0.301 0.221 0.301 0.259
LINE 0.266 0.17 0.266 0.209
Table 2.2: Youtube Micro and Macro F1 score, samples and weighted result
micro macro samples weighted
DeepWalk 0.313 0.212 0.313 0.265
node2vec 0.311 0.203 0.300 0.265
LINE 0.289 0.162 0.289 0.237
Table 2.3: Flickr Micro and Macro F1 score, samples and weighted result
micro macro samples weighted
GraRep 0.65 0.51 0.65 0.64
DeepWalk 0.67 0.56 0.67 0.67
node2vec 0.68 0.54 0.68 0.67
LINE 0.52 0.36 0.52 0.5
GraphGAN (DeepWalk pretrained) 0.7 0.57 0.7 0.7
Table 2.4: Wiki Micro and Macro F1 score, samples and weighted result
micro F1 Macro F1
DeepWalk 0.393 0.247
LINE 0.356 0.194
node2vec 0.4 0.25
GraRep 0.393 0.23
HOPE 0.308 0.143
GF 0.236 0.072
SDNE 0.148 0.041
GraphGAN (node2vec pre-trained) 0.37 0.19
Table 2.5: BlogCatalog Micro and Macro F1 score
GCN 0.801
node2vec 0.803
DeepWalk 0.829
LINE 0.432
GraRep 0.788
GF 0.499
HOPE 0.646
SDNE 0.573
Table 2.6: Cora Accuracy using different methods
Average Precision 0.25 hidden 0.5 hidden 0.75 hidden
spectral clus 0.984 0.985 0.982
node2vec 0.988 0.984 0.984
Table 2.7: Facebook Link Prediction Average Precision
TP TN FP FN ROC AP
deepwalk 2536 3383 2415 3262 0.51 0.51
grarep 1902 3971 1827 3896 0.5 0.51
vgae 5710 3122 2676 88 0.98 0.98
line 2904 2966 2832 2894 0.51 0.51
Table 2.8: Wiki Link Prediction True Positive, True Negative, False Positive, False
Negative, Area Under Curve, Average Precision Result
DeepWalk node2vec LINE
YouTube 37366 41626.49 185153.29
Flickr 3636.14 40779.22 31707.87
Wiki 37.23 27.52 79.42
Table 2.9: Time Complexity Comparison for Youtube, Flickr, and Wiki Data
2.6 Applications
2.6.1 Community Detection
Graph embedding can be used to predict the label of a given node given a fraction
of labeled nodes. In social network graphs, labels might be gender, demographics, or
beliefs. In language networks, documents might be labeled with categories or keywords.
Missing labels can be inferred using the labeled nodes and the links in the network.
Graph embeddings can be interpreted as automatically extracted node features based
on the network structure and can thus be used to predict which community a
node belongs to. Both vertex classification and link prediction can facilitate community
detection tasks as described in 2.4.1 and 2.4.2.
2.6.2 Recommendation System
Recommendation is useful for social network and advertising platforms. Besides structure, content and label information, many networks include spatial and temporal information that can be useful. For example, Facebook can recommend user-related content to each individual, and Yelp can recommend restaurants based on a user's location and preferences. Spatial-temporal embedding [106] is an emerging graph representation learning task that leads to more real-time graph embedding applications.
2.6.3 Graph Compression
Network compression (graph simplification) refers to the process of converting a graph $G$ to $G'$ that has a smaller number of edges. This application aims to store the network more efficiently and to run graph analysis algorithms faster. The graph is partitioned into bipartite cliques that are replaced by trees to reduce the number of edges.
2.7 Discussion
From the experimental results, we can see that node2vec has the best performance for the Youtube, Wiki and BlogCatalog data. DeepWalk has the best performance for the Flickr, BlogCatalog and Cora data. The Facebook link prediction accuracy is almost the same for spectral
clustering and node2vec.
Robustness to the change of the embedding dimension is a critical indicator of a good graph embedding method. As the embedding dimension decreases, less information from the original network is preserved. Therefore, representations with lower dimensions mostly yield weaker performance in several applications, such as link prediction, node classification, or clustering. However, some methods are more stable to the change of the embedding size, while other methods are affected significantly. We summarize the robustness to the change of the embedding size for different types of embedding methods by conducting an experiment on node classification with the Wiki dataset.
The wiki dataset is selected in this experiment. The network has 2405 nodes, 17981
edges, and 19 classes of labels. Two random-walk based methods: node2vec, Deep-
Walk, two structural preserving methods: LINE, GraRep, one neural network based
method: SDNE, and one graph factorization method: GF are tested in this experiment.
The most common setting of the embedding dimension is 128; the use of embedding vectors with more than 128 dimensions is rare and not useful in current research. Therefore, the embedding dimension is varied between 4 and 128. The experimental results are plotted in Fig. 2.1. As we can tell from the plot, the random-walk based representation learning methods (node2vec and DeepWalk) are highly immune to the change of embedding size; there is only about a 20% decay while the embedding dimension shrinks 32 times. However, the structure preserving methods (LINE and GraRep) and the graph factorization method (GF) have a hard time dealing with the decreasing embed-
ding size. The performance drops as much as 45% when the embedding size changes
from 128 to 4. A possible explanation is that since the structural preserving methods
optimized the representation vectors in the embedding space, a small information loss
will result in a big difference in the embedding space. Random-walk based methods,
they obtained the embedding vectors by randomly select paths from the original net-
work, the relationship between nodes are still preserved when the embedding dimension
becomes small. As for the neural network based method (SDNE), by adopting the auto-
encoder architecture, the embeddings between different hidden layers are alike. The
information is successfully preserved between each layer. Thus, the performance almost
remains the same regardless of the embedding dimension.
Figure 2.1: Embedding Dimension vs. Node Classification Accuracy
Chapter 3
Graph Representation via Deep-Tree
Recursive Neural Networks (DTRNN)
3.1 Introduction
A novel graph-to-tree conversion mechanism called the deep-tree generation (DTG)
algorithm is first proposed to predict text data represented by graphs. The DTG method
can generate a richer and more accurate representation for nodes (or vertices) in graphs.
It adds flexibility in exploring the vertex neighborhood information to better reflect the
second order proximity and homophily equivalence in a graph. Then, a Deep-Tree
Recursive Neural Network (DTRNN) method is presented and used to classify vertices that contain text data in graphs. To demonstrate the effectiveness of the DTRNN
method, we apply it to three real-world graph datasets and show that the DTRNN
method outperforms several state-of-the-art benchmarking methods.
Research on natural languages in graph representation has gained more interest
because many speech/text data in social networks and other multi-media domains can
be well represented by graphs. These data often come in high-dimensional irregular
form which makes them more difficult to analyze than the traditional low-dimensional
corpora data. Node (or vertex) prediction is one of the most important tasks in graph
analysis. Predicting tasks for nodes in a graph deal with assigning labels to each vertex
based on vertex contents as well as link structures. Researchers have proposed different
techniques to solve this problem and obtained promising results using various machine
learning methods. However, research on generating an effective tree structure that best captures the connectivity and density of nodes in a network has not yet been extensively conducted.
In our proposed architecture, the input text data come in the form of graphs. Graph features are first extracted and converted to tree-structured data using our deep-tree generation (DTG) algorithm. Then, the data are trained and classified using the deep-tree recursive neural network (DTRNN). The process generates a class prediction for each node in the graph as the output. The workflow of the DTRNN algorithm is shown in Figure 3.1.
There are two major contributions of this work. First, we propose a graph-to-tree
conversion mechanism and call it the DTG algorithm. The DTG algorithm captures
the structure of the original graph well, especially on its second order proximity. The
second-order proximity between vertices is not only determined by observed direct con-
nections but also shared neighborhood structures of vertices [88]. To put it another way,
nodes with shared neighbors are likely to be similar. Next, we present the DTRNN
method that brings the merits of the Long Short-Term Memory (LSTM) network [41]
and the deep tree representation together. The proposed DTRNN method not only preserves link features better but also accounts for the greater impact of nodes with more outgoing and incoming edges. It extends the tree-structured RNN and models the long-
distance vertex relation on more representative sub-graphs to offer the state-of-the-art
performance as demonstrated in our conducted experiments. An in-depth analysis on
the impact of the attention mechanism and the runtime complexity of our method is also provided.
The rest of this paper is organized as follows. Related previous work is reviewed in
Sec. 3.2. Both the DTRNN algorithm and the DTG algorithm are described in Sec. 3.3.
The impact of the attention model is discussed in Sec. 3.4. The experimental results and
Figure 3.1: The workflow of the DTRNN algorithm.
their discussion are provided in Sec. 3.5. Finally, concluding remarks are given in Sec.
3.6.
3.2 Review of Related Work
Structures in social networks are non-linear in nature. Network structure understanding
can benefit from modern machine learning techniques such as embedding and recursive
models. Recent studies, such as DeepWalk [70] and node2vec [36], aim at embedding
large social networks to a low-dimensional space. For example, the Text-Associated
DeepWalk (TADW) method [102] uses matrix factorization to generate structural and
vertex feature representation. However, these methods do not fully exploit the label
information in the representation learning. As a result, they might not offer the optimal
result.
Another approach to network structure analysis is to leverage the recursive neural
network (RNN). The Recursive Neural Tensor Network (RNTN) [81] was demonstrated
to be effective in training non-linear data structures. The Graph-based Recurrent Neural
Network (GRNN) [95] utilizes the RNTN based on local sub-graphs generated from the
original network structure. These sub-graphs are generated via breadth-first search with
a depth of at most two. Later, the GRNN is improved by adding an attention layer in
the Attention Graph-based Recursive Neural Network (AGRNN) [94]. Motivated by the
GRNN and AGRNN models, we propose a new solution in this work, called the Deep-
tree Recursive Neural Network (DTRNN), to improve the node prediction performance further.
Figure 3.2: (a) The graph to be converted into a tree, (b) the tree converted to a sub-
graph using breadth-first search, (c) the tree converted using the deep-tree generation
method (our method), and (d) the DTRNN constructed from the tree with LSTM units.
3.3 Proposed Methodology
3.3.1 Deep-Tree Recursive Neural Network (DTRNN) Algorithm
A graph denoted by $G = (V, E)$ consists of a set of vertices, $V = \{v_1, v_2, \ldots, v_n\}$, and a set of edges, $E = \{e_{i,j}\}$, where edge $e_{i,j}$ connects vertex $v_i$ to vertex $v_j$. Let $X_i = \{x_1, x_2, \ldots, x_n\}$ be the feature vector associated with vertex $v_i$, $l_i$ be the label or class that $v_i$ is assigned to, and $L$ the set of all labels. The node prediction task attempts to find an appropriate label for any vertex $v_t$ so that the probability of a given vertex $v_k$ with label $l_k$ is maximized. Mathematically, we have
$$\hat{l}_k = \arg\max_{l_k \in L} P_\theta(l_k \mid v_k, G). \qquad (3.1)$$
A softmax classifier is used to predict the label $l_k$ of the target vertex $v_k$ using its hidden state $h_k$:
$$P_\theta(l_k \mid v_k, G) = \mathrm{softmax}(W_s h_k + b_s), \qquad (3.2)$$
where $\theta$ denotes model parameters. Typically, the negative log likelihood criterion is used as the cost function. For a network of $N$ vertices, its cross-entropy is defined as
$$J(\theta) = -\frac{1}{N} \sum_{k=1}^{N} \log P_\theta(l_k \mid v_k, G). \qquad (3.3)$$
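The following minimal numpy sketch illustrates Eqs. (3.2)-(3.3) under assumed toy dimensions (the sizes, the random hidden states and the labels below are hypothetical, not values from this dissertation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: 8 labels, 200-dimensional hidden states, 16 vertices.
num_labels, hidden_dim, N = 8, 200, 16
rng = np.random.default_rng(0)
W_s = rng.normal(size=(num_labels, hidden_dim))
b_s = np.zeros(num_labels)

h = rng.normal(size=(N, hidden_dim))     # hidden states h_k, used in Eq. (3.2)
labels = rng.integers(0, num_labels, N)  # ground-truth labels l_k

# Cross-entropy of Eq. (3.3): average negative log-likelihood over all vertices.
probs = np.array([softmax(W_s @ h[k] + b_s) for k in range(N)])
J = -np.mean(np.log(probs[np.arange(N), labels]))
print(J)
```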
To solve the graph node classification problem, we use the Child-Sum Tree-LSTM [87] data structure to represent the node and link information in a graph. Based on the input vectors of the target vertex's child nodes, the Tree-LSTM generates a vector representation for each target node in the dependency tree. Like the standard LSTM, each node $v_k$ has a forget gate, denoted by $f_{kr}$, to control the amount of memory flow from $v_k$ to $v_r$; input and output gates $i_k$ and $o_k$, where each of these gates acts as a neuron in the feed-forward neural network and $f_{kr}$, $i_k$ and $o_k$ represent the activations of the forget, input and output gates, respectively; a hidden state $h_k$ for the representation of the node (the output vector of the LSTM unit); and a memory cell $c_k$ that indicates the cell state vector. Each child takes as input $x_k$, the vector representation of the child node.
As a result, the DTRNN method can be summarized as:
$$\hat{h}_k = \max\{h_r\}, \qquad (3.4a)$$
$$f_{kr} = \sigma(W_f x_k + U_f h_r + b_f), \qquad (3.4b)$$
$$i_k = \sigma(W_i x_k + U_i \hat{h}_k + b_i), \qquad (3.4c)$$
$$o_k = \sigma(W_o x_k + U_o \hat{h}_k + b_o), \qquad (3.4d)$$
$$u_k = \tanh(W_u x_k + U_u \hat{h}_k + b_u), \qquad (3.4e)$$
$$c_k = i_k \odot u_k + \sum_{v_r \in C(v_k)} f_{kr} \odot c_r, \qquad (3.4f)$$
$$h_k = o_k \odot \tanh(c_k). \qquad (3.4g)$$
In the equations above, $\odot$ and $\sigma$ denote element-wise multiplication and the sigmoid function; $W_f$, $W_i$ and $W_o$ are the weights between the input layer and the forget gate, the input gate and the output gate; $U_f$, $U_i$ and $U_o$ are the weights between the hidden recurrent layer and the forget gate, the input gate and the output gate of the memory block; and $b_f$, $b_i$ and $b_o$ are the additive biases of the forget gate, the input gate and the output gate, respectively.
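A minimal numpy sketch of one node update following Eqs. (3.4a)-(3.4g) is given below; the helper name dtrnn_cell, the parameter dictionary and the toy dimensions are assumptions made for illustration, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dtrnn_cell(x_k, child_h, child_c, p):
    """One DTRNN/Tree-LSTM node update following Eqs. (3.4a)-(3.4g).
    child_h, child_c: lists of child hidden states / cell states."""
    h_hat = np.max(np.stack(child_h), axis=0)                       # (3.4a) max pooling
    i = sigmoid(p["W_i"] @ x_k + p["U_i"] @ h_hat + p["b_i"])       # (3.4c)
    o = sigmoid(p["W_o"] @ x_k + p["U_o"] @ h_hat + p["b_o"])       # (3.4d)
    u = np.tanh(p["W_u"] @ x_k + p["U_u"] @ h_hat + p["b_u"])       # (3.4e)
    c = i * u                                                       # first term of (3.4f)
    for h_r, c_r in zip(child_h, child_c):
        f_r = sigmoid(p["W_f"] @ x_k + p["U_f"] @ h_r + p["b_f"])   # (3.4b), one per child
        c += f_r * c_r                                              # sum in (3.4f)
    h = o * np.tanh(c)                                              # (3.4g)
    return h, c

# Toy dimensions (hypothetical): 32-dim input, 64-dim hidden state, 3 children.
d_in, d_h = 32, 64
rng = np.random.default_rng(1)
p = {f"W_{g}": rng.normal(scale=0.1, size=(d_h, d_in)) for g in "fiou"}
p.update({f"U_{g}": rng.normal(scale=0.1, size=(d_h, d_h)) for g in "fiou"})
p.update({f"b_{g}": np.zeros(d_h) for g in "fiou"})

x_k = rng.normal(size=d_in)
child_h = [rng.normal(size=d_h) for _ in range(3)]
child_c = [rng.normal(size=d_h) for _ in range(3)]
h_k, c_k = dtrnn_cell(x_k, child_h, child_c, p)
```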
The DTRNN is trained with backpropagation through time [93]. The model parameters are randomly initialized. In the training process, the weights are updated after the input has been propagated forward in the network. The error is calculated using the negative log likelihood criterion.
3.3.2 Deep-Tree Generation (DTG) Algorithm
In [58], a graph was converted to a tree using a breadth-first search algorithm with a
maximum depth of two. However, it fails to capture long-range dependency in the graph
so that the long short-term memory in the Tree-LSTM structure cannot be fully utilized.
Algorithm 1 Deep-Tree Generation Algorithm
Input: G, u_i, maxCount
TreeGeneration(Graph G, Node u_i, maxCount)
  Initialize the walk to a queue Q = [u_i]
  while Q is not empty and totalNode < maxCount do
    v = Q.pop()
    if v has children then
      for w in G.outVertex(v) do
        add w as the child of v
        Q.push(w)
      end for
    end if
  end while
  return T
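A minimal Python rendering of Algorithm 1 is sketched below; the dictionary-of-out-neighbors graph format, the function name and the toy graph are assumptions made for illustration:

```python
from collections import deque

def deep_tree_generation(graph, root, max_count):
    """Sketch of the DTG idea: starting from the target/root vertex,
    repeatedly pop a vertex and attach its outgoing neighbors as children
    until the node budget `max_count` is reached.  `graph` maps a vertex
    to its list of out-neighbors; the tree is returned as a list of
    (parent, child) edges, so a vertex may appear several times,
    bounded by its in-/out-degree."""
    tree_edges = []
    queue = deque([root])
    total_nodes = 1
    while queue and total_nodes < max_count:
        v = queue.popleft()
        for w in graph.get(v, []):
            if total_nodes >= max_count:
                break
            tree_edges.append((v, w))   # add w as a child of v
            queue.append(w)
            total_nodes += 1
    return tree_edges

# Toy connectivity loosely following Figure 3.2(a) (hypothetical labels).
graph = {"v4": ["v1"], "v1": ["v2", "v4"], "v2": ["v6"], "v5": ["v6"]}
print(deep_tree_generation(graph, "v4", max_count=6))
```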
The main contribution of this work is to generate a deep-tree representation of a target
node in a graph. The generation starts at the target/root node. At each step, a new edge
and its associated node are added to the tree. The deep-tree generation strategy is given
in Algorithm 1. This process can be well explained using an example given in Figure
3.2.
Currently, the most common way to construct a tree is to traverse the graph using the breadth-first search (BFS) method. The BFS method starts at the tree root. It explores all immediate children nodes first before moving to the next level of nodes until the termination criterion is reached. For the graph given in Figure 3.2(a), it is clear that node $v_5$ is connected to $v_6$ via $e_{5,6}$, and the shortest distance from $v_4$ to $v_6$ is three hops, namely, through $e_{4,1}$, $e_{1,2}$ and $e_{2,6}$. For the BFS tree construction process shown in Figure 3.2(b), we see that such information is lost in the translation. On the other hand, if we construct a tree by incorporating the deepening depth-first search, which is a depth-limited version of the depth-first search [33], as shown in Algorithm 1, we are able to recover the connection from $v_5$ to $v_6$ and get the correct shortest hop count from $v_4$ to $v_6$, as shown in Figure 3.2(c). Apparently, the deep-tree construction strategy preserves the
original neighborhood information better. The maximum number of times a node can appear in a constructed tree is bounded by its total in- and out-degree. This is consistent with
our intuition that a node with more outgoing and incoming edges tends to have a higher
impact on its neighbors.
Figure 3.3: Comparison of four methods on three data sets (from left to right): Citeseer,
Cora and WebKB, where the x-axis is the percentage of training data and the y-axis is
the average Macro-F1 score.
Figure 3.4: Comparison of runtime for three data sets (from left to right): Citeseer, Cora
and WebKB, where the x-axis is the percentage of training data and the y-axis is the
runtime in seconds.
Figure 3.5: Performance comparison of DTRNN with and without the soft attention
layer (from left to right: Citeseer, Cora and WebKB), where the x-axis is the percentage
of training data and the y-axis is the average Macro-F1 score.
3.4 Impact of Attention Model
An attentive recursive neural network can be adapted from a regular recursive neural
network by adding an attention layer so that the new model focuses on the more relevant
input. The attentive neural network has demonstrated improved performance in machine
translation, image captioning, question answering and many other different machine
learning fields. The added attention layer might increase the classification accuracy because graph data often contain noise. Less relevant neighbors should have less impact on the target vertex than neighbors that are more closely related to it. Attention models have demonstrated improved accuracy in several applications.
In this work, we examine how the added attention layers could affect the results of
our model. In the experiment, we added an attention layer to see whether the attention
mechanism could help improve the proposed DTRNN method. The attention model
is taken from [94]; it aims to differentiate the contribution from a child vertex to a target vertex using a soft attention layer. It determines the attention weight, $\alpha_r$, using a parameter matrix denoted by $W_\alpha$. The matrix $W_\alpha$ is used to measure the relatedness of $x$ and $h_r$. It is learned by the gradient descent method in the training process. The softmax function is used to make the attention weights sum to 1. The aggregated hidden state of the target vertex is then obtained from the soft attention weights and the hidden states of the child vertices as
$$\alpha_r = \mathrm{Softmax}(x^{T} W_\alpha h_r), \qquad (3.5)$$
where
$$\hat{h}_r = \alpha_r h_r. \qquad (3.6)$$
Based on Eqs. (3.4a), (3.5) and (3.6), we can obtain
$$\hat{h}_k = \max\{\mathrm{Softmax}(x^{T} W_\alpha h_r)\, h_r\}. \qquad (3.7)$$
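A minimal numpy sketch of this soft attention layer, following Eqs. (3.5)-(3.7), is given below (the function name, the toy dimensions and the random values are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attentive_max_pool(x, child_h, W_alpha):
    """Soft attention over child hidden states, Eqs. (3.5)-(3.7):
    alpha_r = softmax(x^T W_alpha h_r), then max over alpha_r * h_r."""
    scores = np.array([x @ W_alpha @ h_r for h_r in child_h])
    alpha = softmax(scores)                                  # Eq. (3.5)
    weighted = [a * h_r for a, h_r in zip(alpha, child_h)]   # Eq. (3.6)
    return np.max(np.stack(weighted), axis=0)                # Eq. (3.7)

# Toy dimensions (hypothetical): 32-dim input, 64-dim hidden states, 5 children.
rng = np.random.default_rng(2)
x = rng.normal(size=32)
W_alpha = rng.normal(scale=0.1, size=(32, 64))
child_h = [rng.normal(size=64) for _ in range(5)]
h_hat = attentive_max_pool(x, child_h, W_alpha)
```

Note that with more children the softmax spreads the weights $\alpha_r$ more thinly, which is the vanishing effect discussed below.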
Although the attention model can improve the overall accuracy of a simple-tree model generated from a graph, its addition does not help but rather hurts the performance of the proposed deep-tree model. This could be attributed to several reasons.
It is obvious that $\alpha_r$ is bounded between 0 and 1 because of the softmax function. If a target root has more child nodes, $\alpha_r$ will be smaller and closer to zero. By comparing Figures 3.2(b) and (c), we see that nodes that are further apart will have vanishing impacts on each other under this attention model, since our trees tend to have longer paths. The performance comparison of DTRNN with and without attention is given in Figure 3.5. For Cora, we see that DTRNN without the attention layer outperforms the one with the attention layer by 1.8-3.7%. For Citeseer, DTRNN without the attention layer outperforms by 0.8-1.9%. For WebKB, the performance of the two is about the same.
Furthermore, this attention model pays close attention to the immediate neighbors of a target yet ignores the second-order proximity, which can be interpreted as nodes with shared neighbors being likely to be similar [88]. Prediction tasks on nodes in networks should take care of two types of similarity: (1) homophily and (2) structural equivalence [42]. The homophily hypothesis [103] states that nodes that are highly interconnected and belong to similar network clusters or communities should be similar to each other. The vanishing impact of the scaled $h_r$ tends to reduce these features in our graph. In the next section, we will show by experiments that the DTRNN method without the attention model outperforms both a tree generated by the traditional BFS method with an attention LSTM unit and the DTRNN method with the attention model.
3.5 Experiments
3.5.1 Datasets
To evaluate the performance of the proposed DTRNN method, we used the following
two citation and one website datasets in the experiment.
Citeseer: The Citeseer dataset is a citation indexing system that classifies aca-
demic literature into 6 categories [29]. This dataset consists of 3,312 scientific
publications and 4,723 citations.
Cora: The Cora dataset consists of 2,708 scientific publications classified into
seven classes [60]. This network has 5,429 links, where each link is represented
by a 0/1-valued word vector indicating the absence/presence of the corresponding word from a dictionary consisting of 1,433 unique words.
WebKB: The WebKB dataset consists of seven classes of web pages collected
from computer science departments: student, faculty, course, project, department,
staff and others [16]. It consists of 877 web pages and 1,608 hyper-links between
web pages.
3.5.2 Experimental Settings
These three datasets are split into training and testing sets with proportions varying from
70% to 90%. We ran 10 epochs on the training data and recorded the highest and the average Macro-F1 scores for items in the testing set.
3.5.3 Baselines
We implemented a DTRNN consisting of 200 hidden states and compared its performance with that of three benchmarking methods, which are described below.
Text-associated Deep Walk (TADW). It incorporates text features of vertices
under the matrix factorization framework [99] for vertex classification.
Graph-based LSTM (G-LSTM). It first builds a simple tree using the BFS only
traversal and, then, applies an LSTM to the tree for vertex classification [95].
Attentive Graph-based Recursive Neural Network (AGRNN). It improves upon the GRNN with a soft attention weight added to each attention unit, as depicted in Eqs. (3.5) and (3.6) [94].
3.5.4 Results and Analysis
The Macro-F1 scores of all four methods for the above-mentioned three datasets are compared in Figure 3.3. We see that the proposed DTRNN method consistently outperforms all benchmarking methods. When comparing the DTRNN with the AGRNN, which has the best performance among the three benchmarks, the DTRNN has a gain of up to 4.14%. The improvement is the greatest on the WebKB dataset. In the Cora and the Citeseer datasets, neighboring vertices tend to share the same label. In other words, labels are closely correlated among short-range neighbors. In the WebKB dataset, this short-range correlation is not as obvious, and some labels are strongly related to more than two labels [95]. Since our deep-tree generation strategy captures the long-distance relation among nodes, we see the largest improvement on this dataset.
3.5.5 Complexity Analysis
The graph-to-tree conversion is relatively fast. For both the breadth-first method and our method, the time complexity to generate the tree is $O(b^d)$, where $b$ is the maximum branching factor of the tree and $d$ is the depth. The DTRNN algorithm builds a longer tree with more depth. Thus, the tree construction and training take longer, yet the overall cost still grows linearly with the number of input nodes asymptotically.
The bottleneck of the experiments was the training process. During each training time step, the time complexity of updating a weight is $O(1)$. Then, the overall LSTM algorithm has an update complexity of $O(W)$ per time step, where $W$ is the number of weights [41] that need to be updated. In addition, the LSTM is local in space and time, meaning that the update complexity per time step and weight does not depend on the network size, and the storage requirement does not depend on the input sequence length [78]. For the whole training process, the run-time complexity is $O(W \cdot i \cdot e)$, where $i$ is the input length and $e$ is the number of epochs.
In our experiments, the input length is fixed per time step because the hidden states of the child vertices are represented by max pooling of all children's inputs. The number of epochs is fixed at 10. The actual running time for each dataset is recorded for the DTRNN method and the G-LSTM method. The results are shown in Figure 3.4. If attention layers are added as described in the earlier section, they come at a higher cost. The attention weights need to be calculated for each combination of child and target vertex. If we have $c$ children on average for $n$ target vertices, there will be $cn$ attention
values. The CPU runtime in Figure 3.4 shows that the DTRNN is faster than the AGRNN-d1 (with an attention model of depth equal to 1) by 20.59% for WebKB, 14.78% for Citeseer, and 21.06% for Cora, while having the highest classification accuracy among all three methods.
3.6 Conclusion
A novel strategy to convert a social citation graph to a deep tree and to conduct vertex classification was proposed in this work. It was demonstrated that the proposed deep-tree generation (DTG) algorithm can capture the neighborhood information of a node better than the traditional breadth-first search tree generation method. Experimental results on three citation datasets with different training ratios proved the effectiveness of the proposed DTRNN method. That is, our DTRNN method offers state-of-the-art classification accuracy for graph-structured text.
We also trained graph data in the DTRNN with more complex attention models added, yet the attention models do not yield better accuracy because our DTRNN algorithm alone already captures more features of each node. The complexity of the proposed method was analyzed. We considered both asymptotic run time and real CPU run-time and showed that our algorithm is not only the most accurate but also very efficient. In the near future, we would like to apply the proposed methodology to graphs of a larger scale and higher diversity, such as social network data. Furthermore, we will look for a better way to exploit the attention model, although it does not help much in our current implementation.
Chapter 4
Graph Representation via
DeepWalk-assisted Graph Principal
Component Analysis (DGPCA)
4.1 Introduction
Language network learning is an important task with many applications such as text
classification, link prediction and community detection. One of the challenges in
this domain is finding an efficient way to learn and encode network structure into a
low dimensional embedding. In this paper, a novel DeepWalk-assisted Graph PCA
(DGPCA) method is proposed for processing language network data represented by
graphs. This method can generate a precise text representation for nodes (or vertices)
in language networks. Unlike other existing work, our learned low dimensional vector
representations adds flexibility in exploring vertices’ neighborhood information, while
reducing noise contained in the original data. To demonstrate the effectiveness, we use
DGPCA to classify vertices that contain text information in three language networks.
Experimentally, DGPCA is shown to perform well on the language datasets in compar-
ison to several state-of-the-art benchmarking methods.
Real-world network data such as social networks, linguistic (word co-occurrence)
networks and communication networks can be modeled as graphs. Graph network
data allow relational knowledge about interacting entities to be stored and accessed
efficiently [2]. Analyzing these graph networks can provide significant insights into community detection [27], behavior analysis and many other useful applications. Graph analysis has been widely used in node classification [5], link prediction [57] and clustering [21]. In the node classification task, the labels of nodes are determined based on the labels of their neighboring nodes and the network topology. Missing links, or links that are likely to occur, are predicted in the link prediction task. Clustering is used for grouping nodes of similar characteristics. In this paper, we study the problem of node (or vertex) prediction in language networks. Node prediction deals with assigning labels to each vertex based on vertex contents and link structures. Various techniques have been proposed to solve this problem. The Graph-based Recurrent Neural Network (GRNN) model [94] offers the state-of-the-art performance among all these techniques [77]. However, training neural network models is time consuming and hardware demanding due to the fact that the training process aims to find thousands or even millions of parameters using back propagation (BP) [50]. Neural network models are also sensitive to adversarial attacks [37] and overfitting [85]. Therefore, we aim to find an alternative that has similar or better performance but demands much less training time.
Language network data often come in high-dimensional irregular form. This makes
them more difficult to analyze than the traditional low-dimensional corpora data. PCA is
one of the most widely used techniques for dimensionality reduction [3] and data anal-
ysis. However, most PCA-based concepts are defined for signals lying in the Euclidean
space. They are not directly applicable to network data. In addition, real-world network
data can be corrupted by stochastic or deterministic noise. For example, in collaborative
filtering, collected user ratings could contain noise [23] since the data collection process
might not be properly controlled. To deal with noise in the network data, we develop a
new Graph PCA method by adding a noise term. The new method can extract the low-
rank and sparse term of the original data so the learned representation is more robust
against noise.
There are two main contributions in this research. First, we developed a framework
that combines text-assisted DeepWalk method with Graph PCA to generate a more accu-
rate vector representation for language networks. The dimension of the learned vector
representation is reduced compared to the original dimension to allow fast processing.
We call it the DeepWalk-assisted Graph PCA (DGPCA) method. Second, to the best of
our knowledge, this is the first work that applies a noise term to the robust graph PCA
method so as to reduce errors and increase node prediction accuracy in language net-
works. We evaluate the proposed DGPCA method on three language network datasets.
DGPCA offers state-of-the-art performance in conducted experiments.
The rest of this paper is organized as follows. We first introduce the problem of inter-
est and explain how language networks are represented. Then a text-based DeepWalk
algorithm is presented to capture the structure of the original network, especially on its
structure and content information. Next, we present the DGPCA method that brings the
merits of the text-associated DeepWalk and the Graph PCA together. The accuracy and runtime of the proposed method are demonstrated by experimental results and analysis.
Finally, concluding remarks are given.
4.2 Graph Network Representation Learning
The primary input to our representation learning method is a language network represented by a graph $G = (V, E)$. This graph consists of a set of vertices, $V = \{v_1, v_2, \ldots, v_n\}$, and a set of edges, $E = \{e_{i,j}\}$, where edge $e_{i,j}$ connects vertex $v_i$ to vertex $v_j$. Graph networks are usually represented by an adjacency matrix or a derived vector space representation [35]. The adjacency matrix $A$ of graph $G$ contains the non-negative weights associated with each edge, $a_{ij} \ge 0$. If $v_i$ and $v_j$ are not directly connected to one another, $a_{ij} = 0$. For undirected graphs, $a_{ij} = a_{ji}$ for all $1 \le i \le j \le n$. The goal is to use the information in the graph to find $X_i = \{x_1, x_2, \ldots, x_n\}$, the feature vector associated with vertex $v_i$, and $l_i$ ($l_i \in L$, the set of all labels), the label or class that $v_i$ is assigned to. The node prediction task attempts to find an appropriate label for any vertex $v_t$ so that the probability of a given vertex $v_k$ belonging to one class is maximized. To put it in a mathematical representation, we are trying to solve
$$\hat{l}_k = \arg\max_{l_k \in L} P_\theta(l_k \mid v_k, G).$$
Network learning for graphs aims to find the distributed vector representation $X_i$ for
every vertex in a network. Each vertex can yield rich information about the topology,
structure and content of the network. In the learned network vector representation, the
relationship among nodes is captured by the distances between nodes in the vector space
[19]. Nodes with similar traits should be close to each other in the learned vector space.
The topological and structural characteristics of a node should also be encoded into the
learned vector.
However, obtaining an accurate vector representation is challenging because choos-
ing which properties to embed is not easy given the plethora of distance metrics and
topological properties for graphs. In addition, finding the optimal dimension for the
representation is difficult. Many networks are large, and most graph data lie in a space
of hundreds-to-thousands dimensions or even more. Therefore, an embedding method
should also be scalable. To deal with the curse of dimensionality, we want to embed
the data in a subspace of lower dimension. DeepWalk [70] is one of the state-of-the-art network representation learning methods, while text-associated DeepWalk (TADW) [99]
is an improved DeepWalk method for text data. TADW incorporates the text features of
vertices into network representation learning using the matrix factorization method.
4.2.1 DeepWalk-based Vertex Representation
DeepWalk is one of the most widely used network representation learning methods for graph embedding. In DeepWalk, a target vertex, $v_i$, is said to belong to a sequence $S = \{v_1, \ldots, v_{|S|}\}$ sampled from a random walk if $v_i$ can reach any vertex in $S$ within a certain number of steps. The set of vertices, $V_s = \{v_{i-t}, \ldots, v_{i-1}, v_{i+1}, \ldots, v_{i+t}\}$, is the context of center vertex $v_i$ with a window size of $t$. DeepWalk aims to maximize the average logarithmic probability of all vertex-context pairs in a random walk sequence $S$ using the following equation:
$$\frac{1}{|S|} \sum_{i=1}^{|S|} \sum_{-t \le j \le t,\, j \ne 0} \log p(v_{i+j} \mid v_i), \qquad (4.1)$$
where $p(v_j \mid v_i)$ is calculated using the softmax function. It is proven that DeepWalk is equivalent to factoring a matrix [99], $M \in \mathbb{R}^{|V| \times |V|}$, via
$$M = W^{T} H, \qquad (4.2)$$
where each entry $M_{ij}$ is the logarithm of the average probability that vertex $v_i$ can reach vertex $v_j$ in a fixed number of steps, $W \in \mathbb{R}^{k \times |V|}$ is the vertex representation for matrix factorization, and the information in $H \in \mathbb{R}^{k \times |V|}$ is rarely utilized in the classical DeepWalk model.
TADW improves upon classical DeepWalk by factorizing $M$ into three matrices: $W \in \mathbb{R}^{k \times |V|}$, $H \in \mathbb{R}^{k \times f_t}$ and text features $T \in \mathbb{R}^{f_t \times |V|}$, as shown in Eq. (4.3). In TADW, $W$ and $HT$ are concatenated as the representation for vertices:
$$M = W^{T} H T. \qquad (4.3)$$
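A minimal numpy sketch of the matrix view in Eq. (4.2) is given below; it approximates $M$ by $(A + A^2)/2$ as mentioned in Sec. 4.3.5 and obtains a rank-$k$ factorization from a truncated SVD (the function name and the toy adjacency matrix are assumptions, and the SVD is used here only for illustration rather than the solvers used in this chapter):

```python
import numpy as np

def deepwalk_factorization(A, k):
    """Sketch of the matrix view of DeepWalk (Eq. 4.2): approximate the
    proximity matrix M (here by (A + A^2)/2) with a rank-k factorization
    M ~= W^T H obtained from a truncated SVD."""
    M = (A + A @ A) / 2.0
    U, s, Vt = np.linalg.svd(M)
    W = np.sqrt(s[:k])[:, None] * U[:, :k].T   # k x |V|
    H = np.sqrt(s[:k])[:, None] * Vt[:k, :]    # k x |V|
    return W, H, M

# Toy 6-vertex undirected graph (hypothetical adjacency matrix).
rng = np.random.default_rng(0)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T
W, H, M = deepwalk_factorization(A, k=3)
print(np.linalg.norm(M - W.T @ H))   # rank-3 reconstruction error
```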
Homophily, Structure, and Content Augmented Network Representation Learning (HSCA) is an improvement upon the TADW model which uses Skip-Gram and hierarchical Softmax to learn a distributed word representation. The objective function for HSCA can be written as:
$$\min_{W,H} \; \|M - W^{T} H T\|_F^2 + \frac{\lambda}{2}\left(\|W\|_F^2 + \|H\|_F^2\right) + \mu\left(R_1(W) + R_2(H)\right). \qquad (4.4)$$
In the above, the first term aims to minimize the matrix factorization error of DeepWalk. The second term imposes the low-rank constraint on $W$ and $H$ and uses $\lambda/2$ to control the trade-off. The last regularization term enforces the structural homophily between connected nodes in the network. The regularization term $R(W,H)$ makes connected nodes close to each other in the learned network representation. $R(W,H)$ is defined as:
$$R(W,H) = \frac{1}{4} \sum_{i=1,j=1}^{|V|} A_{i,j} \left\| \begin{bmatrix} w_i \\ H t_i \end{bmatrix} - \begin{bmatrix} w_j \\ H t_j \end{bmatrix} \right\|_2^2. \qquad (4.5)$$
In the equations above, $\|\cdot\|_2$ is the $\ell_2$ norm and $\|\cdot\|_F$ is the matrix Frobenius norm. The algorithm for finding $W$ and $H$ is shown in Algorithm 1. Conjugate Gradient (CG) [65] is used for updating $W$ and $H$.
4.2.2 Graph Principal Component Analysis (GPCA)
Principal Component Analysis (PCA) has been widely used in dimensionality reduction and many other data analysis tasks. PCA uses an orthogonal transformation to convert variables into linearly uncorrelated principal components. For input data $X$ of dimension $n \times t$, PCA generates a linear subspace of dimension $n \times d$, where $d$ is smaller than $t$, so that all the data lie close to the original data in Euclidean distance.
Algorithm 1: W, H Solver Algorithm
Input: A, k, n, f_t, epsilon, lambda, MaxIteration
HWSolver(A, k, n, f_t, epsilon, lambda)
  W = random(k, n)
  H = random(k, f_t)
  while iteration < MaxIteration do
    W_0 = W, H_0 = H
    W = Update(W, A, lambda)
    H = Update(H, A, lambda)
    Delta_W = W - W_0, Delta_H = H - H_0
    val = trace(Delta_W Delta_W^T) + trace(Delta_H Delta_H^T)
    if val < epsilon then
      break
    end if
  end while
  return W, H
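The sketch below mirrors the structure of Algorithm 1 (alternating updates of W and H with a convergence check on the change between iterations); plain gradient steps are substituted for the Conjugate Gradient updates and the homophily regularizer is omitted, so it is only an illustrative approximation of the solver, not the original implementation:

```python
import numpy as np

def hw_solver(A, T, k, lam=0.2, lr=0.01, eps=1e-4, max_iter=500):
    """Alternately update W and H to fit M ~= W^T H T (Eq. 4.3) with an
    l2 penalty, stopping when the change between iterations is small."""
    M = (A + A @ A) / 2.0                          # proximity matrix
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(k, A.shape[0]))
    H = rng.normal(scale=0.1, size=(k, T.shape[0]))
    for _ in range(max_iter):
        W0, H0 = W.copy(), H.copy()
        R = M - W.T @ (H @ T)                      # residual
        W = W - lr * (-(H @ T) @ R.T + lam * W)    # update W with H fixed
        R = M - W.T @ (H @ T)
        H = H - lr * (-(W @ R) @ T.T + lam * H)    # update H with W fixed
        dW, dH = W - W0, H - H0
        val = np.trace(dW @ dW.T) + np.trace(dH @ dH.T)
        if val < eps:                              # convergence check
            break
    return W, H

# Toy usage: 6 vertices, 10 hypothetical text features.
rng = np.random.default_rng(1)
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T
T = rng.random((10, 6))
W, H = hw_solver(A, T, k=4)
```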
In the reduced subspace, most of the variability of the data is preserved. PCA solves the following optimization problem:
$$\min_{U_d, y_i} \sum_{i=1}^{t} \|x_i - U_d y_i\|^2, \qquad (4.6)$$
where $U_d$ is the orthonormal matrix that spans the subspace $\mathbb{R}^{n \times d}$. The solution to Eq. (4.6) is $y_i = U_d^{T} x_i$. $U_d$ can be found from the singular value decomposition (SVD) of $X = U \Sigma V^{T}$.
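A minimal numpy sketch of Eq. (4.6), with the data points $x_i$ stored as the columns of $X$ and $U_d$ taken from the SVD, is given below (the toy data are hypothetical):

```python
import numpy as np

def pca_subspace(X, d):
    """PCA per Eq. (4.6): columns of X are the data points x_i (n x t),
    U_d (n x d) spans the subspace, and y_i = U_d^T x_i."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_d = U[:, :d]                 # orthonormal basis from the SVD of X
    Y = U_d.T @ X                  # reduced coordinates y_i, one per column
    X_approx = U_d @ Y             # best rank-d reconstruction
    return U_d, Y, X_approx

# Toy data: t = 100 points in n = 5 dimensions, reduced to d = 2.
X = np.random.default_rng(0).normal(size=(5, 100))
U_d, Y, X_approx = pca_subspace(X, d=2)
print(np.linalg.norm(X - X_approx))   # rank-2 approximation error
```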
Robust PCA on graphs [10] has the advantage of eliminating outliers by recovering a low-rank matrix $L$ from corrupted data $M$. It has been used in many applications such as video surveillance, face recognition, ranking and collaborative filtering. However, its application to language networks is rarely studied. Graph network data, $X$, can be decomposed into $L$ and $S$, where $L$ is the low-rank matrix and $S$ is the sparse matrix. If we do not know the dimensions of $L$ and $S$, we need an efficient way to recover the low-rank and sparse components accurately. Classical principal component analysis (PCA) [9] finds the best rank-$k$ estimate of $L$ by minimizing $\|X - L\|$ subject to $\mathrm{rank}(L) \le k$. Principal Component Pursuit (PCP) recovers the low-rank matrix $L$ and the sparse matrix $S$ by solving $\min \|L\|_* + \lambda \|S\|_1$, where $L + S = M$. All of these attempt to recover a low-rank representation $L$ from corrupted data. For language networks, we are more concerned about noise in the original data rather than data corruption. Therefore, we need to incorporate the noise factor in Graph PCA.
4.2.3 DeepWalk-Assisted Graph PCA (DGPCA)
In the DGPCA model, the original data matrix is first factorized into $W$, $H$ and $T$ using Eq. (4.3). Then, $W$ and $HT$ are concatenated as the representation vector for each vertex [107]. In the Graph PCA step, we use the new observation model:
$$M = L + S + N, \qquad (4.7)$$
where $N$ is the perturbation and error term. It is used to represent noise contained in the original dataset. To make it more general, we have
$$M = W_1(L) + W_2(S) + W_3(N),$$
where $W_1$, $W_2$ and $W_3$ are known linear maps [110]. To recover the matrices $L$ and $S$, our proposed model solves the following optimization problem:
$$\min_{L,S} \; \left(\|L\|_* + \lambda \|S\|_1\right) \qquad (4.8)$$
subject to
$$\|M - L - S\|_F \le \epsilon, \qquad (4.9)$$
where $\|\cdot\|_1$ and $\|\cdot\|_F$ are the $\ell_1$ norm and the Frobenius norm of matrices, respectively, and $\epsilon$ is the bound on the noise term.
After applying Graph PCA to the new representation using Eq. (4.8) and the constraint in Eq. (4.9), we obtain the low-rank representation $L$, the sparse representation $S$ and the noise $N$. For the final representation, we set $X = L + S$ with the noise term removed. For the classification task, we use a semi-supervised classifier, known as the Transductive SVM (TSVM), to test our method.
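The sketch below illustrates the decomposition $M \approx L + S + N$ with a simple alternating singular-value/soft thresholding heuristic; it is not the exact solver of Eqs. (4.8)-(4.9), and the function names, thresholds and toy matrix are assumptions made for illustration:

```python
import numpy as np

def soft_threshold(X, tau):
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def dgpca_decompose(M, lam=0.1, tau=1.0, iters=50):
    """Heuristic sketch of M ~ L + S + N: singular-value thresholding
    recovers a low-rank L, entrywise soft thresholding recovers a sparse S,
    and the residual is treated as the noise term N."""
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    for _ in range(iters):
        # Low-rank step: shrink the singular values of (M - S).
        U, s, Vt = np.linalg.svd(M - S, full_matrices=False)
        L = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
        # Sparse step: entrywise shrinkage of (M - L).
        S = soft_threshold(M - L, lam)
    N = M - L - S
    return L, S, N

# Toy matrix: low-rank plus a few spikes plus small noise (hypothetical).
rng = np.random.default_rng(0)
M = rng.normal(size=(20, 4)) @ rng.normal(size=(4, 20))
M[rng.integers(0, 20, 10), rng.integers(0, 20, 10)] += 5.0
M += 0.01 * rng.normal(size=(20, 20))
L, S, N = dgpca_decompose(M)
X = L + S          # final representation with the noise term removed
```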
Figure 4.1: The system diagram of the proposed DGPCA method.
4.3 Experiments
4.3.1 Datasets
To evaluate the performance of the proposed DGPCA method, we evaluated its classifi-
cation accuracy using the following three datasets.
Citeseer: It is a citation indexing dataset consisting of academic literature from six different categories. It contains 3,312 documents and 4,723 links [29]. Each
document is represented by a binary vector of 3,703 dimensions.
Cora: It consists of 2,708 scientific publications of seven different classes. The
network has 5,429 links that indicate citation relations among all documents
[60]. Each document is represented by a 0/1-valued word vector indicating the
absence/presence of the corresponding word from a dictionary consisting of 1,433 unique words. As a result, each document is represented by a binary vector of
1,433 dimensions.
WebKB: It contains seven classes of web pages collected from computer science
departments, including student, faculty, course, project, department, staff and oth-
ers. It has 877 web pages and 1,608 hyper-links between web pages [16]. Each
page is represented by a binary vector of 1,703 dimensions.
4.3.2 Benchmarking Methods
We compare the DGPCA method with the following three benchmarking methods.
Logistic Regression (LR): It uses a logistic regression model to predict the label of each vertex based on its attributes. The LR model calculates the class probability [22] for each category and chooses the class with the highest calculated probability.
Graph-based Recursive Neural Network (GRNN): It first builds a tree from the graph using the breadth-first search method [94]. It then applies a long short-term memory (LSTM) [41] based Recursive Neural Network (RNN) [81] model to predict vertex labels.
Text-associated Deep Walk (TADW): It incorporates text features of vertices using
matrix factorization for vertex classification.
Homophily, Structure, and Content Augmented Network Representation Learning
(HSCA): It is improved upon TADW with a regularization term.
4.3.3 Experimental Setup
The three datasets are split into training and testing sets with proportions varying from
70% to 90%. All optimization algorithms converge within 50 iterations on the training
data. All experiments are conducted on a Ubuntu 14.04 machine with an Intel Core
i7-5930K CPU and 32 GB memory.
Figure 4.2: Comparison of four methods on three data sets (from left to right: Citeseer,
Cora and WebKB), where the x-axis is the percentage of training data and the y-axis is
the average Macro-F1 score.
4.3.4 Experimental Results
The Macro-F1 scores of all four methods on the three datasets are compared in Figure
4.2. DGPCA consistently outperforms all other three benchmarking methods. GRNN
has the second-best performance. By comparing the performance of our method and
GRNN, we see that DGPCA has a gain up to 6.5%, 7.5% and 4.1% for Citeseer, Cora
and WebKB, respectively. The improvement is the highest for the Cora dataset. In the
Cora dataset, neighboring vertices tend to share the same label. It means that labels are
closely correlated among near neighbors. We see from accuracy comparison that our
DGPCA method captures the short range information very well while reducing noise in
the original data.
4.3.5 Run-time Analysis
Real-world network data are usually sparse. For this reason, we use the approximation $O(|V|) \approx O(|E|)$. The computation of the matrix $M$ has a time complexity of $O(|V|^2)$ because $M$ is approximated using $(A + A^2)/2$, where $A$ is the transition matrix used in PageRank [68]. The optimization of the matrix factorization takes $O(\mathrm{nnz}(M) + |V| k^2)$ time [99], where $\mathrm{nnz}(\cdot)$ is the number of non-zero entries of $M$ and $k$ is the rank. In the PCA step, the runtime is reduced to $O(k^3)$ [111], where $k$ is smaller than $|V|$. Thus, the overall runtime is $O(|V| k^2)$.
We use GRNN as the benchmarking method since it is one of the state-of-the-art machine learning methods for vertex classification. It also gives the second-best accuracy in Figure 4.2. The network conversion in GRNN uses the breadth-first search (BFS) method to convert a graph to a tree. This conversion step has a time complexity of $O(b^d)$, where $b$ is the maximum branching factor of the tree and $d$ is the depth. In each training step, the time complexity of updating a weight is $O(1)$. The overall LSTM algorithm has an update complexity of $O(W)$ per time step, where $W$ is the number of weights to be updated. For the whole training process, the run-time complexity is $O(W \cdot i \cdot e)$, where $i$ is the input length and $e$ is the number of epochs. Then, the overall time complexity for GRNN is $O(b^d + W \cdot i \cdot e)$ [11].
Table 4.1: Comparison of run-time of DGPCA and GRNN on three data sets
Training Percentage        70%      75%      80%      85%      90%
DGPCA   Citeseer          116.04   107.07   109.00   107.88   114.08
        Cora               83.91    80.73    79.72    81.33    84.22
        WebKB              45.47    42.55    46.55    46.51    47.46
GRNN    Citeseer          721.40   775.41   822.97   877.27   929.42
        Cora              325.46   347.66   368.10   391.83   412.01
        WebKB             121.35   134.15   142.91   153.30   148.66
Run times are recorded in seconds (s).
The actual running time of the two methods for each data set is shown in Table 4.1. The CPU run-time shows that our method is faster than GRNN, saving 83.9%, 74.2% and 62.5% of the training time on average for Citeseer, Cora and WebKB, respectively.
4.3.6 Comparison with Other Machine Learning Methods
Recursive Neural Tensor Network (RNTN) [81] was demonstrated to be effective in training non-linear data structures. In the previous section, we discussed the Graph-based Recurrent Neural Network, which utilizes the RNTN based on local sub-graphs generated from the original network structure. Graph-based LSTM (G-LSTM) is a variation of GRNN. G-LSTM first builds a simple tree using BFS traversal and then applies an LSTM to the tree for vertex classification [95]. Later, the GRNN was improved by adding an attention layer to the Graph-based Recursive Neural Network. The improved model
is called Attentive Graph-based Recursive Neural Network (AGRNN) [94]. AGRNN
adds a soft attention weight to GRNN so that the new model focuses on the more rele-
vant inputs in the network. G-LSTM and AGRNN are demonstrated to perform better
than GRNN. In Table 4.2, we compared both the macro-F1 score and the runtime among
our method, G-LSTM, and AGRNN. From the results, we can see that our method has the best or close to the best classification accuracy as well as the lowest runtime.
Table 4.2: Comparison of the Macro-F1 score and run-time for DGPCA and other GCN based methods on three data sets
                            Citeseer               Cora                   WebKB
Training Percentage  Model  Macro-F1  Runtime      Macro-F1  Runtime      Macro-F1  Runtime
70%                  DGPCA  0.7537    116.04       0.8861    83.91        0.8636    65.47
                     G-LSTM 0.7512    714.85       0.8356    348.52       0.8743    124.29
                     AGRNN  0.7554    1074.38      0.8465    648.93       0.8474    325.05
75%                  DGPCA  0.7582    107.67       0.8821    80.73        0.8681    62.55
                     G-LSTM 0.7589    762.83       0.8327    356.83       0.8623    132.49
                     AGRNN  0.7684    1183.45      0.8483    659.43       0.8486    384.34
80%                  DGPCA  0.7662    109.00       0.8856    79.72        0.8977    66.55
                     G-LSTM 0.7487    814.73       0.8318    368.49       0.8998    141.45
                     AGRNN  0.7696    1238.47      0.8349    674.39       0.8612    412.23
85%                  DGPCA  0.7657    107.88       0.8813    81.33        0.8863    66.51
                     G-LSTM 0.7623    864.87       0.8434    376.34       0.8753    149.54
                     AGRNN  0.7784    1296.74      0.8482    683.27       0.8732    434.34
90%                  DGPCA  0.7719    114.08       0.8822    84.22        0.8863    67.46
                     G-LSTM 0.7636    943.24       0.8474    384.23       0.8714    157.74
                     AGRNN  0.7714    1315.48      0.8548    673.23       0.8711    482.45
Run times are recorded in seconds (s).
Even though our method does not always have the highest classification accuracy, the difference is very minor. In addition, the run time of our method is much lower than that of the other state-of-the-art machine learning methods in all cases, which further demonstrates the effectiveness of DGPCA.
4.4 Conclusion
A novel vertex classification method, which applies Graph PCA to graph data processed
by text-based DeepWalk, was proposed in this work. The proposed DGPCA method can
capture the neighborhood information of a node well and decrease noise in the original
data. The representation learning for language networks data used in DGPCA is not
only accurate, but also efficient. The effectiveness of the DGPCA method was demon-
strated by experimental results on three language datasets with different training ratios.
In the future, we would like to apply the proposed methodology to networks of a larger
scale and higher diversity such as social network data. While our proposed method is
highly scalable in theory, there is still significant work to be done in embedding massive
datasets with billions of nodes and edges. We would also like to extend our graph rep-
resentation method to operate on a wide range of other emerging application domains.
Nowadays, many application domains involve highly dynamic graphs. Embedding tem-
poral graphs with timing information along the edges is also an important task that we
would like to solve in the near future.
Chapter 5
GraphHop: A Successive Subspace
Learning (SSL) Method for Graph
Vertex Classification
5.1 Introduction
An effective and explainable graph vertex classification method, called GraphHop, is
proposed in this work. Unlike the graph convolutional network (GCN), GraphHop gen-
erates an effective representation for each vertex in graphs without backpropagation.
GraphHop determines the local-to-global attributes of each vertex through successive
one-hop information exchange. To control the rapid increase of the dimension of vertex
attributes, the Saab transform is adopted for dimension reduction. Afterwards, it ensem-
bles vertex attributes for the classification task, where multiple Graph-Hop units are
cascaded to obtain the high order proximity information. GraphHop is applied to three
real-world graph datasets and shown to offer state-of-the-art classification performance
at much lower training complexity.
Graphs provide a powerful data structure to describe the relationship between mul-
tiple objects. The graph representation has been widely used in various application
domains. For example, keywords and citations of articles can be well represented by
graphs, where each article can be represented by a vertex. Keywords provide vertex
attributes while citations are characterized by link structures. Namely, if there is a citation relationship between two articles, an undirected link is assigned to their associated vertices. To give another example, text and/or speech data in multimedia social
networks can also be well represented by graphs.
Vertex classification in graph data is one of the most fundamental problems in
machine learning. Vertex classification in graphs aims at assigning a category to each
vertex based on vertex attributes and link structures. For example, both keywords and
citation relationship are useful in article categorization in a citation dataset. Yet, it is
challenging to determine proper contributions from vertex attributes and link structures.
This actually depends on an individual case since the information from one domain
(either vertex attributes or link structures) could be confusing while that from the other
domain is clear. A good solution needs to combine both information properly and adap-
tively.
Many machine learning techniques have been proposed to solve the vertex classi-
fication problem for decades. The first step is to learn a graph representation, which
is called graph embedding. The simplest embedding is to represent an attribute of a
vertex with one dimension. Suppose that the average number of attributes per vertex is $N_a$ and the total number of vertices is $N$. Consequently, the graph has a dimension of $N_a \times N$ for vertex attributes. Furthermore, to take the link structure into account, we can construct an adjacency matrix of dimension $N \times N$, whose elements $l_{i,j}$ describe the relationship between vertices $v_i$ and $v_j$. Thus, the total dimension of a graph is equal to $(N_a + N) \times N$.
The classification of “the whole graph” and that of “a graph vertex” are two different tasks. To classify an individual vertex in a large graph, it is difficult (yet unnecessary) to examine the entire graph; it suffices to examine the local neighborhood of the target vertex. This can
be achieved by a local sampling technique, e.g., random walks [84, 12]. Random-walk-
based methods sample a graph with multiple paths by starting walks from an initial
vertex. The paths capture the connectivity of adjacent vertices. The randomness gives
the ability to explore the graph and capture the local structural information. DeepWalk
[70] is the most popular random-walk-based graph embedding method. The node2vec
method [36] is a modified version of DeepWalk.
To simplify raw graph representations, one can conduct dimension reduction tech-
niques such as Principal Component Analysis (PCA) [43], Linear Discriminant Analy-
sis (LDA) [105], Multidimensional Scaling (MDS) [72]. They are known as subspace
learning [97] under the linear assumption. Non-linear dimensionality reduction (NLDR)
[19], called manifold learning, can also be used. Examples include Isometric Feature
Mapping (Isomap) [74], Locally Linear Embedding (LLE) [73], and Kernel Methods [39]. By leveraging the sparsity of real-world networks, one can apply the matrix factoriza-
tion technique that finds an approximation matrix for the original graph. After local
sampling and dimension reduction, machine learning can be applied to the dimension-
reduced data space.
Being inspired by the success of recurrent neural networks (RNNs) and convolu-
tional neural networks (CNNs), researchers attempt to generalize and apply them to
graphs. For example, in the context of Natural language processing (NLP), people often
use RNNs to find a vector representation for words. The Word2Vec [61] and the skip-
gram models [63] aim to learn the continuous feature representation of words by opti-
mizing a neighborhood-preserving likelihood function. By following this idea, one can
adopt a similar approach for graph representation, leading to the Node2Vec method [36].
Node2Vec utilizes random walks [84] with a bias to sample the neighborhood of a tar-
get vertex and optimizes its representation using the stochastic gradient descent (SGD)
technique.
CNNs offer another family of neural-network-based graph representation and decision methods. The input can be either the whole graph or paths sampled from a partial graph. For graph representation, some use the original CNN model designed for the Euclidean domain and reformat the graph input to fit it, while others generalize the deep neural model so that it can be applied to non-Euclidean graphs. The graph convolutional network (GCN) [46] has received a lot of attention in recent years. It applies semi-supervised learning to graph-structured data using an efficient variant of convolutional neural networks that operates directly on graphs and learns hidden layer representations that encode both local graph structure and node attributes. It allows end-to-end learning of a graph of arbitrary size and shape. It was shown in [53] that the GCN model is a special form of Laplacian smoothing [24]. More than two convolutional layers tend to lead to over-smoothing, which makes the features of vertices similar to each other.
As a result, they are more difficult to separate from each other. In this work, we adopt
the learnable graph convolutional layer (LGCL) [25] as the basic model, provide an
interpretable design, and convert it to a new GraphHop method based on the successive
subspace learning (SSL) principle.
SSL offers a new machine learning paradigm. It contains three main building blocks:
1) successive subspace transformation (SST), 2) successive subspace partitioning (SSP)
and 3) ensemble classification (EC). In the SST module, GraphHop generates an effec-
tive representation for each vertex of graphs in a feedforward (FF) manner without BP
[48]. It determines attributes of local-to-global neighborhoods of each vertex through
successive one-hop information exchange. This is implemented by multiple Graph-Hop
units in cascade. The high order proximity information can be obtained consequently.
To control the rapid increase of the attribute dimension, a dimension reduction technique
called the Saab transform is used. In the SSP module, the attribute space is partitioned
into several subspaces to lower the cross-entropy loss function as much as possible.
Finally, in the EC module, GraphHop ensembles decisions made based on attributes of
different neighborhoods for vertex classification.
There is an emerging field in signal processing known as graph signal processing
(GSP). GSP aims at the processing of signals defined on a graph, where the signal at
each vertex is often a scalar. It examines graph signal processing techniques such as
filtering, sampling, transform, etc. On one hand, GSP offers useful insights into link
structures of a global or local graph. On the other hand, it does not pay much attention
to vertex attributes. As mentioned earlier, vertex classification in graphs demands good
representations of both vertex attributes and link structures. In this work, we leverage
GSP to provide a richer set of representations of link structures to boost the performance
of the ensemble classifier.
To demonstrate the effectiveness of the GraphHop method, we apply it to three real-
world graph datasets and show that it offers state-of-the-art performance at much lower
training complexity. We consider both heavily and weakly supervised scenarios in the
experiments.
The rest of this paper is organized as follows. Related work is reviewed in Sec. 5.2.
The GraphHop method is introduced in Sec. 5.3. Experimental results are shown in
Sec. 5.4. Follow-up discussion is made in Section 5.5. Finally, concluding remarks and
future research directions are given in Sec. 5.6.
5.2 Related Work
In this section, we provide in-depth review of three topics that are most relevant to our
current work.
5.2.1 Graph Convolutional Network (GCN)
Graph Convolutional Network (GCN) [46] is based on a variant of CNNs that operates directly on graphs, and it learns hidden layer representations that encode both local graph structure and node features. GCN allows end-to-end learning of a graph of any size and shape. This model uses a convolution operator on the graph and iteratively aggregates the embeddings of neighbors for each node. GCN has been widely used for semi-supervised learning on graph-structured data. In the first step of the GCN, a vertex sequence is selected. The neighborhood nodes are assembled and normalized to impose an order on the graph; then convolutional layers are used to learn representations of nodes and edges. The propagation rule used is
$$f(H^{(l)}, \hat{A}) = \sigma\left(\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \qquad (5.1)$$
where $\hat{A} = A + I$ is the adjacency matrix with enforced self-loops, obtained by adding the identity matrix $I$ so that the features of the vertex itself are included, and $\hat{D}$ is the diagonal vertex degree matrix of $\hat{A}$ used to normalize the neighbors of $\hat{A}$. Under the spectral graph theory of CNNs on graphs, GCN is equivalent to the graph Laplacian formulation in the non-Euclidean domain [18]. The eigenvalue decomposition of the normalized graph Laplacian can also be used for tasks such as classification and clustering. However, GCN usually uses only two convolutional layers, and why it works is not well explained.
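A minimal numpy sketch of one propagation step of Eq. (5.1) is given below, with ReLU assumed as the nonlinearity $\sigma$ and a hypothetical toy graph:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step, Eq. (5.1):
    sigma(D_hat^{-1/2} A_hat D_hat^{-1/2} H W) with A_hat = A + I."""
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    d = A_hat.sum(axis=1)                       # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU

# Toy graph with 4 vertices, 3-dimensional features, 2 output channels.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.default_rng(0).normal(size=(4, 3))
W = np.random.default_rng(1).normal(size=(3, 2))
print(gcn_layer(A, H, W).shape)                 # (4, 2)
```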
Learnable graph convolutional layer (LGCL) [25]. For each feature dimension, every vertex in the LGCL method selects a fixed number of features from its neighboring nodes by value ranking. By including the feature vector of the target node itself, a data matrix in a grid-like structure is obtained. Then, a traditional CNN is applied to the grid structure to generate the final feature vector. To embed large-scale graphs, a sub-graph selection method was also proposed to reduce the memory and resource requirements. It begins with randomly sampled nodes; then breadth-first search (BFS) is used to find all first-order neighboring nodes of the initial nodes. At the second iteration, more nodes are randomly selected. The nodes from the two iterations are selected as the input to LGCL.
Emerging methods based on convolutional neural networks (CNNs) are mostly application-oriented without theoretical justification. Although the interpretability of neural network models has been examined by researchers, end-to-end analysis of deep neural network models remains a challenge. Furthermore, the determination of the parameters of neural network models is typically carried out by backpropagation (BP), where millions of parameters have to be optimized iteratively. This is a non-convex optimization problem, and it is mathematically intractable [82]. Besides, the training of deep networks demands a lot of computational resources. To address these shortcomings, we propose an interpretable graph vertex classification method, called the GraphHop method, in this work.
5.2.2 Successive Subspace Learning (SSL)
It is worthwhile to point out that a new machine learning methodology called successive
subspace learning (SSL) has been recently proposed by Chen and Kuo in [14]. SSL has
been applied to image data and 3D point cloud data, leading to the PixelHop method [15]
and the PointHop method [108], respectively. We show that SSL can also be applied to
graph data in this work. The resulting method is called the GraphHop method. One
common idea behind these methods is an effective representation learning followed by
an ensemble decision.
5.3 Proposed GraphHop Method
In this section, we introduce the GraphHop method for generic graph data. The block diagram of the GraphHop architecture is shown in Figure 5.1. It takes a graph $G$ as input and outputs the corresponding class label. The input to our architecture is a graph denoted by $G = (V, E)$, which consists of a set of vertices, $V = \{v_1, v_2, \ldots, v_n\}$, and a set of edges, $E = \{e_{i,j}\}$, where edge $e_{i,j}$ connects vertex $v_i$ to vertex $v_j$. Let $X_i = \{x_1, x_2, \ldots, x_n\}$ be the feature vector associated with vertex $v_i$, $l_i$ be the label or class that $v_i$ is assigned to, and $L$ the set of all labels. The vertex prediction task attempts to find an appropriate label for any vertex $v_t$ so that the probability of a given vertex $v_k$ with label $l_k$ is maximized. Mathematically, we have
$$\hat{l}_k = \arg\max_{l_k \in L} P_\theta(l_k \mid v_k, G). \qquad (5.2)$$
Our proposed architecture first preprocesses the graph data using the spectral clustering
method described in Section 5.3.1. The processed features are fed into a sequence
of GraphHop units in cascade to obtain the attributes of the ith GraphHop unit. Each
GraphHop unit contains a local-to-global attribute building stage described in Section 5.3.2 and
a 1-D Saab dimension transformation stage specified in Section 5.3.3. The resulting
outputs from the multiple GraphHop units are aggregated. Finally, these attributes are fed
through a classification and ensemble part, which is explained in greater detail
in Section 5.3.5.
5.3.1 Pre-process of Vertex Feature
Vertex attributes play a very important role in graph representation learning. In the
citation datasets, we have observed that vertex attributes are mostly formed by one-hot
encoding of paper keywords.
Figure 5.1: The workflow of GraphHop architecture.
This pushes the dimension of the vertex attributes to a high level, which is not favorable
for most machine learning models. Furthermore, the relationship between keywords in
this scenario has not been widely studied before. Therefore, we propose two methods to
pre-process the sparse vertex attributes: Shifted Positive Pointwise Mutual Information
(SPPMI) and spectral embedding [67].
Shifted Positive Pointwise Mutual Information
Pointwise mutual information (PMI) is widely used to quantify the dependence between two
random variables. Here we view keywords x and y as two random variables. Pointwise
mutual information is defined as
PMI(x, y) = \log \frac{P(x, y)}{P(x) P(y)}.
In the NLP literature, the perceived similarity of two words is influenced more by the
positive mutual information values. Therefore, the negative values are usually set to zero as
an 'uninformative' representation, which gives the positive pointwise mutual information:
PPMI(x, y) = \max(PMI(x, y), 0).
The shifted positive pointwise mutual information is then defined as
SPPMI(x, y) = \max(PMI(x, y) - \log k, 0),
which is derived through the theoretical analysis of the word2vec model; for more details, we
refer to [52]. Through matrix factorization of the SPPMI matrix, a word embedding for
each keyword is obtained. We then represent the node attribute as the average of all
its keyword embeddings.
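A minimal sketch of this pre-processing step is given below, assuming a binary node-by-keyword matrix X, keyword co-occurrence counts as the joint statistics, and truncated SVD as the matrix factorization; these are illustrative choices and other factorizations can be substituted.

import numpy as np

def sppmi_embeddings(X, k_shift=5, dim=64):
    """Build an SPPMI matrix from a binary node-by-keyword matrix X and
    factorize it to obtain one embedding per keyword."""
    C = X.T @ X                                   # keyword co-occurrence counts
    total = C.sum()
    p_xy = C / total                              # treat counts as a joint distribution
    p_x = C.sum(axis=1, keepdims=True) / total    # marginal per keyword
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x @ p_x.T))
    pmi[~np.isfinite(pmi)] = 0.0                  # zero out undefined entries
    sppmi = np.maximum(pmi - np.log(k_shift), 0.0)
    U, S, _ = np.linalg.svd(sppmi)                # truncated SVD as the factorization
    return U[:, :dim] * np.sqrt(S[:dim])          # one row per keyword

def node_attributes(X, keyword_emb):
    """Average the embeddings of the keywords present in each node."""
    counts = X.sum(axis=1, keepdims=True).astype(float)
    counts[counts == 0] = 1.0                     # avoid division by zero
    return (X @ keyword_emb) / counts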
Spectral Embedding
Given the feature matrix X, we build the co-occurrence matrix C of keywords by
C = X^T X.
The co-occurrence matrix C is the adjacency matrix of the keyword graph. Spectral
embedding is then applied to C to find an embedding for each keyword. The embedding
of the i-th keyword is v_i. The final vertex attributes are computed by
\hat{x}_i = V x_i,
where V = [v_1, v_2, \ldots, v_m] and m is the size of the keyword vocabulary.
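A minimal sketch of this step is given below; it uses scikit-learn's SpectralEmbedding (a Laplacian-eigenmap style implementation) as one possible realization of the spectral embedding and zeroes the diagonal of C, both of which are illustrative assumptions.

import numpy as np
from sklearn.manifold import SpectralEmbedding

def spectral_preprocess(X, dim=64):
    """Embed keywords via spectral embedding of the co-occurrence graph,
    then map each node attribute vector into the keyword-embedding space."""
    C = (X.T @ X).astype(float)      # keyword co-occurrence / adjacency matrix
    np.fill_diagonal(C, 0.0)         # drop self co-occurrence
    se = SpectralEmbedding(n_components=dim, affinity="precomputed")
    V = se.fit_transform(C)          # one dim-dimensional embedding per keyword (rows)
    return X @ V                     # hat{x}_i: sum of embeddings of present keywords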
In this work, we tested both pre-processing methods. Both provide better
attribute representations than the original node attributes.
5.3.2 Local-to-Global Attribute Building
Based on the assumption that directly connected nodes in the graph share more
similarity than unconnected nodes, the propagation of attributes between
neighboring nodes is important. Initially, the attribute of each vertex is a dense 1D
vector obtained by spectral embedding. To obtain the propagated attributes, we apply a
graph convolution operator to each node. [38] and [46] propose an approximation of
graph convolution that preserves the first-order Chebyshev polynomial of the Laplacian
eigenvectors. The linear convolution operator L is defined as L = I + D^{-1/2} A D^{-1/2},
where A is the adjacency matrix and D is the diagonal degree matrix. The operator is
then normalized to have unit row sum as L = \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}, where \hat{A} = A + I and \hat{D}
is the diagonal matrix of the row sums of \hat{A}.
The addition of the identity matrix enforces self-loops in the graph so that the feature
vectors of the neighboring nodes, including the node itself, are summed.
Graph convolution is used to propagate the spectral components between
neighboring nodes. L^m X is the m-hop propagation of attributes in the spectral domain.
The larger m is, the more global information can be exchanged. However, the
attributes will be over-smoothed when the chosen m is too large. The concatenation of
all m-hop propagated attributes, X_{prop} = [X, LX, L^2 X, \ldots, L^m X], is used. Then PCA is
used to reduce the dimension.
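A minimal sketch of the m-hop propagation followed by PCA is given below; the values of m and the output dimension are illustrative, and the output dimension is assumed to be no larger than the number of vertices.

import numpy as np
from sklearn.decomposition import PCA

def propagate_attributes(A, X, m=3, out_dim=25):
    """Concatenate [X, LX, ..., L^m X] with L = D^-1/2 (A+I) D^-1/2,
    then reduce the dimension with PCA."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = D_inv_sqrt @ A_hat @ D_inv_sqrt          # normalized convolution operator
    hops, H = [X], X
    for _ in range(m):
        H = L @ H                                # next-hop propagation
        hops.append(H)
    X_prop = np.concatenate(hops, axis=1)        # [X, LX, ..., L^m X]
    return PCA(n_components=out_dim).fit_transform(X_prop)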
To exchange spatial similarity, we pick the top k one-hop neighboring nodes based
on max pooling for the target node. For example, to build the neighborhood
information for node 1, we concatenate the feature vector of node 1 with those of its one-hop
neighbors, where the maximum values are picked first. The top-k neighbor selection
is shown in Figure 5.2. Consider building the top-k neighborhood for node 1. Node 1
has 3 adjacent nodes, 2, 4 and 5, and each node has 7 features represented by a 7-component
feature vector. Suppose we choose k to be 2. For each feature, the 2 largest values are selected
from the neighborhood nodes. For example, the result of the first selection pass is
[3,6,9,6,7,5,9], which contains the maximum value for features 1 through 7 among
all neighborhood nodes. Repeating the same process, we obtain the second largest
values, [2,2,4,4,5,4,8], as the second row. Concatenating them with the features of node
1, we get the final feature vector. If we use a k greater than the number of neighborhood
nodes, zero padding is used. The resulting vector representation after the attribute
building is shown at the bottom of Figure 5.2.
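A minimal sketch of the top-k neighbor selection illustrated in Figure 5.2 is given below; the helper name and the zero-padding convention follow the description above and are otherwise illustrative.

import numpy as np

def top_k_neighbor_features(A, X, target, k=2):
    """Stack, for each feature, the k largest values among the one-hop neighbors
    of the target vertex (zero-padded if the degree is smaller than k), then
    concatenate them with the target's own feature vector."""
    neighbors = np.flatnonzero(A[target])              # one-hop neighbor indices
    feats = X[neighbors]                               # (num_neighbors, d)
    if feats.shape[0] < k:                             # zero padding for low-degree nodes
        pad = np.zeros((k - feats.shape[0], X.shape[1]))
        feats = np.vstack([feats, pad])
    top_k = -np.sort(-feats, axis=0)[:k]               # per-feature k largest values
    return np.concatenate([X[target], top_k.ravel()])  # (k+1)*d vector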
Figure 5.2: Vertex representation with max pooling.
In our GraphHop design, we chose maximum pooling because it outperforms average
pooling. The pooling operation can be interpreted as a filtering operation that preserves
the more important features and filters out the less significant ones. Maximum
pooling is also used to reduce computational and storage resources by reducing the
spatial dimension. Without maximum pooling, concatenating the direct neighbors of node 1
would yield a 1-by-28 feature vector; with k set to 2, a 1-by-21 feature vector is obtained
instead. This is especially useful for nodes with a large degree. Without this operation,
the feature would be very large and hard to process.
5.3.3 1D Saab Transformation
In each GraphHop unit, the dimension of a node's attribute vector grows drastically because
of the local-to-global attribute building. The increasing dimensionality of the vertex
attribute vector is reduced by using a 1D Saab (Subspace Approximation with Adjusted
Bias) transform. The Saab transform [48] was developed to avoid the sign confusion
problem in convolutional neural networks. It was originally used for
2D image data, where image features are extracted from edges, textures or salient points. An
input image usually goes through a sequence of vector space transformations in which a
cascade of spatial-spectral filtering and spatial pooling extracts the discriminant dimensions.
Following this idea, we developed a 1D Saab transform to extract features
from graph data. Instead of using a 2D patch to sample pixels as in image data, we use a
1D window to sample the features of the target vertex. PCA can be used to find a set of
anchor vectors to form a vector space. The linear combinations of the anchor vectors form a
subspace which can represent the vertex feature.
To simulate the multi-stage design in CNNs, we concatenate multiple GraphHop
units to provide a sequence of spatial-spectral transformations that convert an input vertex
to its spatial-spectral representation layer by layer. In the process, local spatial-spectral
features are projected onto PCA-based kernels to enhance the discriminability of some
dimensions, since a spatial-spectral component with a larger receptive field
can be more discriminant. In the GraphHop design, we choose to conduct the Saab
transform in non-overlapping windows for computational and storage efficiency. Even
though some architectures for images such as LeNet-5 [51] use overlapping windows,
which provide a richer set of features, the redundant representation comes with higher
computational and storage costs and is not necessary for our graph data, since our inputs
are combinations of the adjacency matrix and one-hot encodings of keywords.
The Saab transform provides a method to choose the anchor vectors and bias. By
deriving anchor vectors from the statistics of the input data, we can avoid the back-propagation
training procedure, which is built upon the optimization of a function defined
at the system output. The covariance matrix of the input vectors can be computed, and
its eigenvectors can be used as the desired anchor vectors. The Saab transform divides
the input space into a DC subspace and an AC subspace. The AC anchor vectors are
denoted by a_k, which are the affine vectors used in computational neurons when the bias
is set to zero. For an N-dimensional input x = (x_0, x_1, \ldots, x_{N-1}), the k-th neuron has N
filter weights that can be expressed in vector form as a linear combination of filter weights
a_k associated with the k-th neuron and bias b_k. The output y = (y_0, y_1, \ldots, y_{K-1})^T is the
projection of the input vector x onto the subspace spanned by the a_k. The affine computation is
y_k = \sum_{n=0}^{N-1} a_{k,n} x_n + b_k = a_k^T x + b_k, \quad k = 0, 1, \ldots, K-1. (5.3)
The DC subspace is spanned by the anchor vector
a_0 = \frac{1}{\sqrt{N}} (1, \ldots, 1)^T, (5.4)
which is of dimension N = (k + 1)d, where k is the number of adjacent nodes selected
for learning a vertex representation and d is the dimension of the node's attribute vector
at the input of each GraphHop unit stage. The DC projection is equivalent to mean
removal along the feature dimension. The input vector space S can be viewed as the
direct sum of the DC subspace S_{DC}, which is spanned by the DC anchor vector, and
the AC subspace S_{AC}, which is spanned by the AC anchor vectors. We can project x onto a_0 to
derive the DC component:
x_{DC} = x^T a_0 = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x_n. (5.5)
The AC part is obtained by subtracting the DC part from the input. Therefore, we have
x_{AC} = x - x_{DC}. (5.6)
The AC anchor vectors are derived from the data in a single pass, instead of using a
back-propagation model.
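A simplified one-stage sketch of the 1D Saab transform is given below. It removes the per-sample DC component, derives AC anchor vectors with PCA in a single pass, and adds a constant bias so that responses stay non-negative; the bias choice and the use of scikit-learn's PCA (which additionally centers each feature) are simplifications of the full construction in [48], not a faithful reproduction of it.

import numpy as np
from sklearn.decomposition import PCA

def saab_1d(X, num_ac_kernels):
    """Simplified single-stage 1D Saab transform on stacked vertex vectors X."""
    N = X.shape[1]
    dc = X.mean(axis=1, keepdims=True)           # per-sample mean
    X_ac = X - dc                                # AC part of each input vector
    pca = PCA(n_components=num_ac_kernels)       # AC anchor vectors from data statistics
    ac = pca.fit_transform(X_ac)                 # responses to the AC anchor vectors
    dc_resp = np.sqrt(N) * dc                    # x^T a0 with a0 = (1,...,1)/sqrt(N)
    feats = np.concatenate([dc_resp, ac], axis=1)
    bias = max(0.0, -feats.min())                # constant bias for non-negative responses
    return feats + bias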
Figure 5.3: Dimension reduction using SAAB transform: Adapted design for 2D data.
5.3.4 A Sequence of GraphHop units in cascade
The purpose of cascading multiple GraphHop units is to compute the attributes of near-
to-far neighbors of the target vertex through I GraphHop units. Since the dimension of the
attributes grows larger and larger in this step, it is critical to control the rapid growth
of dimensions using a suitable dimension reduction technique. A subspace approximation
technique, the Saab transform, is used to achieve this. The AC anchor vectors
in the 1D Saab transform step are obtained by applying principal component analysis
to the stacked vertex representations, yielding n AC anchor vectors of dimension (k + 1)d,
and thus (n + 1) orthogonal anchor vectors, each of dimension (k + 1)d, in total. To avoid
the sign confusion problem, the bias term is selected based on the positive response and
constant bias constraints. The convolution of each vertex representation of dimension
(k + 1)d with the (n + 1) anchor vectors of dimension (k + 1)d results in a new node
representation of dimension (n + 1), as depicted in Figure 5.3. The number of AC anchor
vectors (kernels), n, is a hyperparameter that can be chosen based on the energy to be
preserved while reducing the dimension.
5.3.5 Classification and Ensembles
For the classification task, features from multiple GraphHop units are concatenated to form
the final feature vector. To enhance the classification accuracy, we use an ensemble method
that fuses the classification outputs of multiple classifiers (for example, majority voting [69],
support vector machines (SVM) [86], etc.). In the final classification stage, after obtaining
the final feature vector, we use a random forest (RF) [40] classifier for the node
classification task. The random forest classifier trains a number of decision trees, each
decision tree gives an output, and the RF classifier ensembles the outputs from all
decision trees to give the final prediction.
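A minimal sketch of this final stage is given below; hop_features, labels, and the index arrays are hypothetical names for the concatenated GraphHop unit outputs and the train/test split, and the forest size is an illustrative choice.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def classify(hop_features, labels, train_idx, test_idx):
    """Concatenate the outputs of the GraphHop units and classify with a random forest."""
    X = np.concatenate(hop_features, axis=1)     # per-unit attribute matrices side by side
    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(X[train_idx], labels[train_idx])      # train on labeled vertices
    return rf.predict(X[test_idx])               # ensemble of decision trees votes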
5.4 Experiments
In this section, we evaluate our proposed GraphHop method on node classification tasks
and compare the results with prior state-of-the-art models. We also discuss how to
choose the hyper-parameters in Section 5.4.4. We compared the testing accuracy of our FF
design with neural network models based on BP, along with the runtime. We notice a
small drop of 1.3% to 3.17% in classification accuracy without BP optimization.
Overall, however, the experimental results show that our GraphHop method has classification
accuracy comparable to the most recent state-of-the-art models while greatly reducing the
training time. Our code is publicly available at https://github.com/xxxxxxx/GraphHop
(will be filled in after peer review).
5.4.1 Datasets
To evaluate the performance of the proposed GraphHop method, we used the following
three citation networks in the experiments. The experimental settings we used are the same as
those in GCN [46] and LGCN [25]. The classification tasks use the transductive learning
[109] setting, where the unlabeled testing data are available during training.
Citeseer: The Citeseer dataset is a citation indexing system that classifies aca-
demic literature into 6 categories [30]. This dataset consists of 3,312 scientific
publications and 4,723 citations.
Cora: The Cora dataset consists of 2,708 scientific publications classified into
seven classes [59]. This network has 5,429 links, and each publication is represented
by a 0/1-valued word vector indicating the absence/presence of the corresponding
word from a dictionary consisting of 1,433 unique words.
PubMed: PubMed [79] is a free-resource citation network that comprises 19,717
nodes and 44,338 edges, with nodes classified into 3 categories. The dimension of each
vertex feature is 500. PubMed citations and abstracts include the fields of biomedicine and health,
and cover portions of the life sciences, behavioral sciences, chemical sciences,
and bioengineering.
Dataset Features Training Validation Test
Cora 1433 140 500 1000
Citeseer 3703 120 500 1000
PubMed 500 604 500 1000
Table 5.1: Summary of the datasets and their training/validation/test splits
5.4.2 Experimental Settings
For the transductive learning step, we employ the proposed GraphHop units. The input
datasets are in a high-dimensional bag-of-words representation, and the dimension of the input
is reduced after going through our GraphHop units, which serve as graph embedding layers. The
output of each GraphHop unit is concatenated with the output from the previous layer.
Finally, an ensemble learning classifier is used to make predictions and output a label for
each node. The three datasets are split into training and testing sets, and we recorded the
highest and the average Micro-F1 scores for items in the testing set. The summary of
the dataset splits is shown in Table 5.1.
5.4.3 Baselines
We compared the performance of our GraphHop method with the following benchmarking
methods.
Text-associated DeepWalk (TADW) [102] and its variations [12] incorporate text
features of vertices under the matrix factorization framework for vertex classification.
Graph-based LSTM (G-LSTM). It first builds a simple tree using BFS-only
traversal and then applies an LSTM to the tree for vertex classification [96] [11].
Graph Convolutional Networks (GCN) [46]. The original GCN algorithm is
adapted from CNNs and designed for semi-supervised learning in a transductive
setting. This method uses a convolutional architecture based on a localized
first-order approximation of spectral graph convolutions.
Large-Scale Learnable Graph Convolutional Networks [25]. This method proposes
the learnable graph convolutional layer (LGCL), which automatically selects a fixed
number of neighboring nodes for each feature based on value ranking to transform
graph data into a grid-like 1-D format. It then uses regular convolutional operations
on the grid. To embed large-scale graphs, this method utilizes a sub-graph
selection method to reduce the memory and resource requirements.
5.4.4 Hyper Parameter Tuning
In the experiments, we tuned three major hyper-parameters. The first one is the choice
of dimension for the pre-processed attributes. We use PCA to reduce the dimension of
the pre-processed attribute vector of each node, and the choice of target dimension
impacts the performance of our output. We also plotted the classification accuracy
versus the number of dimensions for the 1st-hop attribute vector, as shown in Figure
5.4, to guide our decision. For Cora, when k is 3, the best classification accuracy is
achieved at dimension 27; when k is 4, the best dimension is 19; when k is 8, the best
dimension is 79; and when k is 10, the best dimension is 27, as the figure shows.
The second parameter we tuned is k in the top-k neighborhood building. Since some
graphs are sparser than others and each graph has its own node degree distribution, some
values of k perform better than others. The average degree of nodes in a graph can help
us choose the hyper-parameter k. The average node degrees for Cora, Citeseer and PubMed
are 4, 5 and 6, respectively. We conducted experiments with settings of k around the average
degree of nodes to find the best k for all data sets. The last parameter we tuned is
the number of AC kernels. We plotted the energy preserved versus the number of kernels
used; by keeping 80% of the energy, we usually obtain good performance. The final choice
of hyper-parameters is shown in Table 5.2.
Figure 5.4: The classification accuracy versus dimension for different numbers of k
Data Dimension k kernel number
Cora 25 4 23
Citeseer 25 6 17
PubMed 25 6 25
Table 5.2: Hyper-parameter tuning results
Models Cora Citeseer Pubmed
TADW 77.43% 71.34% N/A
G-LSTM 80.43% 68.54% N/A
GCN 81.5% 70.3% 79%
LGCN 82.8% 72.6% 79.4%
GraphHop (Ours) 81.7% 70.3% 78.2%
Table 5.3: Results for node classification accuracies on Cora, Citeseer, and PubMed
datasets
5.5 Results and Analysis
The classification accuracy and run-time experiments conducted on three popular citation
data sets are reported in Sections 5.5.1 and 5.5.2. From the results, we can see that
our method achieves close to state-of-the-art classification accuracy with a great reduction
in training time. The results also demonstrate that the proposed local-to-global
neighborhood building method is effective. These experimental results show the
effectiveness of applying GraphHop to graph data embedding.
5.5.1 Classification Performance
The testing accuracy of our FF GraphHop design is summarized in Table 5.3 for the three
citation data sets. According to the results, our method is very close to the other
state-of-the-art methods in terms of classification accuracy.
As shown in Table 5.3 and Figures 5.5 and 5.6, the proposed GraphHop model
consistently yields good performance in node classification tasks. These results demonstrate
the effectiveness of applying GraphHop on transformed graph data. In addition,
the proposed local-to-global neighborhood building method is also shown to be effective.
Figure 5.5: Classification accuracy for Cora
5.5.2 Run-time Analysis
For the experiments above, we used the 1-D Saab transformation in each GraphHop unit to
learn the features of the graph without using back propagation. BP is an iterative optimization
process and typically demands tens or hundreds of epochs to converge. The
real CPU run-time results are shown in Table 5.4: the proposed GraphHop model
consistently runs faster in node classification tasks than the other state-of-the-art methods,
even though LGCN has slightly better classification accuracy. Our FF GraphHop design is
significantly faster than the BP designs based on the experimental results. Through the
experiments, the CPU run-time shows that GraphHop is much faster than GCN and
other machine learning methods using BP, with negligible loss in model performance.
According to the results above, at the risk of a slight performance loss, we point out the
great advantage of the GraphHop method in terms of training speed. It can be seen from
the results in Table 5.4 that the improvement is substantial, and this method will be very
powerful in practice for large-scale graph data. As a result, our GraphHop method is not
only effective but also efficient.
Figure 5.6: Classification accuracy for Citeseer
5.5.3 Properties of Saab Filters
Figure 5.7: The log energy plot and preservation of energy as a function of AC filters
Models Cora Citeseer Pubmed
TADW 22.29 50.86 62.62
G-LSTM 348.52 714.85 1229.84
GCN 32.12 47.45 60.30
LGCN 2867.27 3026.49 12509.09
GraphHop (Ours) 15.99 15.24 37.78
Table 5.4: Results for node classification run-time on Cora, Citeseer, and PubMed data-
sets in seconds
In the Saab transformation, the signal space is decomposed into two subspaces: the
DC and AC subspaces. The terms are borrowed from circuit theory, meaning "direct
current" and "alternating current". The DC subspace is obtained using the DC filter, which
is a normalized constant-element vector. The AC filters are obtained by applying PCA to
the AC subspace. In Figure 5.7, we examine the relationship between the number of AC
filters and the energy preserved in the subspace approximation for Cora. The leading
AC filters capture the majority of the energy. As shown in the figure, the energy per
component drops dramatically as the number of AC components becomes larger. As we
use more kernels, the drop becomes less rapid, so increasingly many AC kernels are needed
to preserve the same additional amount of energy. This relationship suggests that in the
GraphHop design, we can use a higher energy ratio in the early GraphHop units and a lower
energy ratio in the later ones. More kernels result in higher complexity, so we can use this
property of the Saab design to balance classification performance and complexity.
5.5.4 BP and FF Design Comparison
The BP design for graph embedding is usually determined by three major factors: 1) the
neighborhood building method for each node; 2) a selected network architecture; and
3) a cost function at the output end for optimization. Contributions can be made either
in the neighborhood building stage (for example, G-LSTM uses a tree-structured sampling
method and DeepWalk uses random walks), in a novel network architecture, and/or in a new
cost function. In comparison, our FF design relies on data statistics to determine a
sequence of spectral transformations. BP designs rely on stochastic gradient descent
(SGD) [8], while our method is based on linear algebra. The GraphHop units utilize PCA
to generate discriminant dimensions and for dimension reduction. The SSL methodology
used in our GraphHop design is fundamentally different from DL models. Even though they
both collect attributes by successively growing neighborhoods and both trade spatial-
domain patterns for spectral components using convolutional filters, they are drastically
different in terms of training processes and training complexities. Deep learning methods
rely heavily on model parameter learning. After a fixed neural network model is
chosen, DL utilizes its ability to train a large number of model parameters to achieve
its superior performance. Sometimes the network is over-parameterized, which can lead to
over-fitting. Our SSL method is a non-parametric model. We can add or delete
filters depending on the size of the input data. It is a more flexible model and much less
hardware demanding. Moreover, DL models are hard to interpret; they are essentially
black-box tools whose properties are not well understood. DL determines
its model parameters with BP as an end-to-end optimization approach. Our SSL model
is mathematically transparent. It adopts unsupervised and supervised dimension reduction
techniques to project onto an effective subspace for feature extraction in a one-pass
feed-forward fashion. In terms of training and testing complexity, DL methods require
a great amount of computational resources because of the usage of BP. The deeper the
layers, the higher the computational complexity of the DL architecture. The training
complexity of SSL models is much lower, since they use a one-pass feed-forward design.
Therefore, training and testing can be done much more efficiently. In DL networks, it is
possible to find a path from the output decision space to the input data space, and this
leads to the possibility of adversarial attacks: a decision outcome can be changed by adding
small perturbations to the input data. This is a major weakness of DL networks. [112] has
shown that the accuracy of node classification drops significantly after only a few
perturbations. In SSL, weak perturbations can be readily filtered out by PCA, making it
more difficult for such attacks to take place.
5.6 Conclusion and Future Work
In this work, an interpretable FF machine learning method called GraphHop was proposed
for vertex classification in language graphs. This architecture builds neighborhood
attributes for each node through one-hop information aggregation. We also used a 1-D
Saab transform to reduce the vertex attribute dimension in each GraphHop unit. In
the classification stage, we used ensemble methods to predict node labels. Experimental
results on three citation data sets demonstrated the effectiveness of the proposed GraphHop
method. The GraphHop method offers classification accuracy that is close to the state-of-
the-art with a great reduction in training time. We compared the real CPU run
times and showed that our algorithm is not only accurate but also very efficient. Based
on this work, we discuss several possible directions for future work. First, we mainly
addressed the node classification problem in this work. We would like to apply the
GraphHop method to other interesting applications such as link prediction or graph
classification. Second, in the near future, we would like to apply the proposed methodology
to graphs of a larger scale and higher diversity, such as social network data or other text
data in larger corpora. The FF machine learning methodology for natural language
processing is still at an early exploratory stage. Nowadays, machine learning methods
have been focused on better and better performance with more and more complicated
network architectures using deep neural networks. Providing an interpretable design as an
alternative to advanced neural network architectures can shed light on the fundamental
principles of current machine learning research. With the new developments and preliminary
experimental results, we hope our work can inspire more follow-up work along
this direction.
Chapter 6
Conclusion and Future Work
6.1 Summary of the Research
In this dissertation, we focus on graph representation learning using three methods:
deep learning with recursive neural networks, PCA on graphs with a DeepWalk-assisted
matrix factorization method, and the GraphHop method for vertex classification.
We first introduced graph representation learning and some current state-of-the-art
methods, which need to satisfy two requirements: they should generate an accurate
representation of the original data, and the representation should have a reduced dimension
with relatively fast training time. We provided a survey that reviews different categories of
graph representation learning algorithms and ran the experiments under uniform settings. We
summarized the evaluation methods as well as the emerging research and application
directions for graph embeddings. For the deep learning model with recursive neural networks,
we first introduced a deep-tree generation (DTG) algorithm that can capture the
neighborhood information of a node better than the traditional breadth-first-search tree
generation method; we then apply a recursive neural network to each node on the tree to
train a representation vector using semi-supervised learning. We also trained graph data in
the DTRNN by adding more complex attention models to see if they generate better accuracy.
However, our DTRNN algorithm alone already captures more features of each
node, and adding attention layers does not improve the classification accuracy. The
complexity of the proposed DTRNN was also analyzed. We considered both the asymptotic
run time and the real CPU run time and showed that our algorithm is not only the most
accurate but also efficient.
To obtain a more time-efficient graph embedding method, we also studied PCA
on graphs with a DeepWalk-assisted matrix factorization method. We apply Graph PCA
to graph data processed by text-based DeepWalk. The proposed DGPCA method can
capture the neighborhood information of a node well and decrease noise in the original
data. The representation learning for language network data used in DGPCA is very
close to the state-of-the-art methods in terms of accuracy, and much faster compared to
deep learning methods. The effectiveness of the DGPCA method was demonstrated by
experimental results on three language data sets with different training ratios.
In the last work, an interpretable FF machine learning method called GraphHop was
proposed for vertex classification in language graphs. GraphHop builds neighborhood
attributes for each node through one-hop information aggregation and uses a 1-D Saab
transform to reduce the vertex attribute dimension. In the classification stage, we used
ensemble methods to predict node labels. Experimental results on three citation data
sets demonstrated the effectiveness of the proposed GraphHop method. The GraphHop
method offers classification accuracy that is close to the state-of-the-art with a great
reduction in training time, as confirmed by the real CPU run-time comparison.
6.2 Future Research Directions
Machine learning (ML) techniques are now ubiquitous. However, they usually require
large amounts of training data and are hardware demanding. In this general direction,
we bring up the following research problems:
Can we apply the proposed methodology to networks of a larger scale and higher
diversity, such as social network data? While our proposed method is highly scalable
in theory, there is still significant work to be done in embedding massive data sets
with billions of nodes and edges.
How can the SSL model be extended to other applications such as GANs, link prediction,
and whole-graph classification? Can the classification accuracy of GraphHop be
further improved?
6.2.1 Scalable Graph Embedding for Large Graphs
Designing a multi-level graph embedding method that can scale to large graphs
with millions of nodes under low computational complexity and memory requirements
[54] is an important task.
Hierarchical Representation Learning for Networks (HARP) [13] proposes a hierarchical
scheme for graph embedding based on iterative learning methods. However, HARP
focuses on improving the quality of embeddings by using the learned embeddings from
the previous level as the initialized embeddings for the next level, which leads to great
computational overhead. Graph partition neural networks (GPNN) [55] are able to handle
extremely large graphs. GPNNs alternate between locally propagating information
between nodes and globally propagating information between the subgraphs. GPNNs
can achieve similar performance to standard GNNs with fewer propagation steps, but the
partition and propagation steps are time consuming. In the future, we would like to focus
on designing a general-purpose graph embedding framework that is highly scalable.
We explore the multi-level framework for scalable graph embedding (MILE) proposed
in [54]. This framework first repeatedly coarsens the original graph into smaller
graphs using a hybrid strategy based on Structural Equivalence Matching (SEM)
and normalized heavy edge matching (NHEM) [44]. For SEM, given two vertices v
and u in an unweighted graph, the vertices are structurally equivalent if they are incident
on the same set of neighbors. The NHEM method is a widely used matching
method for graph coarsening. For an unmatched node u, its heavy edge matching is a
pair of vertices (u, v) where the weight of the edge between them is the largest. The
edge weights are normalized when applying heavy edge matching using the formula in
Equation 6.1: the weight of an edge is normalized by the degrees of the two vertices on
which the edge is incident. Therefore, it penalizes the weights of edges connected to
high-degree nodes. In Figure 6.1, node B is equally likely to be matched with node A
or C without edge weight normalization. With normalization, node B will be matched
with C, which is a more accurate second-order matching since B is structurally similar
to C. In Equation 6.1, (u, v) is a pair of vertices, A_i is the adjacency matrix, and D_i is
the degree matrix; the heavy edge matching is the pair of vertices (u, v) for which the
weight of the edge between u and v is the largest.
W_i(u, v) = \frac{A_i(u, v)}{\sqrt{D_i(u, u) D_i(v, v)}} (6.1)

A_{i+1} = M_{i,i+1}^T A_i M_{i,i+1} (6.2)
A toy example illustrating graph coarsening is given in [75]. Figure 6.1 shows the process
of applying Structural Equivalence Matching (SEM) and Normalized Heavy Edge Matching
(NHEM) for graph coarsening. Figure 6.2 shows the adjacency matrix A_0 of the input graph,
the matching matrix M_{0,1} corresponding to the SEM and NHEM matchings, and the
derivation of the adjacency matrix A_1 of the coarsened graph using Equation 6.2.
The adjacency matrix A_{i+1} is found through matrix operations. The matching matrix
storing the matching information from graph G_i to G_{i+1} is a binary matrix
M_{i,i+1} \in \{0, 1\}^{|V_i| \times |V_{i+1}|}. An entry is set to 1 if the corresponding node is collapsed
into the corresponding super-node, and is set to 0 otherwise.
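A minimal sketch of the edge-weight normalization in Equation 6.1 and the coarsening step in Equation 6.2 is given below; the matching itself (the SEM/NHEM selection) is assumed to be given as a vertex-to-super-node assignment, and isolated vertices are assumed absent.

import numpy as np

def nhem_weights(A):
    """Normalized edge weights W(u, v) = A(u, v) / sqrt(D(u, u) D(v, v))."""
    d = A.sum(axis=1)                            # vertex degrees (assumed nonzero)
    return A / np.sqrt(np.outer(d, d))

def coarsen(A, matching):
    """Collapse matched vertex groups into super-nodes via A_{i+1} = M^T A_i M,
    where `matching[v]` gives the super-node index of vertex v."""
    n_coarse = max(matching) + 1
    M = np.zeros((A.shape[0], n_coarse))
    for v, s in enumerate(matching):
        M[v, s] = 1.0                            # binary matching matrix
    return M.T @ A @ M

# Toy usage: 4 vertices collapsed into 2 super-nodes.
A0 = np.array([[0, 1, 1, 0],
               [1, 0, 1, 1],
               [1, 1, 0, 1],
               [0, 1, 1, 0]], dtype=float)
print(nhem_weights(A0))
print(coarsen(A0, matching=[0, 0, 1, 1]))        # adjacency of the coarsened graph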
Figure 6.1: Using SEM and NHEM for graph coarsening [54]
The last step in MILE is to use a graph-based neural network model to perform
embedding refinement using the cost function in Equation 6.3, where V_m is the vertex set
and E_m denotes the embeddings of the nodes in G_m:
L = \frac{1}{|V_m|} \left\| E_m - H^{(l)}(E_m, A_m) \right\|^2. (6.3)
Following this idea, we plan to apply post-processing via variance normalization
(PVN) and post-processing via dynamic embedding (PDE) to normalize the variances of
the principal components of the graph vectors and to learn the orthogonal latent variables in them.
Figure 6.2: Adjacency matrix and matching matrix [54]
We have already evaluated the performance of PVN and PDE on: 1) word similarity and
2) word analogy. Our proposed post-processing methods work well in both evaluation
tasks.
Word similarity evaluation is widely used in evaluating word embedding quality. It
focuses on the semantic meaning of words. The corresponding datasets are composed of
word pairs and human-labeled similarity scores (e.g., a score from 0 to 10 indicating
the closeness of a word pair in meaning). Here, we use the cosine distance measure,
CosSim(v_1, v_2) = v_1^T v_2 / (\|v_1\| \|v_2\|), and Spearman's rank correlation coefficient
(SRCC) [83] to measure the distance and to evaluate the similarity between our results and
the human scores, respectively. Tests are conducted on 13 popular datasets (see Table 6.3).
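A minimal sketch of this evaluation protocol is given below; the embeddings dictionary and the word-pair list are hypothetical inputs for illustration.

import numpy as np
from scipy.stats import spearmanr

def cos_sim(v1, v2):
    """CosSim(v1, v2) = v1^T v2 / (||v1|| ||v2||)."""
    return float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def word_similarity_srcc(embeddings, word_pairs, human_scores):
    """SRCC between model similarities and human-labeled similarity scores.
    `embeddings` is a hypothetical dict mapping a word to its vector."""
    model_scores = [cos_sim(embeddings[w1], embeddings[w2]) for w1, w2 in word_pairs]
    rho, _ = spearmanr(model_scores, human_scores)
    return rho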
The performance of the PVN as a post-processing tool for the SGNS and the GloVe
baseline methods is given in Tables 6.2 and 6.4, respectively. The tables also show
results of the baselines and the baseline+PPA [66] for performance benchmarking.
Table 6.1 compares the SRCC scores of the SGNS alone, the SGNS+PPA and the
SGNS+PVN against word similarity datasets. We see that the SGNS+PVN performs
dimension:d 5 7 9 11
Type SGNS PPA PVN(ours) PPA PVN(ours) PPA PVN(ours) PPA PVN(ours)
WS-353 65.7 67.7 67.2 67.7 67.5 68.5 67.9 67.6 68.1
WS-353-SIM 73.2 73.6 73.4 73.8 73.8 74.2 73.9 73.8 73.9
WS-353-REL 58.1 59.7 59.8 60.2 60.1 60.7 60.7 59.4 60.7
MC-30 72.2 74.3 71.8 77.6 73.3 76.7 73.5 80.5 74.4
RG-65 73.1 70.9 71.6 71.2 71.8 71.4 72.0 70.9 71.7
Rare-Word 39.5 42.4 42.5 42.5 42.8 42.0 42.8 42.4 42.9
MEN 70.2 73.1 73.3 72.8 73.4 72.7 73.4 72.5 73.2
MTurk-287 62.8 66.5 67.4 65.7 66.9 65.1 66.8 64.7 66.4
MTurk-771 64.6 67.5 67.3 66.5 67.2 66.2 67.0 66.2 66.8
YP-130 39.5 44.5 44.0 43.5 44.3 41.9 43.4 42.3 43.0
SimLex-999 41.6 42.9 43.0 42.8 43.0 42.6 42.9 42.6 42.8
Verb-143 35.0 40.2 39.7 40.1 39.9 39.2 39.6 38.9 39.5
SimVerb-3500 26.5 28.5 28.5 28.7 28.7 28.1 28.6 28.1 28.5
Average 47.8 50.2 50.3 50.1 50.5 49.8 50.4 49.8 50.3
Table 6.1: The SRCC performance comparison (×100) for the SGNS alone, the
SGNS+PPA and the SGNS+PVN against word similarity datasets, where the last row is
the average performance weighted by the pair number of each dataset.
dimension:d 5 7 9 11
Type: SGNS PPA PVN(ours) PPA PVN(ours) PPA PVN(ours) PPA PVN(ours)
Google
Add 59.6 61.0 61.6 61.3 61.7 61.2 62.0 61.3 62.1
Mul 61.2 60.2 61.8 60.1 61.8 60.1 61.9 60.3 61.9
Semantic
Add 57.8 60.9 60.6 61.4 61.3 61.5 61.9 62.4 62.4
Mul 59.3 58.5 59.8 58.6 60.1 58.7 60.5 59.5 60.9
Syntactic
Add 61.1 61.1 62.3 61.2 62.0 60.9 62.0 60.5 61.8
Mul 62.7 61.7 63.4 61.4 63.2 61.2 63.0 61.0 62.7
MSR
Add 51.0 52.7 53.2 52.8 53.2 52.9 53.4 53.0 53.4
Mul 53.3 53.4 54.5 53.5 54.6 53.4 54.8 53.3 54.9
Table 6.2: The SRCC performance comparison (×100) for the SGNS alone, the
SGNS+PPA and the SGNS+PVN against word analogy datasets.
better than the SGNS+PPA in the average SRCC scores regardless of the dimension
threshold d.
Table 6.2 compares the SRCC scores of the SGNS, the SGNS+PPA and the
SGNS+PVN against word analogy datasets. We use both the addition and the multiplication
evaluation methods. The PVN performs better than the PPA in both. For the
multiplication evaluation, the performance of the PPA is even worse than the baseline.
In contrast, the proposed PVN method has no negative effect; it performs consistently
well. This can be explained as follows.
Name Pairs Year
WS-353 353 2002
WS-353-SIM 203 2009
WS-353-REL 252 2009
MC-30 30 1991
RG-65 65 1965
Rare-Word 2034 2013
MEN 3000 2012
MTurk-287 287 2011
MTurk-771 771 2012
YP-130 130 2006
SimLex-999 999 2014
Verb-143 143 2014
SimVerb-3500 3500 2016
Table 6.3: Word similarity datasets used in our experiments, where pairs indicate the
number of word pairs in each dataset.
When the multiplication evaluation is adopted, the missing dimensions of the PPA
strongly influence the relative angles of the vectors. This is further verified by the fact
that some linguistic properties are captured by these high-variance dimensions, so their
total elimination is sub-optimal.
The variances of the top 30 principal components of the SGNS and GloVe embeddings
with and without PVN are shown in Figs. 6.3 (a) and (b), respectively. It is evident
from Fig. 6.3 (a) that several top principal components of the SGNS occupy too much
energy. We can infer the threshold d from this figure; that is, we can set d = 11, since
increasing d further brings little improvement. Compared with SGNS, the variance curve
of the GloVe embedding is steeper, with a higher decreasing rate. Since there is a bump
at around the 11th dimension, we also set d = 11 for GloVe. Evaluation results on
four word similarity and four word analogy datasets are shown in Table 6.4. Our PVN
method offers better results than the PPA. This is especially true on word analogy tasks.
The final word representation is composed of two parts: \tilde{v}(w) = [v_s(w)^T, v_d(w)^T]^T,
where v_s(w) is the static part obtained from dimension reduction using PCA and
v_d(w) = A^T v(w) is the projection of v(w) onto the dynamic subspace A.
dimension:d 11
Type GloVe PPA PVN(ours)
WS-353 60.9 65.3 65.6
WS-353-SIM 66.4 70.2 70.2
WS-353-REL 57.3 60.7 62.8
Rare-Word 41.2 43.3 44.4
Google
Add 71.7 70.7 71.8
Mul 72.7 71.5 73.1
Semantic
Add 77.4 76.8 78.1
Mul 77.4 76.8 78.3
Syntactic
Add 67.0 65.6 66.6
Mul 68.8 67.1 68.8
MSR
Add 64.2 61.5 63.6
Mul 66.1 63.2 66.3
Table 6.4: The SRCC performance comparison (×100) for the GloVe alone, the
GloVe+PPA and the GloVe+PVN against word similarity and analogy datasets.
Here, we set the dimensions of v_s(w) and v_d(w) to 240 and 60, respectively. The SRCC performance
comparison of the SGNS alone and the SGNS+PDE against the word similarity and
analogy datasets is shown in Tables 6.5 and 6.6, respectively. By adding the ordered
information via PDE, we see that the quality of word representations is improved in both
evaluation tasks.
We can integrate PVN and PDE to improve their individual performance.
Since the PVN provides a better word embedding, it can help PDE learn better. Also,
the PPA-PCA-PPA pipeline was shown to be effective for dimensionality reduction of
word representations in [71]. Intuitively, the PVN-PCA-PVN pipeline should have
comparable or even better performance. Furthermore, normalizing the variances of the
dominant principal components is beneficial since they occupy too much energy and mask
the contributions of the remaining components. On the other hand, components with very
low variances may contain much noise. They should be removed or replaced, and the PDE
can be used to remove the noisy components.
Type SGNS PDE(ours)
WS-353 65.7 65.9
WS-353-SIM 73.2 73.6
WS-353-REL 58.1 59.3
MC-30 72.2 75.7
RG-65 73.1 76.0
Rare-Word 39.5 38.6
MEN 70.2 72.4
MTurk-287 62.8 63.5
MTurk-771 64.6 65.1
YP-130 39.5 40.8
SimLex-999 41.6 41.8
Verb-143 35.0 37.5
SimVerb-3500 26.5 26.6
Average 47.8 48.5
Table 6.5: The SRCC performance comparison (×100) for the SGNS alone and the
SGNS+PDE against word similarity datasets, where the last row is the average
performance weighted by the pair number of each dataset.
Type SGNS PDE(ours)
Google
Add 59.6 60.8
Mul 61.2 61.3
Semantic
Add 57.8 59.6
Mul 59.3 59.6
Syntactic
Add 61.1 61.8
Mul 62.7 62.7
MSR
Add 51.0 51.6
Mul 53.3 52.7
Table 6.6: The SRCC performance comparison (×100) for the SGNS alone and the
SGNS+PDE against word analogy datasets.
The integrated PVN/PDE post-processing system is shown in Fig. 6.4. The SRCC
performances of the baseline SGNS method and the SGNS+PVN/PDE method for the
word similarity and the word analogy tasks are listed in Table 6.7 and Table 6.8, respec-
tively. For word similarity, the average improvement is 5.4%. The improvement is even
over 20% for the Verb-143 dataset. The performance for word analogy evaluation also
Type SGNS Combined(ours)
WS-353 65.7 69.0
WS-353-SIM 73.2 75.3
WS-353-REL 58.1 61.9
MC-30 72.2 74.3
RG-65 73.1 74.3
Rare-Word 39.5 42.5
MEN 70.2 73.9
MTurk-287 62.8 67.1
MTurk-771 64.6 67.2
YP-130 39.5 44.2
SimLex-999 41.6 42.8
Verb-143 35.0 44.1
SimVerb-3500 26.5 27.5
Average 47.8 50.4
Table 6.7: The SRCC performance comparison (×100) for the SGNS alone and the
SGNS+PVN/PDE model against word similarity datasets, where the last row is the
average performance weighted by the pair number of each dataset.
Type SGNS Combined(ours)
Google
Add 59.6 62.8
Mul 61.2 62.1
Semantic
Add 57.8 62.8
Mul 59.3 60.9
Syntactic
Add 61.1 62.8
Mul 62.7 63.1
MSR
Add 51.0 53.7
Mul 53.3 54.3
Table 6.8: The SRCC performance comparison (×100) of the SGNS alone and the
SGNS+PVN/PDE combined model against word analogy datasets.
improves in all cases using either addition or multiplication. The improvement over the
Verb-143 dataset has a high ranking among all datasets with either joint PVN/PDE or
PDE alone.
6.2.2 Interpretable Machine Learning Methods for Graph Embedding
Graph embedding using convolutional neural networks (CNNs) is usually trained with
backpropagation (BP) to find the model parameters. This kind of method, while it yields
good performance from time to time, is mostly a black box and not fully interpretable.
The interpretable feedforward (FF) design without any BP [47] adopts a data-centric
approach that determines the network parameters of the current layer based on data
statistics from the output of the previous layer in a one-pass manner. The Saab (subspace
approximation with adjusted bias) transform, a variant of principal component analysis
(PCA) with an added bias vector to annihilate activation nonlinearity, can be used to build
multiple convolutional layers. Following this idea, we propose to explore more feed-forward
approaches for graph representation learning. The network or graph input can be viewed as
high-dimensional vectors. The desired output is a set of low-dimensional vectors, where each
column is a single vector representing a node. A sequence of vector space transformations
is needed to map the input space to the output space, and our task is to find this
transformation using an interpretable approach.
Deep learning on graphs with graph convolutional networks (GCNs) is designed
to work directly on graphs and leverage their structural information. GCN aggregates
information from the previous layers and produces useful feature representations of
nodes in graphs. In each layer of a GCN, the convolution layers offer a sequence of
spatial-spectral filtering, and the resolution becomes coarser along the process. In our
GraphHop architecture, we have already demonstrated the advantages of using an FF design
without BP. It is a good alternative to GCN. In the future, we would like to avoid the BP
training procedure for filter weights in convolutional layers. Instead, anchor vectors will
be derived from the statistics of the input data. The covariance matrix of the input vectors
can be computed, and its eigenvectors can be used as the desired anchor vectors. Principal
component analysis (PCA) can be used to find a subspace and determine the anchor vector
set accordingly. This is a data-centric approach, different from the traditional BP
approach, which is built by optimizing a cost function. It will be useful to extend this FF
approach to other tasks and bigger networks. We can also try to improve the classification
accuracy of our GraphHop design using label-assisted regression (LAG). We can
try to partition our input data into different clusters to generate pseudo-classes. These
pseudo-classes account for intra-class variations. Then least-squares regression (LSR)
can be used to map the input space to an output label. To achieve this, the first step
would be to cluster the input data to create object-oriented subspaces and find the centroid
of each subspace. This can be done using k-means clustering to partition the input samples.
In the second step, a soft association can be used to assign each target output a probability
vector. This is done by measuring the Euclidean distance of the attribute vector
to each centroid obtained in the first step. In the last step, a linear LSR
problem can be solved using the vectors obtained from the previous steps.
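A rough sketch of this label-assisted regression idea is given below, under the assumptions just described; the exponentiated negative distance used for the soft association and the cluster count are illustrative choices, not the definitive LAG formulation.

import numpy as np
from sklearn.cluster import KMeans

def label_assisted_regression(X_train, y_onehot, X_test, n_pseudo=8):
    """Cluster the input into pseudo-classes, form a soft association from the
    distances to the centroids, and solve a least-squares mapping to the labels."""
    km = KMeans(n_clusters=n_pseudo, n_init=10).fit(X_train)

    def soft_assoc(X):
        dist = np.linalg.norm(X[:, None, :] - km.cluster_centers_[None], axis=2)
        weights = np.exp(-dist)                      # closer centroid -> larger weight
        return weights / weights.sum(axis=1, keepdims=True)

    A_train = soft_assoc(X_train)                    # soft association vectors
    W, *_ = np.linalg.lstsq(A_train, y_onehot, rcond=None)  # least-squares regression
    return soft_assoc(X_test) @ W                    # predicted label scores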
We would also like to apply GraphHop-based methods to other interesting applications
such as link prediction or graph classification. In addition, in the near future, we
would like to apply the proposed methodology to graphs of a larger scale and higher
diversity, such as social network data or other text data in larger corpora. The FF
machine learning methodology for natural language processing is still at an early
exploratory stage. Nowadays, machine learning methods have been focused on better and
better performance with more and more complicated network architectures using deep
neural networks. Providing an interpretable design as an alternative to advanced neural
network architectures can shed light on the fundamental principles of current machine
learning research. With the new developments and preliminary experimental results, we
hope our work can inspire more follow-up work along this direction. SSL is still in its
infancy, and we would like to extend it to other applications as well. One possible
direction is to develop an SSL-based generative adversarial network (GAN).
In addition, dynamic graphs are a promising research direction for graph representation
learning. For example, the social graph on Twitter is constantly changing. Embedding
the additional spatial and temporal information in the graph will provide more insight
into how graphs evolve. For the social network graphs we have studied, more meta-information
can be added and embedded into the vector space to benefit more real-time and
interactive applications.
Research directions. The directions worth exploring in the future include:
Apply graph coarsening to further reduce the dimension of graphs for more scalable
graph embedding.
Use PVN and PDE to improve the quality of graph embeddings after the coarsening
step for better accuracy.
Embed spatial and temporal data as additional features into social graphs. Analyze
how graph embeddings change with additional meta data and whether the
changes reflect actual user profile changes.
Apply the GraphHop method to graphs of larger scale and diversity, especially social
media graphs. Develop an SSL-based generative model.
Figure 6.3: Comparison of variances of the top 30 principal components for SGNS and
GloVe baseline embeddings with and without PVN. (a) SGNS versus SGNS+PVN;
(b) GloVe versus GloVe+PVN.
Figure 6.4: A post-processing system for word embedding methods using integrated
PVN and PDE.
Bibliography
[1] R. Angles and C. Gutierrez. Survey of graph database models. ACM Computing
Surveys (CSUR), 40(1):1, 2008.
[2] R. Angles and C. Gutierrez. Survey of graph database models. ACM Computing
Surveys (CSUR), 40(1):1, 2008.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for
embedding and clustering. In Advances in neural information processing sys-
tems, pages 585–591, 2002.
[4] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social
networks. In Social network data analytics, pages 115–148. Springer, 2011.
[5] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social
networks. In Social network data analytics, pages 115–148. Springer, 2011.
[6] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hell-
mann. Dbpedia-a crystallization point for the web of data. Web Semantics: sci-
ence, services and agents on the world wide web, 7(3):154–165, 2009.
[7] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a col-
laboratively created graph database for structuring human knowledge. In Pro-
ceedings of the 2008 ACM SIGMOD international conference on Management of
data, pages 1247–1250. AcM, 2008.
[8] L. Bottou. Large-scale machine learning with stochastic gradient descent. In
Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
[9] H. Cai, V . W. Zheng, and K. C.-C. Chang. A comprehensive survey of graph
embedding: Problems, techniques, and applications. IEEE Transactions on
Knowledge and Data Engineering, 30(9):1616–1637, 2018.
[10] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis?
Journal of the ACM (JACM), 58(3):11, 2011.
[11] F. Chen, B. Wang, and C.-C. J. Kuo. Graph-based deep-tree recursive neural net-
work (dtrnn) for text classification. In 2018 IEEE Spoken Language Technology
Workshop (SLT), pages 743–749. IEEE, 2018.
[12] F. Chen, B. Wang, and C.-C. J. Kuo. Deepwalk-assisted graph pca (dgpca) for
language networks. In ICASSP 2019-2019 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), pages 2957–2961. IEEE,
2019.
[13] H. Chen, B. Perozzi, Y. Hu, and S. Skiena. Harp: hierarchical representation
learning for networks. arXiv preprint arXiv:1706.07845, 2017.
[14] Y. Chen and C.-C. J. Kuo. Pixelhop: A successive subspace learning (ssl) method
for object classification. arXiv preprint arXiv:1909.08190, 2019.
[15] Y. Chen and C.-C. J. Kuo. Pixelhop: A successive subspace learning (ssl) method
for object recognition. Journal of Visual Communication and Image Representation,
page 102749, 2020.
[16] M. Craven, A. McCallum, D. PiPasquo, T. Mitchell, and D. Freitag. Learning to
extract symbolic knowledge from the world wide web. 1998.
[17] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks
on graphs with fast localized spectral filtering. In Advances in Neural Information
Processing Systems, pages 3844–3852, 2016.
[18] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks
on graphs with fast localized spectral filtering. In Advances in neural information
processing systems, pages 3844–3852, 2016.
[19] D. DeMers and G. W. Cottrell. Non-linear dimensionality reduction. In Advances
in neural information processing systems, pages 580–587, 1993.
[20] C. H. Ding, X. He, H. Zha, M. Gu, and H. D. Simon. A min-max cut algorithm
for graph partitioning and data clustering. In Data Mining, 2001. ICDM 2001,
Proceedings IEEE International Conference on, pages 107–114. IEEE, 2001.
[21] C. H. Ding, X. He, H. Zha, M. Gu, and H. D. Simon. A min-max cut algorithm for
graph partitioning and data clustering. In Proceedings 2001 IEEE International
Conference on Data Mining, pages 107–114. IEEE, 2001.
[22] S. Dreiseitl and L. Ohno-Machado. Logistic regression and artificial neural net-
work classification models: a methodology review. Journal of biomedical infor-
matics, 35(5-6):352–359, 2002.
[23] M. D. Ekstrand, J. T. Riedl, J. A. Konstan, et al. Collaborative filtering recommender
systems. Foundations and Trends in Human–Computer Interaction,
4(2):81–173, 2011.
[24] D. A. Field. Laplacian smoothing and delaunay triangulations. Communications
in applied numerical methods, 4(6):709–712, 1988.
[25] H. Gao, Z. Wang, and S. Ji. Large-scale learnable graph convolutional networks.
In Proceedings of the 24th ACM SIGKDD International Conference on Knowl-
edge Discovery & Data Mining, pages 1416–1424. ACM, 2018.
[26] S. Gao, L. Denoyer, and P. Gallinari. Temporal link prediction by integrating
content and structure information. In Proceedings of the 20th ACM interna-
tional conference on Information and knowledge management, pages 1169–1174.
ACM, 2011.
[27] U. Gargi, W. Lu, V . Mirrokni, and S. Yoon. Large-scale community detection on
youtube for topic discovery and exploration. In Fifth International AAAI Confer-
ence on Weblogs and Social Media, 2011.
[28] U. Gargi, W. Lu, V . S. Mirrokni, and S. Yoon. Large-scale community detection
on youtube for topic discovery and exploration. In ICWSM, 2011.
[29] C. L. Giles, K. D. Bollacker, and S. Lawrence. Citeseer: An automatic citation
indexing system. In ACM DL, pages 89–98, 1998.
[30] C. L. Giles, K. D. Bollacker, and S. Lawrence. Citeseer: An automatic cita-
tion indexing system. In Proceedings of the third ACM conference on Digital
libraries, pages 89–98. ACM, 1998.
[31] G. H. Golub and C. Reinsch. Numerische mathematik, 14(5):403–420, 1970.
[32] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y . Bengio. Generative adversarial nets. In Advances in neural
information processing systems, pages 2672–2680, 2014.
[33] M. T. Goodrich and R. Tamassia. Algorithm design: foundation, analysis and
internet examples. John Wiley & Sons, 2006.
[34] P. Goyal and E. Ferrara. Graph embedding techniques, applications, and perfor-
mance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
[35] P. Goyal and E. Ferrara. Graph embedding techniques, applications, and perfor-
mance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
[36] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In
Proceedings of the 22nd ACM SIGKDD international conference on Knowledge
discovery and data mining, pages 855–864. ACM, 2016.
[37] S. Gu and L. Rigazio. Towards deep neural network architectures robust to adver-
sarial examples. arXiv preprint arXiv:1412.5068, 2014.
[38] D. K. Hammond, P. Vandergheynst, and R. Gribonval. Wavelets on graphs
via spectral graph theory. Applied and Computational Harmonic Analysis,
30(2):129–150, 2011.
[39] M. T. Harandi, C. Sanderson, S. Shirazi, and B. C. Lovell. Graph embedding dis-
criminant analysis on grassmannian manifolds for improved image set matching.
In CVPR 2011, pages 2705–2712. IEEE, 2011.
[40] T. K. Ho. Random decision forests. In Proceedings of 3rd international con-
ference on document analysis and recognition, volume 1, pages 278–282. IEEE,
1995.
[41] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computa-
tion, 9(8):1735–1780, 1997.
[42] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social
network analysis. Journal of the american Statistical association, 97(460):1090–
1098, 2002.
[43] I. Jolliffe. Principal component analysis. Springer, 2011.
[44] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96–129, 1998.
[45] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
[46] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolu-
tional networks. arXiv preprint arXiv:1609.02907, 2016.
[47] C.-C. J. Kuo, M. Zhang, S. Li, J. Duan, and Y. Chen. Interpretable convolutional neural networks via feedforward design. arXiv preprint arXiv:1810.02786, 2018.
[48] C.-C. J. Kuo, M. Zhang, S. Li, J. Duan, and Y. Chen. Interpretable convolutional neural networks via feedforward design. Journal of Visual Communication and Image Representation, 60:346–359, 2019.
[49] T. M. Le and H. W. Lauw. Probabilistic latent document network embedding.
In 2014 IEEE International Conference on Data Mining, pages 270–279. IEEE,
2014.
[50] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems, pages 396–404, 1990.
[51] Y. LeCun et al. LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet, 20:5, 2015.
[52] O. Levy and Y. Goldberg. Neural word embedding as implicit matrix factorization. In Advances in neural information processing systems, pages 2177–2185, 2014.
[53] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks
for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial
Intelligence, 2018.
[54] J. Liang, S. Gurukar, and S. Parthasarathy. Mile: A multi-level framework for
scalable graph embedding. arXiv preprint arXiv:1802.09612, 2018.
[55] R. Liao, M. Brockschmidt, D. Tarlow, A. L. Gaunt, R. Urtasun, and R. Zemel.
Graph partition neural networks for semi-supervised classification. arXiv preprint
arXiv:1803.06272, 2018.
[56] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social net-
works. Journal of the American society for information science and technology,
58(7):1019–1031, 2007.
[57] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social net-
works. Journal of the American society for information science and technology,
58(7):1019–1031, 2007.
[58] S. Mac Kim, Q. Xu, L. Qu, S. Wan, and C. Paris. Demographic inference on
twitter using recursive neural networks. In Proceedings of the 55th Annual Meet-
ing of the Association for Computational Linguistics (Volume 2: Short Papers),
volume 2, pages 471–477, 2017.
[59] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-
dimensional data sets with application to reference matching. In Proceedings
of the sixth ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 169–178. Citeseer, 2000.
[60] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the
construction of internet portals with machine learning. Information Retrieval,
3(2):127–163, 2000.
[61] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word
representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[62] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in neural
information processing systems, pages 3111–3119, 2013.
[63] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed repre-
sentations of words and phrases and their compositionality. In Advances in neural
information processing systems, pages 3111–3119, 2013.
[64] M. F. Møller. A scaled conjugate gradient algorithm for fast supervised learning.
Neural networks, 6(4):525–533, 1993.
[65] M. F. Møller. A scaled conjugate gradient algorithm for fast supervised learning.
Neural networks, 6(4):525–533, 1993.
[66] J. Mu, S. Bhat, and P. Viswanath. All-but-the-top: Simple and effective postpro-
cessing for word representations. arXiv preprint arXiv:1702.01417, 2017.
[67] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pages 849–856, 2002.
[68] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking:
Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[69] L. S. Penrose. The elementary statistics of majority voting. Journal of the Royal
Statistical Society, 109(1):53–57, 1946.
[70] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social rep-
resentations. In Proceedings of the 20th ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
[71] V. Raunak. Effective dimensionality reduction for word embeddings. CoRR, abs/1708.03629, 2017.
[72] S. L. Robinson and R. J. Bennett. A typology of deviant workplace behaviors:
A multidimensional scaling study. Academy of management journal, 38(2):555–
572, 1995.
[73] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear
embedding. science, 290(5500):2323–2326, 2000.
[74] O. Samko, A. D. Marshall, and P. L. Rosin. Selection of the optimal parame-
ter value for the isomap algorithm. Pattern Recognition Letters, 27(9):968–979,
2006.
[75] V. Satuluri and S. Parthasarathy. Scalable graph clustering using stochastic flows: applications to community discovery. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 737–746. ACM, 2009.
[76] L. K. Saul, K. Q. Weinberger, J. H. Ham, F. Sha, and D. D. Lee. Spectral methods
for dimensionality reduction. Semisupervised learning, pages 293–308, 2006.
[77] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The
graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–
80, 2009.
[78] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recur-
rent networks. Connection Science, 1(4):403–412, 1989.
[79] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad. Col-
lective classification in network data. AI magazine, 29(3):93–93, 2008.
[80] N. Shahid, V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst. Robust principal component analysis on graphs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2812–2820, 2015.
[81] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts.
Recursive deep models for semantic compositionality over a sentiment treebank.
In Proceedings of the 2013 conference on empirical methods in natural language
processing, pages 1631–1642, 2013.
[82] M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the
optimization landscape of over-parameterized shallow neural networks. IEEE
Transactions on Information Theory, 65(2):742–769, 2018.
[83] C. Spearman. The proof and measurement of association between two things.
The American journal of psychology, 15(1):72–101, 1904.
[84] F. Spitzer. Principles of random walk, volume 34. Springer Science & Business
Media, 2013.
[85] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting. The Jour-
nal of Machine Learning Research, 15(1):1929–1958, 2014.
[86] J. A. Suykens and J. Vandewalle. Least squares support vector machine classi-
fiers. Neural processing letters, 9(3):293–300, 1999.
[87] K. S. Tai, R. Socher, and C. D. Manning. Improved semantic representa-
tions from tree-structured long short-term memory networks. arXiv preprint
arXiv:1503.00075, 2015.
[88] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale
information network embedding. In Proceedings of the 24th International Con-
ference on World Wide Web, pages 1067–1077. International World Wide Web
Conferences Steering Committee, 2015.
[89] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale
information network embedding. In Proceedings of the 24th International Con-
ference on World Wide Web, pages 1067–1077. International World Wide Web
Conferences Steering Committee, 2015.
[90] A. Theocharidis, S. Van Dongen, A. J. Enright, and T. C. Freeman. Network
visualization and analysis of gene expression data using biolayout express 3d.
Nature protocols, 4(10):1535, 2009.
[91] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo.
Graphgan: Graph representation learning with generative adversarial nets. arXiv
preprint arXiv:1711.08267, 2017.
[92] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang. Community preserving
network embedding. In Thirty-First AAAI Conference on Artificial Intelligence,
2017.
[93] P. J. Werbos. Backpropagation through time: what it does and how to do it.
Proceedings of the IEEE, 78(10):1550–1560, 1990.
[94] Q. Xu, Q. Wang, C. Xu, and L. Qu. Attentive graph-based recursive neural network for collective vertex classification. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2403–2406. ACM, 2017.
[95] Q. Xu, Q. Wang, C. Xu, and L. Qu. Collective vertex classification using recursive
neural network. arXiv preprint arXiv:1701.06751, 2017.
[96] Q. Xu, Q. Wang, C. Xu, and L. Qu. Collective vertex classification using recursive
neural network. arXiv preprint arXiv:1701.06751, 2017.
[97] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: A general framework for dimensionality reduction. IEEE transactions on pattern analysis and machine intelligence, 29(1):40–51, 2007.
[98] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding
and extensions: A general framework for dimensionality reduction. IEEE trans-
actions on pattern analysis and machine intelligence, 29(1):40–51, 2007.
[99] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang. Network representation learning
with rich text information. In Twenty-Fourth International Joint Conference on
Artificial Intelligence, 2015.
[100] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang. Network representation learning with rich text information. In IJCAI, pages 2111–2117, 2015.
[101] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang. Network representation learning with rich text information. In IJCAI, pages 2111–2117, 2015.
[102] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Y. Chang. Network representation learning with rich text information. In IJCAI, pages 2111–2117, 2015.
[103] J. Yang and J. Leskovec. Overlapping communities explain core–periphery orga-
nization of networks. Proceedings of the IEEE, 102(12):1892–1902, 2014.
[104] Z. Yang, J. Tang, and W. W. Cohen. Multi-modal bayesian embeddings for learn-
ing social knowledge graphs. In IJCAI, pages 2287–2293, 2016.
[105] J. Ye, R. Janardan, and Q. Li. Two-dimensional linear discriminant analysis. In
Advances in neural information processing systems, pages 1569–1576, 2005.
[106] C. Zhang, K. Zhang, Q. Yuan, H. Peng, Y. Zheng, T. Hanratty, S. Wang, and J. Han. Regions, periods, activities: Uncovering urban dynamics via cross-modal representation learning. In Proceedings of the 26th International Conference on World Wide Web, pages 361–370. International World Wide Web Conferences Steering Committee, 2017.
[107] D. Zhang, J. Yin, X. Zhu, and C. Zhang. Homophily, structure, and content
augmented network representation learning. In Data Mining (ICDM), 2016 IEEE
16th International Conference on, pages 609–618. IEEE, 2016.
[108] M. Zhang, H. You, P. Kadam, S. Liu, and C.-C. J. Kuo. Pointhop: An explain-
able machine learning method for point cloud classification. arXiv preprint
arXiv:1907.12766, 2019.
[109] D. Zhou and C. J. Burges. Spectral clustering and transductive learning with
multiple views. In Proceedings of the 24th international conference on Machine
learning, pages 1159–1166. ACM, 2007.
[110] Z. Zhou, X. Li, J. Wright, E. Candes, and Y. Ma. Stable principal component pursuit. In Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on, pages 1518–1522. IEEE, 2010.
[111] H. Zou, T. Hastie, and R. Tibshirani. Sparse principal component analysis. Jour-
nal of computational and graphical statistics, 15(2):265–286, 2006.
[112] D. Zügner, A. Akbarnejad, and S. Günnemann. Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2847–2856. ACM, 2018.
Abstract
Graph representation learning is an important task because much real-world data naturally takes the form of graphs. Graph data are often high-dimensional and irregular, which makes them harder to analyze than traditional low-dimensional data. Graph embedding has therefore been widely used to convert graph data into a lower-dimensional space while preserving the intrinsic properties of the original data.

In this dissertation, we specifically study two graph embedding problems: 1) developing effective graph embedding techniques that can give researchers a deeper understanding of the collected data more efficiently
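To make the embed-then-classify workflow described above concrete, the short sketch below builds vertex embeddings for a small benchmark graph and then classifies its vertices. This is an illustrative sketch only, not code from this dissertation: it assumes the third-party Python packages networkx, numpy, and scikit-learn, and it substitutes a simple PCA of the adjacency matrix for the embedding methods studied in the later chapters.

# Minimal embed-then-classify sketch (illustrative only; assumes
# networkx, numpy, and scikit-learn are installed).
import networkx as nx
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Zachary's karate club: 34 vertices, each labeled by its community ("club").
G = nx.karate_club_graph()
labels = np.array([1 if G.nodes[v]["club"] == "Officer" else 0 for v in G.nodes])

# Step 1: embed each vertex into a low-dimensional space. Here the rows of
# the adjacency matrix are reduced with PCA; DeepWalk, node2vec, or a graph
# convolutional network would replace this step in practice.
A = nx.to_numpy_array(G)
embeddings = PCA(n_components=8, random_state=0).fit_transform(A)

# Step 2: train a vertex classifier on the embeddings and evaluate it.
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.5, stratify=labels, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("vertex classification accuracy:", clf.score(X_test, y_test))

Swapping the PCA step for a learned embedding leaves the rest of the pipeline unchanged, which is the sense in which embedding quality directly drives vertex classification accuracy.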