LEARNING DISTRIBUTED REPRESENTATIONS FROM NETWORK DATA
AND HUMAN NAVIGATION
by
Hao Wu
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
December 2017
Copyright 2017 Hao Wu
Acknowledgements
My study at USC was a great intellectual journey because of wonderful faculty
members and incredible fellow students.
Most of all, I would like to thank my advisor Kristina Lerman for her immense
support and guidance. It was my pleasure to work with her. I also thank David
Kempe, Kevin Knight, Greg Ver Steeg and Florenta Teodoridis, who served as
my qualification and dissertation committee, for providing valuable and insightful
comments.
I would like to thank my group members and colleagues at USC and Information
Sciences Institute for many interesting discussions. I also thank my mentors at
Baidu Research, Yahoo Labs and NEC Labs for hosting my summer internships
and inspiring my research.
Last but not least, I am grateful to my family members for their encouragement
during my study over the years.
Contents
Acknowledgements 2
List of Tables 5
List of Figures 6
Abstract 7
1 Introduction 9
1.1 Motivations and Applications . . . . . . . . . . . . . . . . . . . . . 9
1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Distributed Representation . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.6 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Related Work 21
2.1 Feature Representations of Networks . . . . . . . . . . . . . . . . . 21
2.2 Neural Language Models . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Semantic and Link Analysis . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Human Navigation in Networks . . . . . . . . . . . . . . . . . . . . 27
2.5 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Modeling Network Structure 30
3.1 Network Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.1 A Neural Architecture . . . . . . . . . . . . . . . . . . . . . 33
3.1.2 Learning with Negative Sampling . . . . . . . . . . . . . . . 36
3.1.3 An Inverse Architecture . . . . . . . . . . . . . . . . . . . . 37
3.1.4 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . 38
3.1.5 The Property of Network Vector . . . . . . . . . . . . . . . . 39
3.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2.1 Role Discovery . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.2.2 Concept Analogy . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2.3 Multi-label Classification . . . . . . . . . . . . . . . . . . . . 48
3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4 Modeling Networked Documents 57
4.1 Deep Context Vector . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.1.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . 61
4.1.2 Optimization with Negative Sampling . . . . . . . . . . . . . 64
4.1.3 Inference and Prediction . . . . . . . . . . . . . . . . . . . . 65
4.1.4 An Alternative Architecture . . . . . . . . . . . . . . . . . . 67
4.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Methods for Comparison . . . . . . . . . . . . . . . . . . . . 69
4.2.3 Document Classification . . . . . . . . . . . . . . . . . . . . 70
4.2.4 Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2.5 Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5 Modeling Human Navigation 77
5.1 Hierarchical Document Vector . . . . . . . . . . . . . . . . . . . . . 78
5.1.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . 79
5.1.2 Model Variants . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.1.3 Model Optimization . . . . . . . . . . . . . . . . . . . . . . 82
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2.1 Methods for Comparison . . . . . . . . . . . . . . . . . . . . 84
5.2.2 Movie Genre Classification on MovieLens . . . . . . . . . . . 85
5.2.3 Applications on Yahoo News . . . . . . . . . . . . . . . . . . 87
5.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6 Discussion and Future Directions 93
6.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
List of Tables
2.1 Notations and descriptions. . . . . . . . . . . . . . . . . . . . . . . 29
3.1 A subset of most cited legal decisions with two distinct patterns of
ego-networks. Two-dimensional embeddings of the legal decisions
are learned using Network Vector with their ego-networks as input,
and mapped to the right figure. Citations of the top 6 cases in the
table have a few giant hubs (red dots in the figure), while those of
the bottom 6 cases are well connected (blue dots in the figure). . . 41
3.2 Performance of role discovery task. Precision of top ranked k (P@k)
results is reported. . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3 Percentage of analogy tuples for which the answers are hit within
top k results (Hit@k). . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Macro-F1 scores for multi-label classification with a balanced 50%
train-test split between training and testing data. Results of Spec-
tral Clustering, DeepWalk, LINE and node2vec are reported in [Grover
and Leskovec, 2016]. . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.1 Statistics of data sets. . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 Performance of document classication. . . . . . . . . . . . . . . . 72
4.3 Performance of link prediction. . . . . . . . . . . . . . . . . . . . . 74
4.4 Word throughput rate and times of speed-up when training DCV
using different numbers of threads. . . . . . . . . . . . . . . . . . 76
5.1 Accuracy (%) on movie genre classication . . . . . . . . . . . . . . 86
5.2 Nearest neighbors of selected keywords . . . . . . . . . . . . . . . . 88
5.3 Most similar news stories for a given keyword . . . . . . . . . . . . 89
5.4 Titles of retrieved news articles for given news examples . . . . . . 90
5.5 Top related words for news stories . . . . . . . . . . . . . . . . . . 91
5.6 Relative average accuracy improvement (%) over LDA method. . . . 91
List of Figures
3.1 A neural network architecture of Network Vector. It learns the
global network representation as a distributed memory evolving with
the sliding window of node sequences. . . . . . . . . . . . . . . . . 34
3.2 Network Vector is trained on Zachary's Karate network to learn
two-dimensional embeddings for representing nodes (green circles)
and the entire graph (orange circle). The size of each node is pro-
portional to its degree. Note the global representation of the graph
is close to these high-degree nodes. . . . . . . . . . . . . . . . . . . 40
3.3 Performance evaluation of Network Vector and node2vec on varying
the parameter q when fixing p = 1 to avoid revisiting the source
nodes. Macro-F1 and Micro-F1 scores for multi-label classification
with a balanced 50% train-test split are reported. . . . . . . . . . . 53
3.4 Performance evaluation of Network Vector and node2vec on varying
the fraction of labeled data for training. . . . . . . . . . . . . . . 54
4.1 Wikipedia pages Cat and Dog with representative notions they refer
to. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.2 The general model architecture of Deep Context Vector. We refer
to it as DCV-vLBL. . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 An alternative model architecture. We refer to it as DCV-ivLBL as
it performs inverse language modeling. . . . . . . . . . . . . . . . . 66
4.4 Performance w.r.t number of vector dimensions . . . . . . . . . . . 73
4.5 Performance w.r.t. number of hops . . . . . . . . . . . . . . . . . . 75
5.1 Model architecture with two embedded neural language models (yel-
low: document vectors; green: word vectors). . . . . . . . . . . . . 79
Abstract
The increasing growth of network data, such as online social networks and linked
documents on the Web, has imposed great challenges on automatic feature gen-
eration for data analysis. We study the problem of learning representations from
network data, which is critical for real applications including document search,
personalized recommendation and role discovery. Most existing approaches either do not
characterize the surrounding network structure that serves as the context of each
data point, or are not scalable to large-scale data in real-world scenarios.
We present novel neural network algorithms to learn distributed representations
of network data by exploiting network structures and human navigation trails.
The algorithms embed data points into a common low-dimensional continuous
vector space, which facilitates predictive tasks such as classification, relational
learning and analogy. To improve the scalability, we adopt efficient optimization
and sampling methods in our algorithms.
Firstly, we propose a neural embedding algorithm to learn distributed repre-
sentations of generic graphs with global context. In order to represent the global
network structure, a global vector is used to learn the structural information col-
lectively from all local neighborhoods of nodes in a network. Our algorithm is
scalable and the learned global representations can be used for similarity measure-
ment between two networks. We evaluate our algorithm against the state-of-the-art
methods, and present empirical results on node classification, role discovery and
analogy tasks.
Secondly, we present a neural language model to learn text representations from
networked documents. This is based on the intuition that authors are influenced
by words in the documents they cite, and readers usually comprehend the words in
paragraphs by referring to those cited concepts or documents. The model can cap-
ture both the local context of word sequences and the semantic influence between
linked documents. We show improved performance on document classification and
link prediction tasks with our model.
Thirdly, information about how users navigate network data online provides
clues about missing links between cognitively similar concepts. Learning from
human navigation trails online can also help characterize human behavior and
improve recommendation. We devise another neural network algorithm to model
human navigation of text documents in a hierarchical manner. The model can learn
representations of text documents by accounting for human navigation patterns,
and hence is better linked to human cognition. We present empirical results of our
algorithm on real news and movie review datasets, and show its effectiveness on
document classication, search and recommendation tasks.
Chapter 1
Introduction
The networks are in business to give people exactly what they want.
– Steve Jobs
We live in an era of richer network data and information than ever. The
Web serves as an ever-growing information medium, which brings users increasing
freedom to access and navigate documents or other resources with hyperlinks. The
popularity of social media has not only greatly improved the transparency and
diffusion of information through the social networks, but also built ego-centered
information networks where users can create, search for and select personalized
information.
1.1 Motivations and Applications
Satisfying user information needs on the Web is a fundamental problem with a long
history. Traditional technologies, such as search engines, mainly aim at delivering
relevant information by matching documents to user queries. In text search, for
example, the relevance of a document to a query is determined mostly by their
similarities in words. A typical approach is to represent each document on the Web
as a vector of its word counts. However, text data usually has high dimensionality,
and the vocabulary size of words can easily go beyond a million for a large-scale text
corpus. Methods for learning low-dimensional representations of data by capturing
their structure can reduce the dimensionality, and serve as a basic means to store,
index, classify and retrieve Web network data in an efficient way. Nevertheless, user
information needs can go beyond what a basic search engine can satisfy. We name
a few applications which require intelligent systems to understand relationships
between network data.
Contextual Recommendation
Traditional search methods place much emphasis on the relevance of retrieved infor-
mation, but they are mostly unable to satisfy personalized information needs.
Recommendation systems have been developed to feed personalized information
without specified queries from the user's side. To be successful, a recommendation sys-
tem needs to deliver an exact piece of information to the right user at the right
place and in a timely manner. To achieve this, understanding user interests, behav-
ior and context is key. For example, consider a researcher who is writing a paper.
She needs to cite relevant literature, but sometimes it could be difficult to track
and identify good references. Searching for papers with a set of keywords may
have a low recall rate, and scanning through all papers at relevant conferences
requires tremendous effort. In this case, a citation recommendation system that
can automatically list relevant papers to cite is quite useful. Given written text,
the system needs to be intelligent enough to understand the semantics, topics and
area of the paper in order to recommend references. To be attractive, the system
should also be able to tell where (e.g., which sentence or paragraph) to embed the
right citation. A real-time citation recommendation is more desirable if it can pop
up high-quality references on the fly while the paper is being written. This requires
learning high-quality representations of text that capture the context. This is the
main focus of this thesis.
Prediction of Next Click
A fundamental problem related to recommendation is predicting whether a user
will click on an article or an item online. This plays a central role in many business-
driven applications, such as news feed systems, search advertising and online
shopping. A good prediction of next click needs to rely on models that can deeply
capture the user context. The context can be described as the interconnected
mixture of user interests, navigation patterns, the matching of document semantics
and so on. For example, consider a news feed system that needs to present fresh
stories to interest a user after she has read a few. To approach this, a simple way
is to mine the user interests by looking at what kinds of news she read in the past,
and return similar ones as recommendations. However, novel and serendipitous
recommendations cannot be easily triggered, and the user might have already read
the recommended news before. Another downside is that quite a few users may be
new to the system, and there is very limited log data to characterize them. This is a
typical problem of data sparsity in recommendation systems. This problem can be
addressed using the collective navigation patterns of other users in real data. The
idea behind this is that a cold-start user will likely click on news just as others did
in the same context, e.g., given that a similar set of news articles has been read shortly
beforehand. In this thesis, we explore collective human navigation patterns on
network data and aim at learning representations of network data that are better
aligned with human cognition.
Role Discovery in Social Networks
In online social network services, information seeking tasks, such as searching
for people, data and asking questions, can be significantly enhanced, resulting
in increased transparency of legitimate information and knowledge sharing. In
an enterprise setting, the usage of social media also improves collaboration between
employees, enhances the cohesion of the organizational structure, and promotes enterprise
culture. The availability of social interaction data and connections between users
offers a unique opportunity for identifying individual roles across social circles.
Roles can be defined in various ways and are dependent on the context.
For example, it is of interest to find how influential the role of a researcher is in
a research community. The measurement of influence can rely on various factors,
such as the citations of research work, professional services and collaborations with
other researchers. Another example is identifying what roles individuals play in
the enterprise to make the team more productive; whether a member is a main
contributor to a business component with specific expertise, or plays the role of
"glue" in making the team cohesive. In this thesis, we consider learning distributed
representations of nodes in social networks to discover their roles. We would also
like to extend to the problem of identifying the correspondence of roles in different
social networks.
1.2 Challenges
The increasing availability of large scale network data has imposed new challenges
to present exact information that is relevant or attractive to users. This
requires sophisticated and efficient algorithms for learning representations of data
points. The learned representations can be used as features in various real world
tasks including classication, search and personalized recommendation. In order
to learn good representations of network data, there are several key challenges we
need to address.
Rich Context in Network
Modeling each data point alone, without the intrinsic connections between data
points in a network, is undesirable. The surrounding network structure serves
as the context of each data point. In order to characterize the data point of
interest, we need to consider the concept of neighborhood. A traditional solution
is to extract human-engineered and network related features for each data point,
such as degree, link weight and clustering coefficient. However, this may result in
heterogeneous feature sets which are not scale-invariant. Hence, there is typically
no natural similarity metric for comparing two data points accurately.
Composition of Semantics
Text attributes are common in network data. Natural language text is a typical
kind of user-generated content online. For example, users edit concept explanation
in Wikipedia and post comments in social media. In modeling network data,
methods need to learn the compositional semantics of text in each data point
in addition to the network structure. Most existing algorithms either ignore the
similarity between words, or cannot capture the effect of word order in expressing
semantics in sentences.
Human Navigation Trails
The links between Web documents serve as a means for human navigation. How-
ever, the links in existing data are far from perfect for information navigation,
with missing or redundant links. To make networks more efficient for navigation,
we may take advantage of user click-through data from search engines and news
publishers to create new links. Nevertheless, it is often not clear how to devise
effective algorithms that capture the cognitive factors in modeling human naviga-
tion on network data.
Scalability
To be useful in real world applications which require prompt response to queries,
methods for learning representations need to be scalable in practice and efficient
in processing data. However, most existing models are intractable to learn and
computationally costly to apply to large scale data. Algorithms have to be designed
so as to exploit the power of distributed processing.
1.3 Distributed Representation
To overcome the above-mentioned challenges, one appealing data modeling paradigm
is learning distributed representations with neural networks [Hinton, 1986]. Dis-
tributed representation is a way of encoding information (objects or concepts) into
vectors with memory units (e.g., neurons). It distributes a piece of information
over all components of the vector, each representing an aspect of the information.
The idea behind this is that in the real world, data usually has decompositional
structure. For example, an article may cover a few independent topics; a person
may have different roles in participating in disjoint social circles. The concept of
distributed representation has been successfully applied in neural network algo-
rithms. Each piece of information is represented by a set of neurons, and each
neuron represents many different pieces of information.
In contrast, the simplest traditional encoding scheme is to dedicate each neu-
ron to one distinct piece of information. This is known as one-hot encoding and
falls into the category of local representations where there is a one-to-one map-
ping between data points and neurons. However, local representation is extremely
inefficient, since the number of required neurons is proportional to the number of
data points. Another downside is that, by construction, each data point in one-hot
encoding is represented as similar only to itself and dissimilar to all other data points;
there is no natural metric to measure the similarity between two data
points. Distributed representation has no such problems: by mapping data points
into a latent space, a small set of neurons is able to collectively represent many
data points efficiently. There are two typical choices for configuring neurons in
distributed representations: the code of a neuron can be either discrete-valued or
real-valued. Compared to discrete-valued code, where each neuron can only have a
limited number of integer-valued configurations, real-valued code of distributed
representation can encode similar data points with vectors that may vary slightly
in their component values. From a geometric perspective, real-valued code maps data
points into a continuous space where each point can be located at an arbitrary
position, while discrete-valued code separates each point block-wise in discrete
space.
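To make the contrast concrete, the small sketch below (illustrative only; the toy vocabulary and random vectors are made up) compares a one-hot encoding, where distinct items always have zero similarity, with low-dimensional real-valued vectors, where similarity can vary continuously.

```python
import numpy as np

# One-hot (local) encoding: one neuron per item, no notion of similarity.
vocab = ["cat", "dog", "couch"]
one_hot = np.eye(len(vocab))          # each row represents exactly one item
print(one_hot[0] @ one_hot[1])        # 0.0 -- "cat" and "dog" look unrelated

# Distributed (real-valued) encoding: a few dimensions shared by all items.
rng = np.random.default_rng(0)
D = 2                                 # dimensionality of the latent space
embeddings = rng.normal(size=(len(vocab), D))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity now varies continuously; training would move related items closer.
print(cosine(embeddings[0], embeddings[1]))
```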
1.4 Research Questions
In this thesis, we present methods to learn distributed representations of network
data. Real-valued code is employed for the purpose of accurately measuring simi-
larities of data points. Correlated data points can be assigned similar representa-
tions by mapping them nearby to each other in low-dimensional continuous space.
This helps in learning good predictive models even with a large number of rare
observations in long-tailed data distributions. However, it is still difficult to devise
neural network algorithms to map sparse data points into low-dimensional vectors
in continuous space, while preserving the intrinsic structure of network data. In
particular, there are three main research questions we need to address:
Q1. Can we learn distributed representations of networks that encode their global
link structure, as well as local connectivity?
Most applications on networks require feature generation to compare networks
in terms of structural similarity and network connectivity. Conventional approaches
extract hand-crafted features [Gallagher and Eliassi-Rad, 2010, Henderson et al.,
2011, Berlingerio et al., 2012] such as node degrees, edge weights or clustering
coecients, and suer from the heterogeneity of dierent feature sets. We aim at
automating the feature generation in a representation learning framework. Prior
work [Perozzi et al., 2014, Tang et al., 2015b, Cao et al., 2015, Grover and Leskovec,
2016] focus on learning node level feature representations by respecting local neigh-
borhood structure. However, these approaches exploit little information of the
global connectivity of the network.
We introduce global representation for the entire network, which serves as the
global context to predict each observed node using its neighbors in the network.
The global representation is shared across all neighborhood structures and learned
to capture the structural properties of the network. The learned global representa-
tions can be used as signatures for measuring similarity between networks and in
predictive tasks such as role discovery. This avoids ad-hoc methods for aggregating
node representations to describe the network.
Q2. How to extend the algorithm to model rich structure of the networks that
includes textual attributes of network nodes?
Network data, such as Web documents and linked academic publications, usu-
ally contain natural language text. Most approaches for text representations are
based on bag-of-words. There is no obvious way to explore word order, which
serves as one typical context information for modeling words. The classic n-gram
language model is an alternative in that it can explore temporal information from
word sequences. The downside is its poor generalization to unseen data: a rela-
tively long word sequence (e.g., with ve or six grams) in test data may probably
have never been seen in the training data. To solve this problem, neural language
models [Bengio et al., 2003] have been proposed to learn distributed representations
of words in continuous space, which allows to take advantage of longer contexts to
estimate the distribution of the next word.
However, most existing distributed representation approaches do not take into
account the network structure of data. In most scenarios, network structure reflects
latent semantics, and is essential for interpreting the content of data. For example,
researchers in the same academic community usually exchange ideas, read, and
cite each other's papers. The influence between peers within the underlying social
network may have an impact on research ideas, paper writing styles and terminology.
As another example, consider the citation network of legal opinions. Lawyers study
existing case laws and cite them as evidence to support their opinions. In order to
understand and interpret a case, judges need to trace all the cited evidence to make
conclusions. We explore network structure to learn distributed representations
of data points with language, in the hope of capturing a deeper context of text
generation.
Q3. How to extend the algorithm to model dierent processes on networks, such
as human navigation?
Hyperlinks are essential to provide navigational paths for humans to browse infor-
mation on the Web. Suppose a user only gets local access to the pages which
are immediate neighbors of the one that she is currently browsing. The way
people navigate Web data can be modeled as random walks on graphs. To be
efficiently navigable, a network needs to contain short paths, a property known
as small world [Milgram, 1967, Watts and Strogatz, 1998, Kleinberg, 2000, 2002].
In a realistic setting, users tend to be good at selecting clues to find informa-
tion through short paths [West and Leskovec, 2012a,b], e.g., hierarchical struc-
ture of social networks [Watts et al., 2002], and compression heuristics [Brashears,
2013]. On the other hand, research shows there are constraints that regu-
late human behavior in information propagation, navigation and discovery on the
Web [Huberman et al., 1998, Craswell et al., 2008, Lerman et al., 2017] and social
media as well [Steeg et al., 2011, Hodas and Lerman, 2012]. Understanding human
behaviors in navigation is of importance to discovery of missing links [West et al.,
2015], and to creation of navigational aids for information seekers with user-friendly
links [Geigl et al., 2016].
However, the dynamics of human navigation is difficult to quantify in general.
The human factors mainly governing the process vary across different kinds of
network data [Singer et al., 2015]. The underlying network topology plays a funda-
mental role in aiding navigation on online encyclopedias such as Wikipedia, while
users prefer to navigate similar items matching their interests on hobby-sharing websites
such as Last.fm. In this thesis, instead of explicitly distinguishing specific naviga-
tion patterns and reasoning about underlying human motivations, we seek to learn
distributed representations of networks by exploiting human navigation data. By
implicitly capturing human navigation patterns, our approach can be used to learn
missing links between network data. Another application is to predict the trails
that information seekers will take on networks using the learned representations.
1.5 Contributions
We present novel neural network algorithms to learn distributed representations
from network data. Random walks and human navigation trails on networks are
exploited in representation learning. Our algorithms embed data points into a
common low-dimensional vector space, and hence facilitate similarity comparison
and a variety of predictive tasks. In short, we make the following contributions:
We present a neural embedding algorithm called Network Vector for repre-
senting generic graphs. The algorithm captures the global context of network
structure, and predicts nodes based on their local neighborhoods.
To model the text in networked documents, we propose an algorithm called
Deep Context Vector. The algorithm incorporates the local context of word
sequences and semantic influence between linked documents.
We describe Hierarchical Document Vector, which leverages human naviga-
tion patterns from user logs to learn representations of text documents.
To speed up model training, we adopt efficient optimization and sampling
methods to scale our algorithms for large-scale data. We evaluate our algo-
rithms on real world datasets, and compare them with the state-of-the-art
representation learning algorithms on tasks including classification, link pre-
diction, role discovery and concept analogy.
1.6 Thesis Organization
The rest of this thesis is organized as follows: In Chapter 2, we discuss the back-
ground and related work which inspire our work on learning representations of net-
work data. In Chapter 3, we introduce Network Vector, which learns distributed
representations of generic graphs or networks with global context. In Chapter 4,
we describe Deep Context Vector (DCV) in detail, which is proposed to model
the deep context of networked documents with text. In Chapter 5, we present Hierar-
chical Document Vector (HDV). The model takes the human navigation data on
Web documents as input, and learns representations of documents. In Chapter 6,
we discuss our work in comparison to previous work in several aspects, and also
present several lines of interesting future work.
Chapter 2
Related Work
We begin with a review of related work, which provides the background of our
models. Related work can be roughly divided into four categories: feature repre-
sentations of networks, neural language models, semantic and link analysis, and
human navigation in networks.
2.1 Feature Representations of Networks
Feature representations of networks are of critical importance to many real world
applications, including node classication [Getoor, 2007, Sen et al., 2008], anomaly
detection [Chandola et al., 2009], social recommendations [Fouss et al., 2007]
and link prediction [Liben-Nowell and Kleinberg, 2007]. The conventional tech-
niques for generating features from networks typically involve the computation
of hand-engineered features [Gallagher and Eliassi-Rad, 2010, Henderson et al.,
2011, Berlingerio et al., 2012]. They suffer from the heterogeneity of features from
different methodologies and from computational inefficiency. In this chapter, we cast fea-
ture representations of networks as an unsupervised learning problem [Bengio et al.,
2013]. Our work is mainly related to classic approaches of graph embedding and
distributed learning of node representations in networks.
Classic Graph Embedding
Embedding networks into low-dimensional vector space has attracted much research
attention in machine learning domain [Kruskal and Wish, 1978, Tenenbaum et al.,
2000, Roweis and Saul, 2000, Belkin and Niyogi, 2001, Tang and Liu, 2009b,a,
2011]. Most approaches to graph embedding are based on the adjacency matrix
representations of graphs and utilize linear or non-linear dimensionality reduction
to produce the graph embeddings. Representative methods include MDS [Kruskal
and Wish, 1978], IsoMap [Tenenbaum et al., 2000] and Laplacian eigenmap [Belkin
and Niyogi, 2001]. These methods may perform well in similarity measurement of
nodes since the resulting embeddings preserve the local neighborhoods of nodes in
a network. However, most approaches are computationally expensive and do not
scale well for networks of large size.
Distributed Node Representations
Our algorithm builds on the foundation of learning distributed representations of
concepts [Hinton, 1986]. Distributed representations encode structural relation-
ships between concepts and are typically learned using back-propagation through
neural networks. Recent advances in natural language processing have successfully
adopted distributed representation learning and introduced a family of neural lan-
guage models [Bengio et al., 2003, Mnih and Hinton, 2007, Mikolov et al., 2010,
2013a,b] to model word sequences in sentences and documents. These approaches
embed words such that words in similar contexts tend to have similar represen-
tations in latent space. Distributed representations of words stand out due to
their property of capturing many linguistic regularities of human language, such
as syntactic, semantic similarity and logical analogy. Neural language models have
spurred renewed AI research, and significant improvements have been reported
recently in many tasks, including speech recognition [Amodei et al., 2015], machine
translation [Wu et al., 2016] and image captioning [Vinyals et al., 2015].
By exchanging the notions of nodes in a network and words in a document,
recent research [Perozzi et al., 2014, Tang et al., 2015b, Cao et al., 2015, Chang
et al., 2015, Grover and Leskovec, 2016, Wang et al., 2016] attempt to learn node
representations in a network in a similar way of learning word embeddings in neural
language models. Our work follows this line of approach in general purpose and
methodology. Dierent node sampling strategies are explored in these approaches.
For example, DeepWalk [Perozzi et al., 2014] samples node sequences from network
using a stream of short random walks, and model them just like word sequences in
documents using neural embeddings. LINE [Tang et al., 2015b] samples nodes in
pairwise manner and models the rst-order and second-order proximity between
them. GrapRep [Cao et al., 2015] extends LINE to exploit structural information
beyond second-order proximity. To oer a
exible node sampling from a network,
node2vec utilizes a combination of Depth-First Search (DFS) and Breath-First
Search (BFS) strategies to explore the local neighborhood structure of a network.
However, existing approaches only consider the local network structures (i.e.,
the neighborhoods of nodes) in learning node embeddings, but exploit little infor-
mation of the global network structure. Although the recent approach GraRep [Cao
et al., 2015] attempts to capture long-distance relationships between two different
vertices, it limits its scope to a fixed number of hops. More importantly, existing
approaches focus on node representations and require additional eort to com-
pute the representation of the entire network. The simple scheme of averaging
the representations over all nodes to represent the network is by no means a good
choice, as it ignores the statistics of node frequency and their roles in the network.
In contrast, our approach introduces a notion called global vector of a network
to represent the structural properties of the network. The global vector repre-
sentation of the network acts as a memory which is asked to contribute to the
prediction of a node together with the node's neighbors, and is updated to
maximize the predictive likelihood. As a result, our algorithm can simultaneously
learn the global representation of a network and the representations of nodes in
the network. The learned representations can be used as the signatures of the net-
works for comparison, or as features for predictive tasks. Our approach is inspired
by Paragraph Vector [Le and Mikolov, 2014], which learns a continuous vector to
represent a piece of text of variable length, such as sentences, paragraphs and
documents.
2.2 Neural Language Models
Neural language models [Bengio et al., 2003] are based on the idea of learning
distributed representations [Hinton, 1986] for words. Instead of "one-hot" dis-
crete representations for words in vocabulary, neural language models use contin-
uous variables to represent words in vector space, in the hope of improving the
generalization of classic n-gram language models for modeling word sequences.
Distributed word representations have been successfully applied to many tasks in
natural language processing and data mining. More recently, the concept of dis-
tributed representations have been extended beyond modeling pure unigram words
to phrases [Mikolov et al., 2013b], sentences and documents [Le and Mikolov, 2014],
relational entities [Bordes et al., 2013, Socher et al., 2013], and general text-based
attributes [Kiros et al., 2014b]. Representative applications of neural language
models also include learning social representations [Perozzi et al., 2014], automatic
generation of image captions [Kiros et al., 2014a] and semi-supervised text embed-
ding [Tang et al., 2015a].
Neural language models take into account word order in sentences, and assume
that temporally closer words are statistically more dependent. The "similarities"
between words can be learned within and across sentences [Bengio et al., 2003] and
this helps improve generalization. For example, having observed the sentences "the
cat sat on the couch" and "the dog lay on the floor", the models can learn word
vectors which capture semantically similar words in the same grammatical roles,
e.g., the word pairs "cat" and "dog", "sat" and "lay", "couch" and "floor", from
their shared surrounding words "the ... on the ...". Typically, a neural language
model learns the probability distribution of the next word given a fixed number of
preceding words, which act as the context for the next word. The neural network is
trained by projecting the concatenation of vectors for context words into a latent
space with multiple non-linear hidden layers and the output softmax layer, and
attempts to predict the next word with high probability. However, a neural network of
large size is difficult to train, and the word vectors are computationally expensive
to learn from large scale data comprising millions or billions of words. Recent
approaches with different versions of log-bilinear models [Mnih and Hinton, 2007]
or log-linear models [Mikolov et al., 2013a] attempt to modify the model archi-
tecture in order to reduce the computational complexity. The use of hierarchical
softmax [Morin and Bengio, 2005], noise contrastive estimation [Mnih and Teh,
2012] or negative sampling [Mikolov et al., 2013b] can also help speed up train-
ing. In our work, we propose models based on neural language models for learning
distributed representations for network data and also take into account human
navigation information.
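As a rough illustration of this prediction step, the sketch below (our own simplification, not code from any cited work) scores candidate next words with a vLBL-style log-bilinear layer: context word vectors are combined with per-position weights and the result is compared to every candidate's vector through a softmax. All names and sizes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "couch"]
D, n_ctx = 8, 3                                            # embedding size, context length

word_vecs = rng.normal(scale=0.1, size=(len(vocab), D))    # one vector per word
pos_weights = rng.normal(scale=0.1, size=(n_ctx, D))       # per-position context weights

def predict_next(context_ids):
    # Combine the context word vectors into one predicted representation.
    v_hat = np.sum(pos_weights * word_vecs[context_ids], axis=0)
    scores = word_vecs @ v_hat                             # one score per vocabulary word
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                             # softmax over the vocabulary

context = [vocab.index(w) for w in ["the", "cat", "sat"]]
print(dict(zip(vocab, np.round(predict_next(context), 3))))
```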
2.3 Semantic and Link Analysis
Modeling underlying structure of text documents and learning semantic repre-
sentations is critical to various applications including information retrieval, text
summarization and personalized recommendation. For instance, the quality of
document retrieval in modern search engines mostly relies on good measure-
ment of the semantic similarity between query and documents. Text documents
are usually represented using the normalized occurrences of words in them. The
word frequencies in a document are expressed in vector space, such as the popular
TFIDF [Salton and McGill, 1983] scheme. Although the TFIDF scheme has been
successfully adopted in practical search engines due to its simplicity and efficiency,
the approach captures little statistical structure of the text corpus, such as the
connections between words that are similar in semantics or the same word having ambigu-
ous meanings. In other words, the document representations ignore the similarity
between synonyms and cannot distinguish polysemy.
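For reference, a minimal TF-IDF weighting of a toy corpus might look like the sketch below (ours; the documents are invented). Note how two documents that use different synonyms share no weighted term, which is exactly the limitation discussed above.

```python
import math
from collections import Counter

docs = [["car", "engine", "road"],
        ["automobile", "engine", "highway"],
        ["dog", "cat", "pet"]]

def tfidf(corpus):
    n = len(corpus)
    df = Counter(w for doc in corpus for w in set(doc))   # document frequency per word
    vectors = []
    for doc in corpus:
        tf = Counter(doc)
        vectors.append({w: (tf[w] / len(doc)) * math.log(n / df[w]) for w in tf})
    return vectors

for vec in tfidf(docs):
    print(vec)   # "car" and "automobile" occupy independent dimensions
```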
To address the above problems, approaches for learning latent semantics, notably
topic modeling methods such as probabilistic Latent Semantic Analysis (PLSA [Hof-
mann, 1999]) and Latent Dirichlet Allocation (LDA [Blei et al., 2003]) have been
proposed. Both PLSA and LDA are directed graphical models. Each document is
modeled as a mixture of latent \topics", and a \topic" is modeled as a probability
distribution over words to represent a particular semantic aspect. LDA adopts a
Dirichlet prior to model the topic mixture weights and improves over PLSA in gen-
eralization to new documents. More recent approaches [Salakhutdinov and Hinton,
2009, Larochelle and Lauly, 2012, Srivastava et al., 2013] are undirected graphical
models which use Restricted Boltzmann Machines (RBMs) to factorize data dis-
tributions into a particular form of the product of experts, rather than a mixture
of latent aspects. RBMs have a complementary prior over hidden units and hence
address the explaining-away problem. These models generally outperform PLSA
and LDA in learning document representations. However, most of these models
do not account for data with the network structure.
Research in link analysis [Hoff et al., 2002, Kemp et al., 2004, Airoldi et al.,
2009, McAuley and Leskovec, 2012] attempts to model network structure and find
"memberships" or "communities" in latent space. However, these approaches can-
not be easily applied to nodes with text. Most of our models are close to recent
work in joint content and link analysis. Several models [McCallum et al., 2005,
Dietz et al., 2007, Nallapati et al., 2008, Mei et al., 2008, Chang et al., 2009, Chang
and Blei, 2009, Wang and Blei, 2011] have been proposed based on topic modeling,
and links between nodes are explained by the similarity of topic mixtures. However,
most existing models are based on bag-of-words and there is little word order infor-
mation considered. It is also worth noting that most existing algorithms cannot
scale to real-world large-scale networked documents due to intractable inference.
In this thesis, we propose scalable models that can go beyond the bag-of-words
representations to exploit word sequences in sentences. Most importantly, we aim
to devise probabilistic models to predict a data point based on contexts such as
neighboring network structure, preceding words in text and global semantics of a
document.
2.4 Human Navigation in Networks
Estimating user navigation and ranking Web pages has been studied extensively
in information retrieval. Early notable algorithms include PageRank [Brin and
Page, 1998] and HITS [Kleinberg, 1999], which are mainly based on hyperlink
structure to estimate the probability that a user stops at a Web page or selects
an authoritative information source. Research has focused on analyzing massive
click trails or query logs [Lempel and Moran, 2000, Teevan et al., 2004, Deshpande
and Karypis, 2004, Bilenko and White, 2008, Singla et al., 2010] to improve Web
search results. Our thesis is mostly related to this line of work in terms of leveraging
human search trails on the Web. However, our work differs in methodology, in that it
considers learning distributed representations with high-order contexts, and goes
beyond Markov chain models as in [Brin and Page, 1998, Kleinberg, 1999, Lempel
and Moran, 2000, Deshpande and Karypis, 2004].
Surfing information networks is fundamentally similar to navigation in social
networks. The study of navigation in social networks can be traced back to Stanley
Milgram's sociological experiment of letter forwarding in the 1960s [Milgram, 1967],
which illustrates the small world property. It has been shown that the small world property
is critical for a network to be efficiently navigable [Watts and Strogatz, 1998,
Kleinberg, 2000, 2002]. In real information networks, short paths among all (or
most) pairs of nodes are not usually observable. Nevertheless, human navigation
in information networks is mostly efficient when proper strategies are used [West
and Leskovec, 2012a,b]. In order to explain the underlying mechanisms of human
navigation for information seeking, descriptive theories and models use metaphors,
such as information scent [Chi et al., 2001] and information foraging [Olston and
Chi, 2003]. To assess different hypotheses about human trails, recent studies [White
and Huang, 2010, Singer et al., 2015] have also examined various types of human
trails online including Web blogs, navigation trails over Wikipedia, business reviews
and online music listening logs. Although it is important to quantify the human
factors of navigation in information networks, in this thesis we mainly
focus on modeling human trails to help learn representations of network data.
Table 2.1: Notations and descriptions.
Notation      Description
v             a vertex or node in a network
w             a word in a sentence
d             a document
w_{1:n}       a word sequence in a sentence, short for (w_1, ..., w_n)
v_{1:n}       a node sequence sampled from a network, short for (v_1, ..., v_n)
w_{t-c:t+c}   word sequence (w_{t-c}, ..., w_{t+c}), excluding w_t
v_{t-c:t+c}   node sequence (v_{t-c}, ..., v_{t+c}), excluding v_t
v_i           real-valued vector representation for node i
v_w           real-valued vector representation of a word w
v_d           real-valued vector representation of a document d
v_G           real-valued vector representation of a graph G
\hat{v}       predicted word or node vector of context
N^m_d         the set of neighbors of d within m hops
d_{1:m+1}     sampled document sequence along links of m hops
K             vocabulary size of words
N             number of distinct documents in a corpus
D             dimensionality of a vector
Z             normalization term in log-bilinear model
2.5 Notation
Table 2.1 lists a few important notations in our models and their descriptions,
which will be used throughout this thesis.
Chapter 3
Modeling Network Structure
The networks have a particular agenda,
a particular model and structure.
– Joss Whedon
Applications in network analysis, including network pattern recognition, clas-
sification, role discovery, and anomaly detection, among others, critically depend
on the ability to measure similarity between networks or between individual nodes
within networks. For example, given an email communications network within
some enterprise, one may want to classify individuals within that network accord-
ing to their functional roles, or, given one individual, find another one playing
the same role. As another example, given a citation network of scientific papers,
one may want to identify different research fields and influential papers that spur
subsequent research within each field.
Computing network similarity requires going beyond comparing networks at
node level to measuring their structural similarity. To characterize network struc-
ture, traditional approaches extract features such as node degrees, clustering coeffi-
cients, eigenvalues, the lengths of shortest paths, and so on [Berlingerio et al., 2012].
However, these hand-crafted features are usually heterogeneous and it is often not
clear how to integrate them within a learning framework. In addition, some graph
features, such as eigenvalues, are computationally expensive and do not scale well
in tasks involving large networks. Recent advances in distributed representation of
nodes [Perozzi et al., 2014, Tang et al., 2015b, Cao et al., 2015, Chang et al., 2015,
Grover and Leskovec, 2016, Wang et al., 2016] in networks created an alternate
framework for unsupervised feature learning of nodes in networks. These meth-
ods are based on the idea of preserving local neighborhoods of nodes with neural
network embeddings. However, these embeddings used for feature representations
limit the scope to the local context of nodes, without exploiting the global context
of the network as a whole to learn embeddings. In order to represent the entire
network, these approaches require us to integrate the representations of all network
nodes, for example, by averaging all node representations. However, not all nodes
should contribute equally to the global representation of the network, and in order
to account for their varying importance, aggregation schemes would need to weigh
dierent nodes, which adds an extra layer of complexity to the learning task.
In this chapter, we address the above-mentioned challenge by introducing a
neural embedding algorithm called Network Vector [Wu and Lerman, 2017b]. Our
algorithm learns distributed representations of networks that account for their
global context. Our algorithm is scalable to real world networks with large numbers
of nodes. Our algorithm can be applied to generic networks such as social networks,
knowledge graphs, and citation networks. The algorithm compresses networks into
real-valued vectors that preserve the network structure, in hopes that the learned
distributed representations can be used to effectively measure network similarity.
Specifically, given two networks, even those with different sizes and topologies, the
distance between the learned vector representations can be used to measure the
structural similarity between the networks. In addition, this approach allows us to
compare individual nodes by looking at the similarity of their ego-networks. An
ego-network contains the focal node and all the nodes it is connected to, plus the
connections between them.
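For instance, assuming the networkx library is available, an ego-network of this kind can be extracted roughly as in the sketch below; the graph and focal node are placeholders.

```python
import networkx as nx

# A small toy graph; in practice this would be the full social or citation network.
G = nx.karate_club_graph()
focal = 0

# Ego-network: the focal node, its neighbors, and the edges among them.
ego = nx.ego_graph(G, focal, radius=1)
print(ego.number_of_nodes(), ego.number_of_edges())
```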
For empirical evaluation, we experiment with several real world datasets from
diverse domains, including knowledge concepts in Wikipedia, an email interaction net-
work, citation networks of legal opinions, a social network of bloggers, a protein-protein
interaction network and a word co-occurrence network. Our first task is role dis-
covery in networks, which aims to retrieve knowledge networks with analogous
relations, or identify individuals serving similar roles within the social network.
Secondly, we perform concept analogy task on Wikipedia. Thirdly, we focus on
a classic predictive task in networks, which is to predict multiple possible labels
of nodes using a classifier trained on node representations. In this way, we may
understand how discriminative the vector representations are. We compare Net-
work Vector with the state-of-the-art feature learning algorithm node2vec [Grover and
Leskovec, 2016] and baselines which compute node degrees, clustering coefficients
and eigenvalues for representing networks. Experiments demonstrate the superior
performance of Network Vector, which considers the global context of the network
for learning neural embeddings of networks.
3.1 Network Vector
In this section, we describe our algorithm Network Vector in detail. We consider
the problem of embedding nodes of a network and the entire network into low-
dimensional vector space. Let $G = \{V, E\}$ denote a graph, where $V$ is the set of
vertices and $E \subseteq V \times V$ is the set of edges with weights $W$. The goal of our
approach is to map the entire network to a low-dimensional vector, represented by
$v_G \in \mathbb{R}^D$, and to map each node $i$ to a unique vector $v_i \in \mathbb{R}^D$ in the same vector
space. Although the dimensionality of the network representation $v_G$ can be different
from that of the node representations $v_i$ in theory, we adopt the same dimensionality
$D$ for the ease of computation in real world applications. Suppose that there are $M$
graphs given (e.g., the ego-networks of $M$ persons of interest in a social network) and
$N$ distinct nodes in the corpus; then there are $(M + N) \times D$ parameters to be
learned.
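Concretely, these parameters can be viewed as two embedding matrices, as in the short sketch below (our own illustration; M, N and D are placeholder values).

```python
import numpy as np

M, N, D = 3, 100, 16                  # graphs, distinct nodes, embedding size
rng = np.random.default_rng(0)

graph_vecs = rng.normal(scale=0.01, size=(M, D))   # one v_G per graph
node_vecs = rng.normal(scale=0.01, size=(N, D))    # one v_i per node

# Total number of learned parameters: (M + N) * D.
print(graph_vecs.size + node_vecs.size == (M + N) * D)
```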
3.1.1 A Neural Architecture
Our approach of learning network representations is inspired by learning dis-
tributed representations of variable-length texts (e.g., sentences, paragraphs and
documents) [Le and Mikolov, 2014]. We consider a network as a generalized "doc-
ument" that consists of distinct nodes from a vocabulary as its "word" content.
The goal is to predict a node given other nodes in its local context as well as the
global context of the network.
Text has a linear property that the local context of a word can be naturally
defined by surrounding words in ordered sequences. However, networks are not lin-
ear. In order to characterize the local context of a node, without loss of generality,
we sample node sequences from the given network with second-order random walks
in [Grover and Leskovec, 2016], which offer a flexible notion of a node's local neigh-
borhood by combining Depth-First Search (DFS) and Breadth-First Search (BFS)
strategies. Our learning framework can easily adopt higher-order random walks,
but with higher computation cost. Each random walk starts from an arbitrary
root node and generates an ordered sequence of nodes with second-order Markov
chains. Specifically, consider a node $v_a$ that has been visited in the previous step,
and the random walk currently reaches node $v_b$. Consecutively, the next node $v_c$
will be sampled in the random walk with probability:
\[
P(v_c \mid v_a, v_b) = \frac{1}{Z} M^{a}_{bc} W_{bc} \tag{3.1}
\]
Figure 3.1: A neural network architecture of Network Vector. It learns the global
network representation as a distributed memory evolving with the sliding window
of node sequences.
where $M^{a}_{bc}$ is the unweighted transition probability of moving from node $v_b$ to $v_c$
given $v_a$, $W_{bc}$ is the weight of edge $(v_b, v_c)$, and $Z$ is the normalization term. We
define $M^{a}_{bc}$ as:
\[
M^{a}_{bc} =
\begin{cases}
\frac{1}{p} & \text{if } d(v_a, v_c) = 0 \\
1 & \text{if } d(v_a, v_c) = 1 \\
\frac{1}{q} & \text{if } d(v_a, v_c) = 2
\end{cases} \tag{3.2}
\]
where $d(v_a, v_c)$ is the shortest path distance between $v_a$ and $v_c$. The parameters $p$
and $q$ control how the random walk biases toward the node visited in the previous step
and toward nodes that are further away. The random walk terminates when $l$ vertices are
sampled, and the procedure repeats $r$ times for each root node.
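A direct, unoptimized rendering of this sampling procedure might look like the following sketch; it recomputes the bias of Eq. (3.2) on the fly rather than precomputing transition tables, and assumes an unweighted toy graph stored as adjacency sets.

```python
import random

def biased_walk(adj, root, length, p=1.0, q=1.0):
    """Sample a node sequence with the second-order bias of Eq. (3.2)."""
    walk = [root]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = list(adj[cur])
        if not neighbors:
            break
        if len(walk) == 1:                      # first step: uniform choice
            walk.append(random.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:                   # bias by distance d(prev, nxt)
            if nxt == prev:
                weights.append(1.0 / p)         # d = 0: return to the previous node
            elif nxt in adj[prev]:
                weights.append(1.0)             # d = 1: stay close to it
            else:
                weights.append(1.0 / q)         # d = 2: move outward
        walk.append(random.choices(neighbors, weights=weights)[0])
    return walk

# Toy unweighted graph as adjacency sets.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
print(biased_walk(adj, root=0, length=6, p=1.0, q=0.5))
```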
Figure 3.1 illustrates a neural network architecture for learning the global net-
work vector and node vectors simultaneously. A sliding window with fixed length
$n$ is used to repeatedly sample node sequences $(v_1, \ldots, v_n)$ from random walk
paths. The algorithm predicts the target node $v_n$ given the preceding nodes $v_{1:n-1}$
as local context and the entire network $G$ as the global context, with probability
$P(v_n \mid v_{1:n-1}, G)$. Formally, the probability distribution of a target node is defined
as:
\[
P(v_n \mid v_{1:n-1}, G) = \frac{1}{Z_c} \exp\left[-E(G, v_{1:n-1}, v_n)\right] \tag{3.3}
\]
where $Z_c = \sum_{v_m \in V} \exp\left[-E(G, v_{1:n-1}, v_m)\right]$ is the normalization term. We extend
the scalable version of the Log-Bilinear model [Mnih and Hinton, 2007], called the vector
Log-Bilinear model (vLBL) [Mnih and Kavukcuoglu, 2013]. In our model, the
energy function $E(G, v_{1:n-1}, v_n)$ is specified as:
\[
E(G, v_{1:n-1}, v_n) = -\hat{v}^{\top} v_n \tag{3.4}
\]
where $\hat{v}$ is the predicted representation of the target node:
\[
\hat{v} = v_G + \left( \sum_{i=1}^{n-1} c_i \odot v_i \right) \tag{3.5}
\]
Here $\odot$ denotes the Hadamard (element-wise) product, and $c_i$ is the weight vector
for the context node in position $i$; $c_i$ parameterizes the context nodes at different
hops away from the target node in random walks. The global network vector $v_G$
is shared across all sliding windows of node sequences. After being trained, the
global network vector preserves the structural information of the network, and
can be used as feature input for the network. In our model, in order to impose
symmetry in the feature space of nodes, and to activate more interactions between the
feature vector $v_G$ and the node vectors, we use the same set of feature vectors for
both the target nodes and the context nodes. This is different from [Mnih and
Kavukcuoglu, 2013], where two separate sets of representations are used for the
target node and the context nodes respectively. In practice, we find our approach
improves the performance of Network Vector.
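Putting Eqs. (3.4)-(3.5) together, the forward pass for one sliding window reduces to a few vector operations, roughly as in this sketch (ours; shapes, indices and values are illustrative).

```python
import numpy as np

D, n = 8, 4                                   # embedding size, window length
rng = np.random.default_rng(0)

v_G = rng.normal(scale=0.1, size=D)           # global network vector
node_vecs = rng.normal(scale=0.1, size=(10, D))
c = rng.normal(scale=0.1, size=(n - 1, D))    # position-dependent weights c_i

context = [2, 5, 7]                           # v_1 .. v_{n-1}
target = 3                                    # v_n

# Eq. (3.5): predicted representation of the target node.
v_hat = v_G + np.sum(c * node_vecs[context], axis=0)

# Score v_hat . v_n (the negative energy of Eq. (3.4)), fed into the softmax of Eq. (3.3).
score = v_hat @ node_vecs[target]
print(float(score))
```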
3.1.2 Learning with Negative Sampling
The global network vector v_G, the node vectors v_i and the position-dependent
context parameters c_i are initialized with random values and optimized by maximizing
the objective in Eq. (3.3). Stochastic gradient ascent is performed to update
the set of parameters $\theta = \{v_G, v_i, c_i\}$:

\Delta\theta = \epsilon \, \nabla_{\theta} \log P(v_n \mid v_{1:n-1}, G)
= \epsilon \, \frac{\partial}{\partial \theta} \log \left[ \frac{\exp(\hat{v}^\top v_n)}{\sum_{v_m=1}^{N} \exp(\hat{v}^\top v_m)} \right]    (3.6)
where $\epsilon$ is the learning rate. This computation involves the normalization term
and is proportional to the number of distinct nodes N, which makes it expensive and
impractical in real applications. In our approach, we adopt negative sampling
[Mikolov et al., 2013b] for optimization. Negative sampling is a simplified version of
noise contrastive estimation [Mnih and Teh, 2012]; it trains a logistic regression to
distinguish data samples of v_n from a "noise" distribution. Our objective is to maximize

\log \sigma(\hat{v}^\top v_n) + \sum_{m=1}^{k} \mathbb{E}_{v_m \sim P_n(v)} \left[ \log \sigma(-\hat{v}^\top v_m) \right]    (3.7)

where $\sigma(x) = 1/(1+\exp(-x))$ is the sigmoid function and P_n(v) is the global unigram
distribution of the training data, acting as the noise distribution from which we draw
k negative samples of nodes. Negative sampling allows us to train the model efficiently,
since it no longer requires the explicit normalization in Eq. (3.6), and hence
is more scalable.
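The NumPy fragment below sketches one negative-sampling update for a single sliding
window, following Eqs. (3.5) and (3.7). The array layout (V, C, v_G), the plain-SGD loop
and the default hyperparameters are our own simplifications for illustration, not the
thesis implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def ns_update(V, C, v_G, context, target, noise_probs, k=5, lr=0.025):
        """One stochastic update of Eq. (3.7) for a single sliding window.

        V: (N, D) shared node vectors, C: (n-1, D) position-dependent weights,
        v_G: (D,) global network vector, context: indices of the n-1 preceding nodes,
        target: index of the predicted node, noise_probs: unigram noise distribution.
        """
        v_hat = v_G + (C * V[context]).sum(axis=0)              # Eq. (3.5)
        negatives = np.random.choice(len(V), size=k, p=noise_probs)
        grad_v_hat = np.zeros_like(v_hat)
        for node, label in [(target, 1.0)] + [(m, 0.0) for m in negatives]:
            g = lr * (label - sigmoid(v_hat @ V[node]))         # gradient of Eq. (3.7)
            grad_v_hat += g * V[node]
            V[node] += g * v_hat                                # update the sampled node's vector
        # Backpropagate the accumulated gradient through Eq. (3.5).
        dC = grad_v_hat * V[context]                            # d v_hat / d c_i = v_i
        dV_ctx = grad_v_hat * C                                 # d v_hat / d v_i = c_i
        v_G += grad_v_hat                                       # global network vector
        C += dC
        V[context] += dV_ctx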
3.1.3 An Inverse Architecture
The architecture in Figure 3.1 uses a linear combination of the global network
vector and the context node vectors to predict the target node in a sliding window.
Another way of training the global network vector is to model the likelihood of
observing a sampled node v_t from the sliding window conditioned on the feature
vector v_G, given by

P(v_t \mid G) = \frac{1}{Z_G} \exp\big[E(G, v_t)\big]    (3.8)

where Z_G is the normalization term specific to the feature representation of G.
The energy function E(G, v_t) is:

E(G, v_t) = v_G^\top v_t    (3.9)
This architecture is a counterpart of the Distributed Bag-of-Words version of Paragraph
Vector [Le and Mikolov, 2014]. However, it ignores the order of the nodes in the
sliding window and performs poorly in practice when used alone. We extend the framework
by simultaneously training network and node vectors using a Skip-gram-like model
[Mikolov et al., 2013a,b]. The model additionally maximizes the likelihood of observing
the local context v_{t-n:t+n} (excluding v_t) for the target node v_t, conditioned on
the feature representation of v_t. Unfortunately, modeling the joint distribution of a
set of context nodes is not tractable. This problem can be relaxed by assuming the
nodes in different context positions are conditionally independent given the target node:

P(v_{t-n:t+n} \mid v_t) = \prod_{i=t-n}^{t+n} P(v_i \mid v_t)    (3.10)

where $P(v_i \mid v_t) = \frac{1}{Z_t} \exp[E(v_t, v_i)]$. The energy function is:

E(v_t, v_i) = v_t^\top (c_i \odot v_i)    (3.11)

The objective is to maximize the log-likelihood of the product of the probabilities
P(v_t | G) and P(v_{t-n:t+n} | v_t).
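The fragment below sketches the per-window objective of the inverse architecture
(Eqs. 3.8-3.11). For readability it uses explicit softmax normalization rather than the
negative-sampling approximation used in training; all names and shapes are our own
illustration.

    import numpy as np

    def log_softmax_score(scores, idx):
        """log of softmax(scores)[idx], computed stably."""
        scores = scores - scores.max()
        return scores[idx] - np.log(np.exp(scores).sum())

    def inverse_objective(V, C, v_G, t, context_idx):
        """Log-likelihood of one window under the inverse architecture.

        V: (N, D) node vectors shared by targets and contexts, C: (2n, D) position
        weights, v_G: (D,) global network vector, t: index of the target node,
        context_idx: indices of the surrounding context nodes (excluding t).
        """
        ll = log_softmax_score(V @ v_G, t)                     # log P(v_t | G), Eq. (3.8)
        for pos, i in enumerate(context_idx):                  # log P(v_i | v_t), Eqs. (3.10)-(3.11)
            ll += log_softmax_score(V @ (C[pos] * V[t]), i)
        return ll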
3.1.4 Complexity Analysis
The computation of Network Vector consists of two key parts: sampling node sequences
with random walks and optimizing the vectors. For each node sequence of fixed length l,
we start from a randomly chosen root node. At each step, the walk visits a new node
based on the transition probabilities P(v_c | v_a, v_b) in Eq. (3.1). These transition
probabilities can be precomputed and stored in memory using O(|E|^2 / |V|) space.
Sampling a new node in the walk can then be done in O(1) time using alias sampling
[Walker, 1977]. Given a network G = {V, E}, the overall time complexity is O(r|V|l)
for repeating r random walks of fixed length l from each root node.
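The alias method referenced above can be sketched as follows; this is a generic
implementation of Walker's technique (table construction plus O(1) draws), not code
from the thesis.

    import random

    def build_alias_table(probs):
        """Precompute Walker's alias table for a discrete distribution `probs`."""
        n = len(probs)
        prob = [p * n for p in probs]           # scale so the average bucket mass is 1
        alias = [0] * n
        small = [i for i, p in enumerate(prob) if p < 1.0]
        large = [i for i, p in enumerate(prob) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l                        # bucket s is topped up by outcome l
            prob[l] -= 1.0 - prob[s]
            (small if prob[l] < 1.0 else large).append(l)
        for i in small + large:                 # leftovers are numerically ~1
            prob[i] = 1.0
        return prob, alias

    def alias_draw(prob, alias):
        """Draw one sample in O(1) time."""
        i = random.randrange(len(prob))
        return i if random.random() < prob[i] else alias[i]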
The time complexity of optimization with negative sampling in Eq. (3.7) is proportional
to the dimensionality of the vectors D, the length of the context window n and the
number of negative samples k. It takes O(nkD) time for the nodes within the sliding
window (v_1, ..., v_n). The introduced global vector v_G requires O(kD) time to
optimize, the same as any other node vector in the sliding window. Given r random
walks of fixed length l starting from every node, the overall time complexity is
O(nkDr|V|l). Storing the node vectors and the global network vector requires
O(D|V| + D) space.
3.1.5 The Property of Network Vector
The property of the global network vector v_G in the architecture shown in Figure 3.1
can be explained by examining the objective in Eq. (3.3). v_G is part of the input to
the neural network and can be viewed as a term that helps represent the distribution of
the target node v_n. The relevant part, v_G^\top v_n, is related logarithmically to the
probability P(v_n | v_{1:n-1}, G). Therefore, the more frequently a particular v_n is
observed in the data, the larger the value of v_G^\top v_n, and hence the closer v_G
will be to v_n in the vector space. The training objective is to maximize the logarithm
of the product of all probabilities P(v_n | v_{1:n-1}, G), whose value is related to
v_G^\top \bar{v}_n, where \bar{v}_n is the expected vector obtained by averaging all
observed v_n in the data. The same holds for Eq. (3.8) in the inverse architecture,
where the global network vector v_G is the only input to the neural network used to
predict every node v_t.
Karate Network
As an illustrative example, we apply Network Vector to the classic Karate network
[Zachary, 1977]. The nodes in the network represent members of a karate club, and the
edges are social links between the members outside the club. There are 34 nodes and 78
undirected edges in total. We use the inverse architecture to train the vectors.
Figure 3.2 shows the output of our method in two-dimensional space. We use green
circles to denote nodes and an orange circle to denote the entire graph. The size of a
node is proportional to its degree in the graph. We can see that the learned global
network vector is close to the high-degree nodes, such as nodes 1 and 34, which serve
as the hubs of the two splits of the club. The resulting global vector mostly
represents the backbone nodes (e.g., hubs) in the network and compensates for the lack
of global information in local neighborhoods.

Figure 3.2: Network Vector is trained on Zachary's Karate network to learn
two-dimensional embeddings for representing nodes (green circles) and the entire graph
(orange circle). The size of each node is proportional to its degree. Note that the
global representation of the graph is close to the high-degree nodes.
Legal Citation Networks
Given a citation network of documents, for example, scientific papers or legal opinions,
we want to identify similar documents. These could be groundbreaking works that serve
to open new fields of discourse in science and law, or foundational works that span
disciplines but have less of an impact on discourse, such as "methods" papers in science.
Table 3.1: A subset of the most cited legal decisions with two distinct patterns of
ego-networks. Two-dimensional embeddings of the legal decisions are learned using
Network Vector with their ego-networks as input, and mapped to the figure on the right.
Citations of the top 6 cases in the table have a few giant hubs (red dots in the
figure), while those of the bottom 6 cases are well connected (blue dots in the figure).
ID      Title                                                 Citation        Year
93272 Chicago & Grand Trunk Ry. Co. v. Wellman 143 U.S. 339 1892
99622 F. S. Royster Guano Co. v. Virginia 253 U.S. 412 1920
103222 Coleman v. Miller 307 U.S. 433 1939
109380 Buckley v. Valeo 424 U.S. 1 1976
118093 Arizonans for Official English v. Arizona 520 U.S. 43 1997
110578 Ridgway v. Ridgway 454 U.S. 46 1981
91704 Yick Wo v. Hopkins 118 U.S. 356 1886
98094 Weeks v. United States 232 U.S. 383 1914
101741 Stromberg v. California 283 U.S. 359 1931
101957 Powell v. Alabama 287 U.S. 45 1932
103050 Johnson v. Zerbst 304 U.S. 458 1938
105547 Roth v. United States 354 U.S. 476 1957
For the purpose of a case study, we collected a large digitized record of federal
court opinions from the CourtListener project (https://www.courtlistener.com/). The
most cited legal decisions of the United States Supreme Court are selected, and
ego-networks of citations are constructed for these legal cases. Two distinct graph
patterns are observed: in one, the citations have a few giant hubs; in the other, the
citations are well connected. We list a few examples in Table 3.1, where the titles of
the cases with the two different citation patterns are colored red and blue,
respectively. The ego-networks of the first six cases listed in Table 3.1 have just a
few giant hubs which are linked by many other cases. For example, the case "Buckley v.
Valeo, 424 U.S. 1 (1976)" is a landmark decision in American campaign finance law. The
case "Coleman v. Miller, 307 U.S. 433 (1939)" is a landmark decision centered on the
Child Labor Amendment, which was proposed for ratification by Congress in 1924. These
cases are generally centered on a specific topic, and their citations may have a narrow
topical focus. There are only a few hubs cited frequently by others, and the citing
cases generally do not cite each other. On the other side, the ego-networks of the last
six cases listed in Table 3.1 have citations that are well connected. For example,
"Yick Wo v. Hopkins, 118 U.S. 356 (1886)" was the first case in which the United States
Supreme Court ruled that a law that is race-neutral on its face, but is administered in
a prejudicial manner, violates the Equal Protection Clause of the Fourteenth Amendment;
the case "Stromberg v. California, 283 U.S. 359 (1931)" is a landmark in the history of
First Amendment constitutional law, extending protection to the substance of the First
Amendment. These cases are influential in history and are cited by many diverse
subsequent legal decisions, which usually cite each other.
Our Network Vector algorithm is used to learn two-dimensional embeddings from the
ego-networks of the legal cases, and their projections are shown as open dots in the
figure on the right of Table 3.1. The structures of the ego-networks for four sampled
Supreme Court legal cases (Case IDs: 110578, 93272, 101957, 105547) are also
illustrated in the figure. Note that each ego-network includes the case itself (not
shown in the figure), all the cases it cites (smaller circle on the right), and all the
cases that cite it (larger circle on the left). Lines represent citations among these
cases. The two groups of ego-networks contrast with each other. Compared to the
ego-networks in red boxes, which are cited by unrelated legal cases, there is clearly
more coherence in the discourse related to the cases in blue boxes, as indicated by
citations among the other Supreme Court cases that cite them. Although the differences
between these two ego-networks could be captured in a standard way, by features related
to the degree distribution of the ego-networks or their clustering coefficients, the
distinctions between other ego-networks may be more subtle, necessitating a new
approach for evaluating their similarity. In this representation, the position of a
case in the learned space captures the similarity of the structure of its ego-network.
Cases that are more similar to the ego-networks in red boxes fall in the top half of
the two-dimensional plane (red open dots), while cases similar to those in blue boxes
fall in the bottom half (blue open dots). Thus, distances between the learned
representations of the ego-networks of legal cases can be used to quantitatively
capture their similarity.
3.2 Experiments
Network Vector learns representations of network nodes and the entire network
simultaneously. We evaluate both representations on predictive tasks. First, we apply
Network Vector to a setting where only local information about nodes, such as their
immediate neighbors, is available. We learn representations for the ego-networks of
individual nodes using Network Vector, and evaluate them on role discovery in social
networks and concept analogy in an encyclopedia. Second, when information about node
connectivity in the entire network is available, we learn node representations using
Network Vector, where the additional global vector for the network is used to help in
learning high-quality node representations. The resulting node representations are
evaluated on multi-label classification.
3.2.1 Role Discovery
Roles reflect individuals' functions within social networks. For example, email
communication or social media within an enterprise reflects employees' responsibilities
and organizational hierarchies [Wu et al., 2013, Chelmis et al., 2013]. An engineer's
interactions with her team are different from those of a senior manager. In the
Wikipedia network, each article cites other concepts that explain the meaning of the
article's concept. Some concepts may "bridge" the network by connecting different
concept categories. For example, the concept Bat belongs to the category "Mammals";
however, since a bat resembles a bird, it refers to many similar articles in the
category "Birds".
Datasets
We use the following datasets in our experiments:

Enron Email Network: This dataset contains email interaction data from about 150
users, mostly senior management of Enron (http://www.cs.cmu.edu/~enron/). There are
about half a million emails communicated by 85,601 distinct email addresses. We have
362,467 links left after removing duplicates and self-links. Each of the email
addresses belonging to Enron employees has one of 9 different positions: "CEO",
"President", "Vice President", "Director", "Managing Director", "Manager", "Employee",
"In House Lawyer" and "Trader". We use the positions as roles. This categorization is
fine-grained. In order to understand how the feature representations can reflect the
properties of different strata in the corporation, we also use the coarse-grained
labels "Leader" (aggregating "CEO", "President", "Vice President"), "Manager"
(aggregating "Director", "Managing Director", "Manager") and "Employee" (including
"Employee", "In House Lawyer" and "Trader") to divide the users into 3 roles.

Wikipedia for Schools Network: We use a subset of articles available at Wikipedia for
Schools (http://schools-wikipedia.org/). This dataset contains 4,604 articles and
119,882 links between them. The articles are categorized by subject. For example, the
article about Cat is categorized as "subject.Science.Biology.Mammals". We use one of 15
second-level category names (e.g., "Science" in the case of Cat) as the role label.
Methods for Comparison
For real-world networks, such as email networks, information about all node
connectivity may not be fully available, e.g., for privacy reasons. For this reason, we
explore the prediction task with local information only (i.e., immediate neighbors).
For each node, we first generate its ego-network, which is the induced subgraph of its
immediate neighbors, and learn global vector representations for the set of
ego-networks with Network Vector. We use the architecture of Eq. (3.3). In our
experiments, we repeat the random walks r = 10 times from each root node, and the
length of each random walk is fixed at l = 80. For comparison, we evaluate the
performance of Network Vector against the following network feature-based algorithms
[Berlingerio et al., 2012]:

Degrees: number of nodes and edges, average node degree, maximum "in" and "out" node
degrees. The degree features are aggregated to form the representations of the
ego-networks.
Table 3.2: Performance on the role discovery task. Precision of the top k ranked
results (P@k) is reported.

                              Wiki - 15 roles       Email - 9 roles       Email - 3 roles
Method                        P@1   P@5   P@10      P@1   P@5   P@10      P@1   P@5   P@10
Degrees+Clustering+Eigens     0.160 0.149 0.146     0.090 0.102 0.083     0.210 0.200 0.196
node2vec                      0.231 0.224 0.218     0.290 0.280 0.268     0.500 0.498 0.474
Network Vector                0.607 0.560 0.522     0.290 0.298 0.281     0.520 0.498 0.483
Clustering Coefficients: measure the degree to which nodes tend to cluster. We compute
the global clustering coefficient and the average clustering coefficient of nodes to
represent each ego-network.

Eigens: For each ego-network, we compute the 10 largest eigenvalues of its adjacency
matrix.

node2vec [Grover and Leskovec, 2016]: This approach learns low-dimensional feature
representations of nodes in a network by interpolating between BFS and DFS when
sampling node sequences. Parameters p and q are introduced to control the likelihood of
revisiting a node in walks, and to discourage or encourage outward exploration,
resulting in BFS- or DFS-like sampling strategies. It is interesting to note that when
p = 1 and q = 1, node2vec reduces to DeepWalk [Perozzi et al., 2014], which uses
uniform random walks. We adapt node2vec and use the mean of the learned node vectors to
represent each ego-network.
Results
Given a node's ego-network, we rank the other nodes' ego-networks by their distance to
it in the vector space of feature representations. Table 3.2 shows the average
precision of retrieved nodes with the same roles (class labels) at cut-offs
k = 1, 5, 10. For simplicity, cosine similarity is used to compute the distance between
two nodes. From the results, we can see how the global context allows Network Vector to
outperform node2vec in role discovery. However, the performance gain depends on the
dataset. We observe that Network Vector performs slightly better than node2vec on the
Enron email interaction network, while the improvement is over 150% on the Wikipedia
network. Compared to the combination of Degrees, Clustering Coefficients and
Eigenvalues, the improvements of the two learning algorithms, Network Vector and
node2vec, are outstanding, with over 100% performance gains in all cases.

Table 3.3: Percentage of analogy tuples for which the answers are hit within the top
k results (Hit@k).

Method                        Hit@1   Hit@5   Hit@10
Degrees                        1.47    4.23    7.17
Clustering Coefficients        0.06    0.43    0.86
Eigens                         0.25    0.74    1.16
Degrees+Clustering+Eigens      1.53    4.53    8.03
node2vec                      24.50   50.98   61.50
Network Vector                28.49   56.19   69.30
Gain over node2vec            16.28%  10.21%  12.68%
3.2.2 Concept Analogy
We also evaluate the feature representations of ego-networks on an analogy task. For
the Wikipedia network, we follow the word analogy task defined in [Mikolov et al.,
2013a]. Given a pair of Wikipedia articles describing two concepts (a, b), and an
article describing another concept c, the task aims to find a concept d such that a is
to b as c is to d. There is only one correct answer d for each tuple (a, b, c). For
example, Europe is to euro as USA is to dollar. This analogy task can be solved by
finding the concept that is closest to v_b - v_a + v_c in the vector space, where the
distance is computed using cosine similarity. Because there is only one possible
concept answer d for each analogy tuple (a, b, c), we use the percentage of tuples for
which the algorithm hits the correct answer d in the top results. Specifically, we rank
the other vectors by similarity and determine whether the answer d is hit within the
top k = 1, 5, 10 positions (referred to as Hit@k). In this task, we cross-validate the
parameters, and empirically fix the dimensionality of the vectors at 100 and the
context window at 10.

There are 1,632 semantic tuples in the Wikipedia network matched to the semantic pairs
in [Mikolov et al., 2013a]. We use them as the evaluation benchmark. Table 3.3 shows
the results. We can see that Network Vector performs much better than the baselines,
which use degrees, clustering coefficients and eigenvalues of the adjacency matrix. The
combination of heterogeneous features (degrees, clustering coefficients and
eigenvalues) on different scales makes it difficult to apply an effective distance
metric. Network Vector does not suffer from this problem because the features are
learned automatically under a single objective function.
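A minimal sketch of the analogy query and Hit@k metric described above, assuming a
matrix of learned ego-network vectors; the function names are illustrative.

    import numpy as np

    def analogy(E, a, b, c):
        """Return indices ranked by cosine similarity to v_b - v_a + v_c.

        E: (num_concepts, D) matrix of learned vectors; a, b, c: row indices.
        The query concepts themselves are excluded from the ranking.
        """
        query = E[b] - E[a] + E[c]
        sims = (E @ query) / (np.linalg.norm(E, axis=1) * np.linalg.norm(query) + 1e-12)
        sims[[a, b, c]] = -np.inf          # do not return the query concepts
        return np.argsort(-sims)

    def hit_at_k(E, tuples, k):
        """Fraction of (a, b, c, d) tuples whose answer d is within the top-k results."""
        hits = sum(d in analogy(E, a, b, c)[:k] for a, b, c, d in tuples)
        return hits / len(tuples)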
3.2.3 Multi-label Classification
Multi-label classification is a challenging task where each node may have one or
multiple labels. A classifier is trained to predict multiple possible labels for each
test node. In our Network Vector algorithm, the global representation of the entire
network serves as additional context, along with the local neighborhood, in learning
node representations.
Datasets
To understand whether the global representation helps in learning better node
representations, we perform multi-label classification with the same benchmarks and
experimental procedure as [Grover and Leskovec, 2016], using the same datasets:

BlogCatalog [Zafarani and Liu, 2009, Tang and Liu, 2009a]: This is a network of social
relationships provided by bloggers on the BlogCatalog website. The labels represent the
interests of bloggers on a list of topic categories. There are 10,312 nodes, 333,983
edges in the network and 39 distinct labels for nodes.

Protein-Protein Interactions (PPI) [Breitkreutz et al., 2008]: This is a subgraph of
the entire PPI network for Homo Sapiens. The node labels are obtained from the hallmark
gene sets [Liberzon et al., 2011] and represent biological states. There are 3,890
nodes, 76,584 edges in the network and 50 distinct labels for nodes.

Wikipedia Cooccurrences [Mahoney, 2009, Grover and Leskovec, 2016]: This is a network
of words appearing in the first million bytes of the Wikipedia dump. The edge weight is
defined by the co-occurrence of two words within a sliding window of length two. The
Part-of-Speech (POS) tags [Marcus et al., 1993] inferred using the Stanford POS-Tagger
[Toutanova et al., 2003] are used as labels. There are 4,777 nodes, 184,812 edges in
the network and 40 distinct labels for nodes.
Methods for Comparison
We compare the node representations learned by Network Vector against the following
feature learning methods for node representations:

Spectral clustering [Tang and Liu, 2011]: This method computes the D smallest
eigenvectors of the normalized graph Laplacian matrix and uses them as the
D-dimensional feature representations of the nodes.

DeepWalk [Perozzi et al., 2014]: This method learns D-dimensional feature
representations using Skip-gram [Mikolov et al., 2013a,b] from node sequences that are
generated by uniform random walks from the source nodes of a graph.
Table 3.4: Macro-F1 scores for multi-label classification with a balanced 50% split
between training and testing data. Results of Spectral Clustering, DeepWalk, LINE and
node2vec are as reported in [Grover and Leskovec, 2016].

Algorithm                        Blogcatalog    PPI        Wikipedia
Spectral Clustering              0.0405         0.0681     0.0395
LINE                             0.0784         0.1447     0.1164
DeepWalk                         0.2110         0.1768     0.1274
node2vec (p*, q*)                0.2581         0.1791     0.1552
Network Vector (p=1, q=1)        0.2473         0.1938     0.1388
Network Vector (p*, q*)          0.2607         0.1985     0.1765
Settings (p*, q*)                0.25, 0.25     4, 1       4, 0.5
Gain over DeepWalk               12.4%          9.6%       8.9%
Gain over node2vec               1.0%           9.7%       13.7%
LINE [Tang et al., 2015b]: This method learns D-dimensional feature representations by
sampling nodes at 1-hop and 2-hop distance from the source nodes in a BFS-like manner.

node2vec [Grover and Leskovec, 2016]: We use the original node2vec algorithm with the
optimal parameter settings of (p, q) reported in [Grover and Leskovec, 2016].

Network Vector employs second-order random walks, and utilizes only first-order or
second-order proximity between nodes in a two-layer neural embedding framework. The
first layer computes the context feature vector, and the second layer computes the
probability distribution of target nodes. In this respect it is similar to the other
neural embedding based feature learning methods DeepWalk, LINE and node2vec. For a fair
comparison, we exclude the recent approaches GraRep [Cao et al., 2015], HNE [Chang et
al., 2015] and SDNE [Wang et al., 2016]: GraRep utilizes information from network
neighborhoods beyond second-order proximity, and both HNE and SDNE employ deep neural
networks with more than two layers. GraRep, HNE and SDNE are less efficient in
computation and do not scale as well as DeepWalk, LINE, node2vec and our algorithm
Network Vector.
For a fair comparison, we use the inverse architecture of Network Vector, which is
similar to node2vec. The parameter settings for Network Vector favor node2vec, and are
exactly the same as in [Grover and Leskovec, 2016]. Specifically, we set D = 128,
r = 10, l = 80, and a context size n = 10, which are aligned with the typical values
used for DeepWalk and LINE. A single pass over the data (one epoch) is used for
optimization. To perform multi-label classification, the learned node representations
from each approach are used as feature input to a one-vs-rest logistic regression with
L2 regularization. Our experiments are repeated for 10 random equal splits of training
and test data, and average results are reported.
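A minimal scikit-learn sketch of the evaluation protocol just described (one-vs-rest
L2-regularized logistic regression on the learned embeddings, averaged over random
equal splits). The exact protocol of [Grover and Leskovec, 2016] may differ in details;
the variable names and library choices here are our own.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    def evaluate_embeddings(X, Y, n_repeats=10, train_frac=0.5, seed=0):
        """X: (num_nodes, D) learned node vectors; Y: (num_nodes, num_labels) binary label matrix."""
        macro, micro = [], []
        for rep in range(n_repeats):
            X_tr, X_te, Y_tr, Y_te = train_test_split(
                X, Y, train_size=train_frac, random_state=seed + rep)
            clf = OneVsRestClassifier(LogisticRegression(penalty="l2", max_iter=1000))
            clf.fit(X_tr, Y_tr)
            Y_pred = clf.predict(X_te)
            macro.append(f1_score(Y_te, Y_pred, average="macro"))
            micro.append(f1_score(Y_te, Y_pred, average="micro"))
        return np.mean(macro), np.mean(micro)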
Results
Macro-F1 scores are used as the evaluation metric. Table 3.4 shows the results, where
the performance of Spectral Clustering, DeepWalk, LINE and node2vec is as reported in
[Grover and Leskovec, 2016]. We run Network Vector with node sequences generated by the
biased random walks of node2vec. The default parameter setting (p = 1, q = 1) of
DeepWalk and the optimal parameter settings of node2vec reported in [Grover and
Leskovec, 2016] are used.

From the results, we can see that Network Vector outperforms node2vec using the same
biased random walks, and DeepWalk using the same uniform random walks. It is evident
that the global representation of the entire network allows Network Vector to exploit
the global structure of the networks to learn better node representations. Network
Vector achieves a slight performance gain of 1.0% over node2vec, and a significant
12.4% gain over DeepWalk, on BlogCatalog. On PPI, the gains of Network Vector over
node2vec and DeepWalk are significant and similar, 9.7% and 9.6% respectively. In the
case of the word co-occurrence network of Wikipedia, Network Vector outperforms
node2vec by a decent margin, achieving a 13.7% performance gain, with a smaller gain of
8.9% over DeepWalk. Overall, the sampling strategies of node2vec, even with optimal
parameter settings (p, q), are limited to exploring the local neighborhoods of the
source nodes and cannot exploit the global network structure well. Network Vector
overcomes this limitation of locality: by utilizing an additional vector to memorize
the collective information from all the local neighborhoods of nodes, even those within
2 hops, Network Vector learns improved node representations that respect the global
network structure.
Parameter Sensitivity
In order to understand how Network Vector improves node representation learning with
biased random walks in fine-grained settings, we evaluate performance while varying the
parameter settings of (p, q). We fix p = \infty to avoid revisiting the source node
immediately while sampling, and vary the value of q in the range from 2^{-4} to 2^{4}
to perform BFS- or DFS-like sampling to various degrees. Figure 3.3 shows the
comparison results for Network Vector and node2vec in both Macro-F1 and Micro-F1
scores. As we can see, Network Vector consistently outperforms node2vec under different
settings of q on all three datasets. However, we observe that on BlogCatalog, Network
Vector achieves relatively larger gains over node2vec when q is large, so that the
random walks are biased toward BFS-like sampling, compared to when q is small and
sampling is more DFS-like. This is mainly because, when the random walks are biased
toward nodes close to the source nodes, the global information about the network
structure exploited by Network Vector can better compensate for the locality of
BFS-like sampling. However, when q is small, the random walks are biased toward
sampling nodes far away from the source nodes, and already explore information closer
to the global network structure. Hence, Network Vector is not as helpful in this case.
Figure 3.3: Performance evaluation of Network Vector and node2vec when varying the
parameter q while fixing p = \infty to avoid revisiting the source nodes. Macro-F1 and
Micro-F1 scores for multi-label classification with a balanced 50% train-test split are
reported.
We can see similar patterns in the performance margin between Network Vector and
node2vec as q grows large in the word co-occurrence network of Wikipedia. However, in
the case of PPI, the performance gains achieved by Network Vector over node2vec are
stable across the various values of q. The reason is probably that the biological
states of proteins in a protein-protein interaction network exhibit a high degree of
homophily, since proteins in a local neighborhood usually organize together to perform
similar functions. Hence, the global network structure is not very informative for
predicting the biological states of proteins when we set a large value of q.

Figure 3.4: Performance evaluation of Network Vector and node2vec when varying the
fraction of labeled data used for training.
Effect of Training Data
To see the effect of the amount of training data, we compare performance while varying
the fraction of labeled data from 10% to 90%. Figure 3.4 shows the results on PPI. The
parameters (p, q) are fixed at their optimal values (4, 1). As we can see, when using
more labeled data, the performance of node2vec and Network Vector generally increases.
Network Vector achieves its largest gains over node2vec, 9.0% in Macro-F1 score and
10.3% in Micro-F1 score, at 40% labeled data. When only 10% of the labeled data is
used, Network Vector yields only a 1.5% gain in Macro-F1 score and 7.1% in Micro-F1
score. We have similar observations on the BlogCatalog and Wikipedia datasets, and
those results are not shown.
3.3 Conclusion
We presented Network Vector, an algorithm for learning distributed representations of
networks. By embedding networks in a lower-dimensional vector space, our algorithm
allows for the quantitative comparison of networks. It also allows for the comparison
of individual nodes in networks, since each node can be represented by its ego-network:
a network containing the node itself, its network neighbors, and all connections
between them.

In contrast to existing embedding methods, which learn a network representation by
aggregating the learned representations of its component nodes, Network Vector directly
learns the representation of an entire network. Learning a representation of a network
allows us to evaluate the similarity between two networks or two individual nodes,
which enables us to answer questions that were difficult to address with existing
methods. Given an individual in a social network, for example a manager within an
organization, can we identify other people serving a similar role within that
organization? Also, given a connection denoting some relationship between two people
within a social network, can we find another pair in an analogous relationship? Beyond
social networks, we can also answer new questions about knowledge networks that connect
concepts or documents to each other, for example, Wikipedia and citation networks. This
can be especially useful in cases where the content of documents is not available, for
privacy or other reasons, but the network of interactions exists.

For networks in which content is available for the nodes, the learning method could be
extended to account for it. For example, for knowledge networks, the approach could be
combined with text to learn representations of networks that give a more fine-grained
view of their similarity. Additionally, other non-textual attributes could also be
included in the learning algorithm. The flexibility of such learning algorithms makes
them ideal candidates for applications requiring similarity comparison of different
types of objects.
Chapter 4
Modeling Networked Documents
In a world of infinite choice, context, not content, is king.
-- Chris Anderson
Text mining applications use quantitative representations of documents to analyze and
compare them to one another. One popular approach to text modeling represents each
document as a vector of its word frequencies. Due to its conceptual simplicity and
computational efficiency, this approach is widely used in information retrieval, text
summarization, and personalized recommendation. However, representing documents by
their word frequencies has significant disadvantages that limit the utility of this
representation. The principal one is that word frequencies fail to capture word
meaning. Individual words may be highly ambiguous: the same word can often mean
different things, and different words frequently have the same meaning. To address this
challenge, modern text analysis methods (e.g., n-grams and topic models) take advantage
of the context of words in document representations, where a word's context is provided
by the neighboring words and phrases in its sentences and by co-occurring words in the
same document. They then use quantitative methods to find statistical dependencies
among words across documents. The intuition is that the surrounding words in a
document, though themselves ambiguous, collectively help to pin down a given word's
meaning.

However, words often have a deeper context that extends beyond nearby words, phrases,
and sentences in the same document to other relevant documents and concepts. For
example, the online encyclopedia Wikipedia is composed of a network of Web pages
describing interconnected concepts and entities. Figure 4.1 illustrates a small portion
of Wikipedia related to the concept Cat. The text of the page describing Cat references
other pages describing the concepts Felidae, Mammal and Vermin. Similarly, the
Wikipedia page describing the concept Dog links to related pages describing Wolf, Pet,
and Carnivora. In order to get a complete picture of what Cat and Dog are, one has to
read the descriptions in the linked pages. Similarly, in order to understand a
scientific article, a reader must rely not only on the text of the article, but also on
the background knowledge and supporting evidence that is described in other articles.
Thus, the meaning the reader perceives is not simply derived from the words and
sentences appearing in the article, but is the inferential result of the connections
the article makes to other articles and the concepts expressed in them. These
connections, and the text used by them to express the concepts and themes of the
documents, provide a deeper context for understanding the current document.
Automatically learning these deep contexts from data will create better models of text
documents that not only help to better solve existing tasks, such as finding documents
that are similar to a given document, but also address novel text mining tasks that
existing tools cannot solve.

Figure 4.1: Wikipedia pages Cat and Dog with representative notions they refer to.
In this chapter, we address the problem of learning the deep contexts of networked
documents with neural language models [Bengio et al., 2003]. We focus on leveraging
information in document text and in the links that exist between documents in document
networks. We describe a model that captures the generative process of document
creation, in which language and semantics are intricately linked across document
networks. Authors read existing documents to draw inspiration for the vocabulary,
grammar, and style, and learn how to describe the concepts and ideas in their own
works. Language influences propagate between documents through citations and
cross-references, providing a deeper context for understanding the semantics of text.
The question is how to capture the hierarchy of contexts of a word in a sentence,
including the surrounding words, the sentence semantics, the underlying theme of the
article, and the influence from cited articles. To this end, we devise a new language
model that accounts for these contexts in text documents.
4.1 Deep Context Vector
Previous work [Wu et al., 2014] does not take word order into account and is
computationally inefficient due to its deep neural network architecture. To consider
word order and reduce the computation, we devise another algorithm called Deep Context
Vector (DCV) [Wu and Lerman, 2017a], which is based on a shallow neural network
architecture. DCV models word sequences in networked documents using a unified neural
language model. Like other neural language models, ours takes into account the order of
words in a document's sentences and learns similarities between nearby words across
sentences. However, our model goes beyond traditional language models to also take into
account the semantics of other documents in a document network that are linked to the
current document. The basic intuition of our model is that the semantics of linked
documents provide a deeper context for understanding the statistical dependencies
between words in the current document.

To begin with, we review the log-bilinear language model [Mnih and Hinton, 2007], which
serves as the foundation of DCV.
The Log-Bilinear Model
We represent each word w using a D-dimensional real-valued feature vector
$v_w \in \mathbb{R}^D$. The log-bilinear model (LBL) [Mnih and Hinton, 2007, Mnih and
Kavukcuoglu, 2013] specifies a bilinear energy function of a sequence of context words
(w_1, ..., w_{n-1}) and the predicted next word w_n:

E(w_{1:n-1}, w_n) = \hat{v}^\top v_{w_n}    (4.1)

where \hat{v} is the predicted representation of the next word, defined as

\hat{v} = \sum_{i=1}^{n-1} c_i \odot v_{w_i}    (4.2)

where $\odot$ denotes element-wise multiplication, and c_i is the weight vector for the
context word in position i. For symmetry, we use the same set of word representations
for both the words being predicted and the context words.

The resulting probability distribution of the next word is given by a softmax function:

P(w_n \mid w_{1:n-1}) = \frac{1}{Z_c} \exp\big[E(w_{1:n-1}, w_n)\big]    (4.3)

where $Z_c = \sum_{w_n=1}^{K} \exp[E(w_{1:n-1}, w_n)]$ is the context-dependent
normalization factor, and K is the vocabulary size.
4.1.1 Model Architecture
Our model assumes that the generation of the next word in a word sequence depends not
only on the preceding words, but also on the global context of the document and of the
other documents it references (and possibly the documents those references cite, and so
forth). Figure 4.2 illustrates the general architecture, using the word sequence
example "dogs are our best " in the document related to the concept Dog. The concepts
Pet and Cat are also used to predict the next word, "friends", as they are neighbors of
Dog in the document network. More precisely, we first assume there is a global semantic
context for the word sequences in each document d, which we also represent as a
real-valued feature vector v_d. This idea is the same as in [Le and Mikolov, 2014]. We
also consider the semantic influence from the neighborhood of d in the document network
(i.e., documents that can be reached from d within a few hops), and jointly model the
text in documents and their link structure. For simplicity, we learn word and document
representations in the same D-dimensional vector space, with $v_w \in \mathbb{R}^D$ and
$v_d \in \mathbb{R}^D$.

Given a sequence of words (w_1, ..., w_n) in a document d_1, we define the energy
function as:

E(N^m_{d_1}, w_{1:n-1}, w_n) = \hat{v}^\top v_{w_n}    (4.4)
where N^m_{d_1} is the set of neighbors of d_1 that can be reached within m hops along
the links in the document network. We take \hat{v} to be the predicted representation
of the next word:

\hat{v} = \left( \sum_{i=1}^{n-1} c^{(w)}_i \odot v_{w_i} \right) + \frac{1}{|N^m_{d_1}|} \left( \sum_{p=1}^{|N^m_{d_1}|} c^{(d)}_p \odot v_{d_p} \right)    (4.5)

where $d_p \in N^m_{d_1}$, and $c^{(d)}_p \in \mathbb{R}^D$ is the context parameter
defining the weight of d_p. We may reuse c^{(d)}_p for any d_p that can be reached in
the same number of hops.
The probability distribution of the next word is defined as

P(w_n \mid w_{1:n-1}, N^m_{d_1}) = \frac{1}{Z_c} \exp\big[E(N^m_{d_1}, w_{1:n-1}, w_n)\big]    (4.6)

By learning the next word's representation using features from its global document
vector and the linked documents, our model captures the deep context for generating
each word in a document. In this sense, we call the vectors learned by our model Deep
Context Vectors (DCV). However, the model in Eq. (4.6) is computationally expensive due
to the potentially exponentially large number of documents within m hops, N^m_{d_1}. To
reduce the computation, we introduce a practical algorithm that draws node sequences to
characterize the neighborhoods.
Randomized Document Sequence
Figure 4.2: The general model architecture of Deep Context Vector. We refer to it as
DCV-vLBL.

Given a sequence of words (w_1, ..., w_n) within a document d_1, we draw an associated
sequence of documents (d_1, ..., d_{m+1}) from N^m_{d_1} by following the directed
links from document d_1. The document sequence can be sampled based on a probability
distribution $P(d_{1:m+1}) = \prod_{j=1}^{m+1} P(d_j \mid d_{1:j-1})$. For simplicity,
we adopt a first-order random walk scheme. In the first step, we start from the
original document d_1. In each following step, we uniformly sample d_{j+1} from the set
of documents that the current document d_j links to. Although higher-order random walks
are more powerful, we leave studying this direction as future work.
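A minimal sketch of this first-order document-sequence sampling, assuming the document
network is stored as an adjacency list of outgoing links; the names are illustrative.

    import random

    def sample_document_sequence(out_links, d1, m):
        """Sample (d_1, ..., d_{m+1}) by uniformly following outgoing links from d1.

        out_links: dict mapping a document id to the list of documents it links to.
        """
        seq = [d1]
        for _ in range(m):
            targets = out_links.get(seq[-1], [])
            if not targets:            # dead end: stop early
                break
            seq.append(random.choice(targets))
        return seq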
We hence redefine the energy function in Eq. (4.4) as:

E(d_{1:m+1}, w_{1:n-1}, w_n) = \hat{v}^\top v_{w_n}    (4.7)

where \hat{v} is the predicted representation of the next word, and is learned from the
feature vectors of the preceding words as well as the associated documents:

\hat{v} = \left( \sum_{i=1}^{n-1} c^{(w)}_i \odot v_{w_i} \right) + \left( \sum_{j=1}^{m+1} c^{(d)}_j \odot v_{d_j} \right)    (4.8)

Here $c^{(d)}_j \in \mathbb{R}^D$ is hop-dependent and specifies the weights of the
feature vector of the j-th document in the document sequence.
The conditional word distribution is given by

P(w_n \mid w_{1:n-1}, d_{1:m+1}) = \frac{1}{Z_c} \exp\big[E(d_{1:m+1}, w_{1:n-1}, w_n)\big]    (4.9)

where $Z_c = \sum_{w_n=1}^{K} \exp[E(d_{1:m+1}, w_{1:n-1}, w_n)]$ is the normalization
term.
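The fragment below sketches the prediction step of Eqs. (4.7)-(4.9), computing scores
for candidate next words from the word context and the sampled document sequence. The
arrays and names are our own illustration; training would replace the full softmax with
negative sampling, as described in the next section.

    import numpy as np

    def next_word_distribution(Vw, Vd, Cw, Cd, word_ctx, doc_seq):
        """Full softmax of Eq. (4.9) over the vocabulary.

        Vw: (K, D) word vectors, Vd: (num_docs, D) document vectors,
        Cw: (n-1, D) word-position weights, Cd: (m+1, D) hop-dependent weights,
        word_ctx: indices of the n-1 preceding words, doc_seq: indices of (d_1, ..., d_{m+1}).
        """
        v_hat = (Cw[:len(word_ctx)] * Vw[word_ctx]).sum(axis=0) \
              + (Cd[:len(doc_seq)] * Vd[doc_seq]).sum(axis=0)     # Eq. (4.8)
        scores = Vw @ v_hat                                       # Eq. (4.7) for every word
        scores -= scores.max()                                    # numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()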
4.1.2 Optimization with Negative Sampling
Given a training corpus, word sequences (w_1, ..., w_n) of fixed length are sampled
with a sliding window over each document. The word and document vectors v are
initialized with random values and trained using maximum likelihood learning. The
parameter updates are performed using stochastic gradient ascent on the log-likelihood,
and can be derived from Eq. (4.9):

\Delta\theta = \epsilon \, \nabla_{\theta} \log P(w_n \mid w_{1:n-1}, d_{1:m+1})
= \epsilon \, \frac{\partial}{\partial \theta} \log \left[ \frac{\exp(\hat{v}^\top v_{w_n})}{\sum_{w_n=1}^{K} \exp(\hat{v}^\top v_{w_n})} \right]    (4.10)

where $\theta = \{v_w, v_d, c^{(w)}, c^{(d)}\}$ is the set of parameters to be learned,
and $\epsilon$ is the learning rate. The computation of the gradients in Eq. (4.10)
involves the normalization term and is proportional to the vocabulary size K. This
complexity is expensive and impractical in real applications.

We use negative sampling [Mikolov et al., 2013b] to efficiently optimize DCV. The
objective in our model is to maximize:

\log \sigma(\hat{v}^\top v_{w_n}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\hat{v}^\top v_{w_i}) \right]    (4.11)

where $\sigma(x) = 1/(1+\exp(-x))$ is the sigmoid function and P_n(w) is the global
unigram distribution of the training data, acting as the noise distribution from which
we draw k negative samples. Negative sampling allows us to train our model efficiently,
since it no longer requires explicit normalization (e.g., the calculation of gradients
of Z_c in Eq. (4.9)), and hence is more scalable.
4.1.3 Inference and Prediction
After the word and document vectors v_w and v_d are trained, we can use them as
features for word clustering or document classification. For these tasks, the document
vectors can be learned for the whole corpus and fed as features to clustering
algorithms or classifiers. No inference procedure is required in such a transductive
learning setting (which is preferable in real applications). However, to deal with a
new document that is unseen in the training data, we need to perform inference to learn
its vector representation. In this case, we fix the vectors v_w, v_d and the context
parameters c^{(w)}, c^{(d)} that have been learned from the training data. Then we
perform gradient ascent, derived from Eq. (4.11), to learn the vector of the new
document. If the new document d has outgoing links to other unseen documents, our model
jointly learns their vectors in the same way using gradient ascent.

In real applications, one might also be interested in predicting which documents a new
document would link to. This is useful in applications such as citation recommendation
when an author is writing an article. One approach is to first learn the vector for the
new document without any links but with its word content. Then we add an imaginary
document d_x which is one hop from the new document d and learn it using gradient
ascent, obtained from $\nabla_{v_{d_x}} \log P(w_n \mid w_{1:n-1}, d_1, d_x)$. After
learning the vector v_{d_x}, cosine similarity can be used to search the existing
documents in the vector space, and the nearest neighbors of d_x are returned as
citation recommendations for d.

Figure 4.3: An alternative model architecture. We refer to it as DCV-ivLBL, as it
performs inverse language modeling.
One might also be interested in automatic text generation. Given a piece of text in an
"incomplete" document d_1, or a document with only a few initial sentences, what words
would likely come next? In this task, we may use the evidence of which documents d_1
links to and perform inference of the next words given the current word sequence. The
predicted word index k is obtained by solving

\operatorname*{argmax}_k \; P(w_n = k \mid w_{1:n-1}, d_{1:m+1}) \propto \exp(\hat{v}^\top v_{w_n})    (4.12)

Once we predict a new word, the vector representation v_{d_1} needs to be updated using
gradient ascent. Then the sliding window moves on to the next word, and this prediction
process is repeated until we generate enough new words.
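A greedy version of this generation loop is sketched below; it omits the per-step
gradient update of v_{d_1} described above, and the array names are our own illustration.

    import numpy as np

    def generate_greedy(Vw, Vd, Cw, Cd, seed_words, doc_seq, num_new, n):
        """Greedily extend `seed_words` with argmax predictions of Eq. (4.12).

        Vw: (K, D) word vectors, Vd: document vectors, Cw: (n-1, D) word-position
        weights, Cd: hop-dependent weights, doc_seq: sampled document sequence.
        The per-step gradient update of v_{d_1} is omitted for brevity.
        """
        words = list(seed_words)
        doc_part = (Cd[:len(doc_seq)] * Vd[doc_seq]).sum(axis=0)     # document context, Eq. (4.8)
        for _ in range(num_new):
            ctx = words[-(n - 1):]                                   # last n-1 words
            v_hat = (Cw[-len(ctx):] * Vw[ctx]).sum(axis=0) + doc_part
            words.append(int(np.argmax(Vw @ v_hat)))                 # Eq. (4.12)
        return words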
4.1.4 An Alternative Architecture
We explore different model architectures in the hope of improving performance by
ensembling the feature vectors learned with each of them separately. We present an
alternative architecture that uses a scheme similar to Skip-gram [Mikolov et al.,
2013b]. Figure 4.3 shows the architecture, which utilizes the current word to predict
its word context (including preceding and following words); a weighted vector learned
from the document sequence is also used to predict the word context. As it performs
inverse language modeling, we refer to it as DCV-ivLBL; it is a counterpart of the
inverse vector LBL [Mnih and Kavukcuoglu, 2013]. The objective is to maximize the
log-likelihood for a given word sequence (w_{t-b}, ..., w_t, ..., w_{t+b}):

\sum_{-b \le i \le b, i \ne 0} \log P(w_{t+i} \mid w_t) + \sum_{-b \le i \le b, i \ne 0} \log P(w_{t+i} \mid d_{1:m+1})    (4.13)

where b is the context size of words. The current word w_t and the documents d_{1:m+1}
are used separately to predict the context words. The probability distributions are
formulated as

P(w_{t+i} \mid w_t) = \frac{1}{Z^w_c} \exp\big[v_{w_t}^\top (c_i \odot v_{w_{t+i}})\big]    (4.14)

P(w_{t+i} \mid d_{1:m+1}) = \frac{1}{Z^d_c} \exp\left[ \Big( \sum_{j=1}^{m+1} c^{(d)}_j \odot v_{d_j} \Big)^\top \big( c^{(w)}_i \odot v_{w_{t+i}} \big) \right]    (4.15)

where Z^w_c and Z^d_c are the normalization terms. In real applications, we can
concatenate the vectors produced by both the architecture DCV-vLBL in Figure 4.2 and
DCV-ivLBL in Figure 4.3; this generally improves performance over the vectors learned
using an individual architecture.
4.2 Experiments
In this section, we present evaluation results of our model. We begin with a
description of our data sets.
4.2.1 Data Description
We select data sets from different domains, including Wikipedia pages, scientific
papers and legal opinions:

Wikipedia: A dump of Wikipedia pages
(http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) from
October 2015 is used in our experiments. Wikipedia data provides rich text with
interlinked knowledge concepts. The document network is dense, with around 25 out-links
per document on average.

DBLP: We download the DBLP dataset
(http://arnetminer.org/lab-datasets/citation/DBLP_citation_2014_May.zip) [Tang et al.,
2008], which contains a collection of papers with titles and citation links. This
represents a type of document with extremely short text, with only about 9 words per
document.

Legal: We collect a large digitized record of federal court opinions from the
CourtListener project (https://www.courtlistener.com/) in our study. The text of each
legal opinion is mainly the discourse of a legal case. The citation network is sparse,
and the documents are much longer than those of Wikipedia and DBLP, with more than 2.5K
words on average.

The statistics of the three datasets are shown in Table 4.1.
Table 4.1: Statistics of the data sets.

Dataset        Wikipedia        DBLP          Legal
# docs         4,863,781        2,244,021     2,485,004
# links        120,532,205      4,354,534     4,682,528
# links/doc    24.8             1.9           1.8
# words        2,911,172,221    20,741,535    6,391,118,754
# words/doc    598.5            9.2           2,571.8
vocab size     8,350,936        248,376       2,593,557
4.2.2 Methods for Comparison
We compare our model with the following baselines and state-of-the-art approaches for
learning document representations:

BOW: We use the bag-of-words representation as a baseline, which is simply the
frequency of unigrams in a document. We report performance on TF-IDF weighted BOW.

LDA [Blei et al., 2003, Hoffman et al., 2010]: Latent Dirichlet Allocation is used to
learn document-specific topic distributions. By comparing our models with LDA, we may
understand how important it is to model word order and the context of linked documents.

Skip-gram [Mikolov et al., 2013b]: We learn word vectors using the Skip-gram model.
Each document is represented by averaging the vectors of all words in it.

PV [Le and Mikolov, 2014]: Paragraph Vector (PV) learns distributed representations for
variable-length pieces of text, such as sentences, paragraphs and documents. By
comparing our model with PV, we may examine how informative the links are.

DeepWalk [Perozzi et al., 2014]: DeepWalk learns distributed representations of nodes
in a network. In order to understand how informative the word content is, we consider
the word-word co-occurrence network G_ww and the word-document network G_wd. In
addition, we also use the document-document network G_dd to understand the effect of
links.

LINE [Tang et al., 2015b]: LINE models first-order and second-order proximity between
nodes. LINE is optimized with edge sampling and trained on the word-word network G_ww,
the word-document network G_wd and the document-document network G_dd.

We perform experiments on a single machine with 64 CPU cores at 2.3 GHz and 256 GB of
memory. Asynchronous stochastic gradient descent with 40 threads is used to optimize
the models.
4.2.3 Document Classification
In this experiment, we perform document classification and evaluate how discriminative
the learned feature vectors are.
Experiment Setup
For each dataset, all the text, the links of all documents, or their combination
(listed in Table 4.1) are used for learning document representations in an unsupervised
manner. For evaluation, we sample a subset of documents with a balanced category
distribution for each dataset. (1) For the Wikipedia dataset, we randomly select 10,000
samples from each of seven diverse categories, including "Arts", "History", "People",
"Nature", and "Sports". (2) For DBLP papers, we select five research fields,
"Artificial Intelligence", "Computer Graphics", "Computer Networks", "High Performance
Computing" and "Theory", as class labels. For each field, we randomly sample 10,000
papers. (3) In legal opinions, legal code is cited to support the judgement of a case.
Each statute has a standard format, with title, code and section/subsection numbers. We
use the statute title cited most by each case as its category label. We choose the
eight most cited statute titles, namely "Title 8: Aliens and Nationality", "Title 11:
Bankruptcy", "Title 15: Commerce and Trade", "Title 18: Crimes and Criminal Procedure",
"Title 21: Food and Drugs", "Title 28: Judiciary and Judicial Procedure", "Title 29:
Labor" and "Title 42: The Public Health and Welfare". We randomly select 10,000 samples
for each statute title.

The dimensionality of the word and document vectors is fixed at 400 for all learning
models. The number of negative samples is fixed at 5 for Skip-gram, PV, LINE and DCV.
We set the word context window size n = 5 in DCV-vLBL and b = 5 in DCV-ivLBL. The
context window of the document sequence m is fixed at 1, so that we only consider the
immediate neighbors that the current document links to. Increasing m may yield
performance gains for DCV, but at a higher computational cost. To construct the
word-word network G_ww, we consider co-occurrences of words within a context window of
size 5. We also test model performance using the concatenation of document vectors
trained using PV-DBOW and PV-DM, as well as DCV-vLBL and DCV-ivLBL. For LINE, we use
the concatenation of the vectors trained on first-order and second-order proximity
[Tang et al., 2015b].
Results of Document Classification
We use one-vs-rest logistic regression for document classification, and feed it with
the resulting vector representations of documents. Table 4.2 shows the Macro-F1 and
Micro-F1 scores. The results are averaged over 10 random instances of an equal split
between training and testing data on the sampled data. The baseline BOW works well on
all datasets, but with features whose dimensionality is in the millions. The learning
models with 400-dimensional vectors reduce the document feature space by at least 99.8%
on every dataset. Our DCV consistently performs better than Skip-gram, PV, LINE,
DeepWalk and LDA. The performance gain of DCV is significant on DBLP, where the
documents (paper titles) are extremely short. The performance of PV degrades when
modeling short text, as the learned vectors overfit the few words in each document. DCV
does not suffer from this problem, as the links provide discriminative information for
learning better document representations. On the Legal dataset, it is worth noticing
that most documents are long, with thousands of words. DCV is not easily tuned in this
case, but still performs well. LDA performs worse than the other models (e.g.,
Skip-gram, PV and DCV) that exploit word sequences, indicating that word order provides
discriminative information for document classification. In most cases, models that use
text features alone can achieve good performance. The link information slightly helps
document classification, but models using link information alone cannot perform well.
We also present, in Table 4.2, the running time needed to train the vectors of each
method on the Legal dataset. The optimization is performed for 20 epochs. As we can
see, DCV can efficiently train vectors on 2M documents with 6.3B words and 4M links in
about 10 to 23 hours, which is quite scalable.

Table 4.2: Performance of document classification.

                                        Wikipedia            DBLP                 Legal
Modality     Method                     Macro-F1  Micro-F1   Macro-F1  Micro-F1   Macro-F1  Micro-F1   Time
text         BOW                        0.8912    0.8917     0.7192    0.7197     0.8472    0.8481     --
             LDA                        0.8625    0.8628     0.5798    0.5800     0.7861    0.7894     40.2h
             Skip-gram                  0.8834    0.8832     0.7207    0.7212     0.8134    0.8148     15.1h
             PV-DM                      0.8666    0.8674     0.6403    0.6409     0.8204    0.8223     9.8h
             PV-DBOW                    0.8657    0.8662     0.6820    0.6824     0.8126    0.8142     18.9h
             PV-(DM+DBOW)               0.8816    0.8822     0.6876    0.6881     0.8250    0.8263     --
             DeepWalk(G_ww+wd)          0.8634    0.8639     0.6680    0.6685     0.8016    0.8029     40.8h
             LINE(G_ww+wd)              0.8752    0.8757     0.6603    0.6608     0.8144    0.8158     20.3h
links        DeepWalk(G_dd)             0.8028    0.8055     0.5545    0.5422     0.5666    0.5525     2.2h
             LINE(G_dd)                 0.8136    0.8164     0.5533    0.5408     0.5682    0.5540     1.2h
text + links DeepWalk(G_ww+wd+dd)       0.8719    0.8723     0.7285    0.7289     0.8155    0.8168     41.3h
             LINE(G_ww+wd+dd)           0.8903    0.8908     0.7236    0.7241     0.8228    0.8237     20.5h
             DCV-vLBL                   0.8740    0.8745     0.6850    0.6854     0.8354    0.8372     10.2h
             DCV-ivLBL                  0.8897    0.8901     0.7223    0.7227     0.8451    0.8467     23.2h
             DCV-(vLBL+ivLBL)           0.8932    0.8936     0.7311    0.7313     0.8511    0.8526     --
Figure 4.4: Performance w.r.t. the number of vector dimensions.
Dimension of Vectors
We vary the dimensionality of the vectors in DCV, and report the classification
performance on the Wikipedia and Legal datasets in Figure 4.4. With larger
dimensionality, the vectors are trained at a higher computational cost. We can see that
the performance of DCV increases as the dimensionality increases to 200 or 400, which
is suitable for both datasets. However, when the dimensionality increases to 800, the
performance gain is marginal.
4.2.4 Link Prediction
In this experiment, we investigate how much information the learned vectors can
provide for predicting unseen links between documents.
Experiment Setup
In the link prediction task, we hold out 50% of the document links, sampled randomly
from each dataset, and use them as positive samples for testing. For negative samples,
we randomly sample an equal number of document pairs which have no link between them.
DCV is trained using the remaining 50% of the links and all of the text. We compute a
similarity score for each document pair using cosine similarity in the vector space.

Table 4.3: Performance of link prediction.

                                           Wikipedia          DBLP               Legal
Modality       Method                      AP       AUC       AP       AUC       AP       AUC
text           BOW                         0.9592   0.9566    0.7911   0.7440    0.9422   0.9380
               LDA                         0.9256   0.9253    0.7037   0.7038    0.9163   0.9145
               Skip-gram                   0.8541   0.8378    0.8455   0.8181    0.8647   0.8461
               PV-(DM+DBOW)                0.7142   0.6866    0.9146   0.9027    0.8955   0.8703
               DeepWalk(G_ww+wd)           0.7885   0.7465    0.8971   0.8809    0.9005   0.8962
               LINE(G_ww+wd)               0.8423   0.8189    0.9123   0.9027    0.9299   0.9190
links          DeepWalk(G_dd)              0.9234   0.9217    0.9550   0.9569    0.9269   0.9212
               LINE(G_dd)                  0.9125   0.9034    0.9179   0.9040    0.9218   0.9034
text + links   DeepWalk(G_ww+wd+dd)        0.9474   0.9351    0.9861   0.9855    0.9405   0.9357
               LINE(G_ww+wd+dd)            0.9315   0.9246    0.9793   0.9781    0.9344   0.9254
               DCV-(vLBL + ivLBL)          0.9676   0.9645    0.9882   0.9869    0.9663   0.9648
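A minimal sketch of this evaluation: cosine scores for held-out positive pairs and
sampled negative pairs, summarized with Average Precision and AUC via scikit-learn. The
setup and names are our own illustration of the procedure described above.

    import numpy as np
    from sklearn.metrics import average_precision_score, roc_auc_score

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def link_prediction_eval(D, pos_pairs, neg_pairs):
        """D: (num_docs, dim) learned document vectors.
        pos_pairs: held-out linked (i, j) pairs; neg_pairs: unlinked (i, j) pairs."""
        scores = [cosine(D[i], D[j]) for i, j in pos_pairs] + \
                 [cosine(D[i], D[j]) for i, j in neg_pairs]
        labels = [1] * len(pos_pairs) + [0] * len(neg_pairs)
        return average_precision_score(labels, scores), roc_auc_score(labels, scores)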
Results of Link Prediction
Table 4.3 shows the comparison results for all methods. We compute Average Precision
(AP) and the Area Under the ROC Curve (AUC) for evaluation. As we can see, DCV
consistently performs better than the other methods. The performance gains are
significant on the Wikipedia and Legal datasets, which have a dense and a sparse
document-document network, respectively. DCV also performs well on the DBLP dataset,
demonstrating the capacity of DCV to model short text with link information. Exploiting
both text and link information, DeepWalk, LINE and DCV generally perform better than
models using text or links alone. The performance of BOW and LDA is better than that of
Skip-gram and PV in link prediction for the long documents of the Wikipedia and Legal
datasets, but degrades when modeling the short documents of the DBLP dataset. Using
link information, DeepWalk, LINE and DCV are able to boost the performance of
predicting links between short documents.
Figure 4.5: Performance w.r.t. number of hops
Number of Hops
We examine the effect of the number of hops m, i.e., the context window over the
document sequence. Figure 4.5 shows the link prediction results of DCV when varying
the number of hops. We observe that for both the DBLP and Legal datasets, a small
number of hops (in the range of 1–3) is preferable for the task of link prediction.
Sampling long document sequences is neither efficient nor effective in
real applications.
4.2.5 Scalability
Our DCV model is optimized using asynchronous stochastic gradient descent, and
training can be sped up significantly using multiple threads. Table 4.4
shows the word throughput rate and the speed-up achieved with multiple
threads on the Legal data, as well as whether there is any loss of
performance. Optimization is run for a single epoch on the Legal data using a single
machine with 64 CPU cores at 2.3 GHz and 256 GB of memory. The speed-up is
around 10 times for DCV-vLBL, and 22 times for DCV-ivLBL, using 40 threads
compared to a single thread. The classification performance is quite stable as we
use more threads.
Table 4.4: Word throughput rate and speed-up when training DCV using
different numbers of threads.

                       DCV-vLBL                                 DCV-ivLBL
Threads   Words/thread/sec  Speed-up  Macro-F1     Words/thread/sec  Speed-up  Macro-F1
1         328.99k           1.0x      0.7648       68.50k            1.0x      0.7850
2         276.77k           1.6x      0.7666       63.15k            1.8x      0.7873
4         262.80k           3.1x      0.7635       60.28k            3.52x     0.7858
8         228.41k           5.5x      0.7659       57.79k            6.7x      0.7871
16        170.78k           8.3x      0.7635       53.29k            12.44x    0.7880
24        131.18k           9.5x      0.7641       49.35k            17.2x     0.7884
32        105.20k           10.2x     0.7630       44.12k            20.6x     0.7881
40        84.46k            10.3x     0.7671       38.57k            22.4x     0.7884
4.3 Conclusion
We have described a neural language model called Deep Context Vector (DCV) for
linked documents in document networks. We introduced two model architectures
and showed how to simultaneously learn word and document vectors in our frame-
work. By modeling the deep contexts of words in linked documents, our model can
learn better document representations in a low-dimensional vector space. Experi-
mental results showed that our model outperforms other neural language models,
such as Paragraph Vector [Le and Mikolov, 2014], that do not take into account
the links between documents. Interestingly, the vectors learned by DCV, which
have on the order of few hundreds of features, can perform better or comparably
to the bag-of-words model, which uses millions of features.
Chapter 5
Modeling Human Navigation
If one tries to navigate unknown waters,
one runs the risk of shipwreck.
– Albert Einstein
Recent research [West and Leskovec, 2012a,b] shows that human wayfinding in networks,
such as the Web pages of Wikipedia, is mostly very efficient. However, without
good strategies, network navigation may not be optimal [Scaria et al., 2014].
In this thesis, we focus on learning from human navigation trails on the Web, in
the hope of leveraging information about how people cognitively represent networks.
In order to efficiently navigate a network, people use mental representations of
network structure [Watts et al., 2002, Brashears, 2013]. We may be able to learn
these cognitive representations from navigation data to improve our algorithms.
Human navigation trails on the Web arise in various contexts: for example,
Web pages surfed by users along hyperlinks, streams of click-through
URLs associated with a query in a search engine, and movie reviews written by a
user in temporal order, to name a few. The co-occurrences of documents navigated
by a user in a temporal sequence may reveal the relatedness between them, such
as their semantic and topical similarity. In addition, the sequences of words within the
documents introduce another rich and complex source of data, which can be leveraged
together with the document stream information to learn useful and insightful
representations of both documents and words.
In this chapter, we introduce a hierarchical neural language algorithm called
Hierarchical Document Vector (HDV) [Djuric et al., 2015]. HDV can simultaneously
model document streams from human navigation trails as well as the
natural language they contain in one common lower-dimensional vector space. More
precisely, we propose hierarchical models in which document vectors act as units in
a context of document sequences, and also as global contexts of the word sequences
contained within them. The probability distribution of observing a word depends
not only on some fixed number of surrounding words, but is also conditioned on the
specific document. Meanwhile, the probability distribution of a document depends
on the surrounding documents in the stream data from human navigation trails.
5.1 Hierarchical Document Vector
We present HDV for joint modeling of navigated documents and the words contained
within them. Our approach is inspired by methods for learning word vectors
that take advantage of the word order observed in a sentence [Mikolov et al., 2013a].
In addition, and unlike the similar work presented in [Le and Mikolov, 2014], we exploit
the temporal neighborhood of sequential documents navigated by users,
and model the probability of a document based on its temporally-neighboring documents,
in addition to the document content. HDV is trained to predict words
and documents in a sequence with maximum likelihood. We optimize the models
using stochastic gradient learning, a flexible and powerful optimization framework
suitable for the considered large-scale data in an online setting where new samples
arrive sequentially.
Figure 5.1: Model architecture with two embedded neural language models (yellow:
document vectors; green: word vectors).
5.1.1 Model Architecture
We assume the training documents are given in a sequence. For example, if the
documents are news articles, a document sequence can be a sequence of news articles
sorted in the order in which the user read (or clicked) them. More specifically,
given a document sequence consisting of $M$ documents $(d_1, d_2, \ldots, d_M)$, each
document has a sequence of words $(w_1, w_2, \ldots, w_T)$ with varying $T$. We aim to
simultaneously learn distributed representations of contextual documents and language words in
a common vector space, and represent each document and word as a continuous
feature vector of dimensionality $D$. Supposing there are $N$ unique documents
in the training corpus and $K$ unique words in the vocabulary, there are $(N + K)D$
model parameters to learn.
The context of the document sequence and the natural language in the documents are
modeled using the proposed hierarchical neural language model, in which document
vectors act not only as the units used to predict their surrounding documents, but also
as the global context of the word sequences within them. Without loss of generality, we
present a typical architecture in Figure 5.1. The architecture consists of
two embedded layers. The upper layer learns the temporal
context of the document sequence, based on the assumption that temporally closer
documents in the document stream are statistically more dependent. The bottom
layer makes use of the contextual information of word sequences. Lastly, we connect
these two layers by adopting the idea of paragraph vectors [Le and Mikolov,
2014], and consider each document token as a global context for all words within the
document.
More formally, given a sampled document $d_m$ in the sliding window, the objective
of the hierarchical model is to maximize the log-likelihood,
$$\sum_{-b \le i \le b,\, i \ne 0} \log P(d_{m+i} \mid d_m) + \alpha \sum_{t=1}^{T} \log P(w_t \mid w_{t-c:t+c}, d_m), \qquad (5.1)$$
where $\alpha$ is the weight that trades off between optimization of the log-likelihood
of the document sequence and the log-likelihood of word sequences (set to 1 in
our experiments), $b$ is the length of the training context for document sequences,
and $c$ is the length of the training context for word sequences. In this particular
architecture, we are using the Continuous Bag-of-Words (CBOW) model in the
lower, sentence layer, and the Skip-gram (SG) model in the upper, document layer
(note that either SG or CBOW models can be used at any level, and the choice
depends on the modalities of the problem at hand). The probability of observing a
surrounding document based on the current document, $P(d_{m+i} \mid d_m)$, is defined using
a softmax function,
$$P(d_{m+i} \mid d_m) = \frac{\exp(\mathbf{v}_{d_m}^{\top} \mathbf{v}'_{d_{m+i}})}{\sum_{d=1}^{N} \exp(\mathbf{v}_{d_m}^{\top} \mathbf{v}'_{d})}, \qquad (5.2)$$
where $\mathbf{v}_d$ and $\mathbf{v}'_d$ are the input and output vector representations of document $d$.
The probability of observing a word depends not only on its surrounding words, but
also on the specific document that the word belongs to. More precisely, the probability
$P(w_t \mid w_{t-c:t+c}, d_m)$ is defined as
$$P(w_t \mid w_{t-c:t+c}, d_m) = \frac{\exp(\bar{\mathbf{v}}^{\top} \mathbf{v}'_{w_t})}{\sum_{w=1}^{K} \exp(\bar{\mathbf{v}}^{\top} \mathbf{v}'_{w})}, \qquad (5.3)$$
where $\mathbf{v}'_{w_t}$ is the output vector representation of $w_t$, and $\bar{\mathbf{v}}$ is the averaged vector
representation of the context (including the specific document $d_m$), defined as
$$\bar{\mathbf{v}} = \frac{1}{2c + 1}\Big(\mathbf{v}_{d_m} + \sum_{-c \le j \le c,\, j \ne 0} \mathbf{v}_{w_{t+j}}\Big). \qquad (5.4)$$
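To make these conditionals concrete, the sketch below evaluates Eq. (5.2) and Eqs. (5.3)–(5.4) with an explicit softmax over toy embedding matrices. It is illustrative only: the model itself is trained with hierarchical softmax (Section 5.1.3), and the matrix names D_in, D_out, W_in and W_out are assumed placeholders for the input/output document and word embeddings.

# Illustrative (non-scalable) computation of the HDV conditionals with an
# explicit softmax. D_in/D_out are input/output document embeddings (N x dim),
# W_in/W_out are input/output word embeddings (K x dim). Names are assumed.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def p_doc_given_doc(m, D_in, D_out):
    # Eq. (5.2): distribution over all N documents given the current document d_m.
    return softmax(D_out @ D_in[m])

def p_word_given_context(context_word_ids, m, D_in, W_in, W_out):
    # Eqs. (5.3)-(5.4): average the document vector with the 2c context word
    # vectors, then take a softmax over the K-word vocabulary.
    v_bar = (D_in[m] + W_in[context_word_ids].sum(axis=0)) / (len(context_word_ids) + 1)
    return softmax(W_out @ v_bar)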
5.1.2 Model Variants
In the previous section we presented a typical architecture with specified language
models in each layer of the hierarchical model. In real applications, we could vary
the language models for different purposes. For example, a news website would
be interested in predicting on the fly which news article a user will read after a
few clicks on some other news stories, in order to personalize the news feed. Then,
it would be more reasonable to use directed, feed-forward models that estimate
$P(d_m \mid d_{m-b:m-1})$ with the formulation
$$P(d_m \mid d_{m-b:m-1}) = \frac{\exp(\bar{\mathbf{v}}_{d_m}^{\top} \mathbf{v}'_{d_m})}{\sum_{d=1}^{N} \exp(\bar{\mathbf{v}}_{d_m}^{\top} \mathbf{v}'_{d})}, \qquad (5.5)$$
where the averaged vector representation for the context of $d_m$ is computed as
$$\bar{\mathbf{v}}_{d_m} = \frac{1}{b} \sum_{j=m-b}^{m-1} \mathbf{v}_{d_j}, \qquad (5.6)$$
and $\mathbf{v}_d$ and $\mathbf{v}'_d$ are the input and output vector representations of $d$.
Moreover, if we have data in which documents are specific to different users
(or authors), we may build more complex models that additionally learn distributed
representations of users by adding a user layer on top of the document layer.
The user vector then serves as a global context for the contextual documents
pertaining to that specific user, much like a document vector serves as a global
context for the words pertaining to that specific document. More precisely, we predict
a document based on the surrounding documents, while also conditioning on a
specific user. This variant models $P(d_m \mid d_{m-b:m-1}, u)$, where $u$ denotes the indicator
for the user (or the author), and we calculate the input vector representation for the
context of $d_m$ with the additional input of the vector representation $\mathbf{v}_u$ for the
individual user:
$$\bar{\mathbf{v}}_{d_m} = \frac{1}{b + 1}\Big(\mathbf{v}_u + \sum_{j=m-b}^{m-1} \mathbf{v}_{d_j}\Big). \qquad (5.7)$$
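As a small illustration of this variant, the sketch below (with hypothetical embedding matrices D_in and U_in) computes the user-conditioned context vector of Eq. (5.7).

# Sketch of the user-conditioned context vector of Eq. (5.7): the average of
# the b preceding document vectors and the user vector. D_in and U_in are
# assumed input embedding matrices for documents and users.
import numpy as np

def user_context_vector(prev_doc_ids, user_id, D_in, U_in):
    b = len(prev_doc_ids)
    return (U_in[user_id] + D_in[prev_doc_ids].sum(axis=0)) / (b + 1)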
In a setting with user interactions or group behaviors, we may also model
$P(u_n \mid u_{n-e:n-1})$ for a group of $e$ users $u_{n-e:n-1}$ (without a particular order) in a
similar manner:
$$P(u_n \mid u_{n-e:n-1}) = \frac{\exp(\bar{\mathbf{v}}_{u_n}^{\top} \mathbf{v}'_{u_n})}{\sum_{j=n-e}^{n-1} \exp(\bar{\mathbf{v}}_{u_n}^{\top} \mathbf{v}'_{u_j})}. \qquad (5.8)$$
Learning vector representations of users would be useful for further improvement
of personalization and social representation.
5.1.3 Model Optimization
The model is optimized using stochastic gradient ascent. However, the computation
of the gradient $\nabla \log P(d_{m+i} \mid d_m)$ in Eq. (5.2) is proportional to the number of
distinct documents $N$, and that of $\nabla \log P(w_t \mid w_{t-c:t+c}, d_m)$ in Eq. (5.3) is
proportional to the vocabulary size $K$. This is computationally expensive in practice,
since both $K$ and $N$ can easily reach millions.
An efficient alternative that we use is hierarchical softmax [Morin and Bengio,
2005], which in our case reduces the time complexity to $O\big(R \log(K) + 2bM \log(N)\big)$
for each document sequence, where $R$ is the total number of words in the
document sequence and $M$ is the total number of documents in the sequence.
Instead of evaluating each distinct word or document in a different entry of the
output, hierarchical softmax uses two binary trees, one with distinct documents
as leaves and the other with distinct words as leaves. Each leaf node is assigned a
unique path, and the path is encoded using binary digits. To construct
the tree structure, a Huffman tree is typically used, so that more frequent words
(or documents) in the data have shorter codes. The internal tree nodes are represented
as real-valued vectors of the same dimensionality as the word and document vectors.
More precisely, hierarchical softmax expresses the probability of observing the
current document (or word) in the sequence as a product of probabilities of the
binary decisions specified by the Huffman code of the document,
$$P(d_{m+i} \mid d_m) = \prod_{l} P(h_l \mid q_l, d_m), \qquad (5.9)$$
where $h_l$ is the $l$-th bit in the code with respect to $q_l$, the $l$-th node on the
tree path assigned to $d_{m+i}$. The probability of each binary decision is defined as
$$P(h_l = 1 \mid q_l, d_m) = \sigma(\mathbf{v}_{d_m}^{\top} \mathbf{v}_{q_l}), \qquad (5.10)$$
where $\sigma(x)$ is the sigmoid function, $\mathbf{v}_{d_m}$ is the single representation of $d_m$, and
$\mathbf{v}_{q_l}$ is the vector representation of node $q_l$. It can be verified that
$\sum_{d=1}^{N} P(d_{m+i} = d \mid d_m) = 1$, and hence the property of a probability distribution
is preserved. Similarly, we can express $P(w_t \mid w_{t-c:t+c}, d_m)$ in the same manner,
but with the construction of a separate, word-specific Huffman tree.
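The sketch below illustrates how Eqs. (5.9)–(5.10) replace a softmax over all N documents with a product of sigmoids along a Huffman path. The per-target path, its code bits and the internal-node vectors (path, code, Q) are assumed to come from a separate Huffman-tree construction step that is not shown.

# Illustration of hierarchical softmax, Eqs. (5.9)-(5.10): the probability of a
# target document is a product of binary decisions along its Huffman path.
# `path` holds internal-node indices, `code` the corresponding 0/1 bits, and
# Q is the matrix of internal-node vectors (all assumed to be given).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_target_given_context(v_dm, path, code, Q):
    prob = 1.0
    for node, bit in zip(path, code):
        p_one = sigmoid(np.dot(v_dm, Q[node]))        # P(h_l = 1 | q_l, d_m)
        prob *= p_one if bit == 1 else (1.0 - p_one)  # product in Eq. (5.9)
    return prob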
5.2 Experiments
In this section, we describe our experimental evaluation and present empirical
results. First, we validated the learned representations on a public movie rating
dataset from MovieLens (https://movielens.org/), where the task was to classify
movies into different genres based on the movie reviews. Then, we used a dataset
of user click-through logs on news stories collected from Yahoo News
(http://news.yahoo.com/) to showcase the wide application potential of the proposed
approach.
5.2.1 Methods for Comparison
We evaluate our model HDV and compare it against the following methods for
document representation:

LDA [Blei et al., 2003, Hoffman et al., 2010]. The popular topic model
LDA assumes a Dirichlet prior over the topic distributions of documents. By
comparing our models with LDA, we may understand how important it is
to model word order and the context of document streams from human
navigation trails.

PV [Le and Mikolov, 2014]. Paragraph Vector extends [Mikolov et al., 2013b]
to learn distributed representations for variable-length pieces of text, such as
sentences, paragraphs and documents. If we do not consider the context of
document sequences (without the yellow vectors in Figure 5.1), our HDV model
reduces to the distributed memory version of PV [Le and Mikolov, 2014]. By
comparing with this reduced model, we can understand how informative the
word content is for learning distributed representations of documents. It is
worth noting that the original CBOW or Skip-gram [Mikolov et al., 2013b] can
be used to learn an averaged vector of all words to represent documents, but
its performance is not comparable to PV, as shown in [Le and Mikolov, 2014].

SG [Mikolov et al., 2013b]. We use Skip-gram to model the document
sequences only, without the word content (without the green vectors in Figure
5.1). In this reduced model, each document is modeled as a token without
word content. By comparing with this reduced model, we can understand
how informative the document sequences are for learning distributed
representations of documents.
5.2.2 Movie Genre Classification on MovieLens
In the first set of experiments we validated the quality of the obtained distributed
document representations on a classification task. To this end, we combined the public
movie ratings data set MovieLens 10M (http://grouplens.org/datasets/movielens/),
consisting of movie ratings for around 10,000 movies generated by more than 71,000
users, with a movie synopses data set found online
(ftp://ftp.fu-berlin.de/pub/misc/movies/database/). Each movie is tagged as belonging
to one or more genres, such as "action" or "horror". Then, following the terminology
from the earlier sections, we viewed movies as "documents" and synopses as "document
content". The document streams were obtained by taking, for each user, the movies
rated 4 and above and ordering them by the timestamp of the rating. This resulted in
69,702 document sequences comprising 8,565 movies.

We learned movie vector representations for the described data set using all of the
above-mentioned methods, where movie sequences are used as "documents" and
movies as "words".
Table 5.1: Accuracy (%) on movie genre classification

Method  Drama  Comedy  Thriller  Romance  Action  Crime  Adventure  Horror
LDA     55.44  58.56   81.58     81.73    87.45   86.85  87.65      90.63
PV      63.67  67.67   79.58     79.19    81.93   85.37  85.24      86.99
SG      71.72  74.49   81.02     82.04    86.27   86.92  87.68      92.31
HDV     72.74  74.87   82.01     82.33    88.14   87.28  88.54      98.72

The dimensionality of the embedding space was set to 100 for all
low-dimensional embedding methods, and the neighborhood of the neural language
modeling methods was set to 5. We used a linear SVM [Chang and Lin, 2011] to
predict movie genres in order to reduce the effect of the variance of non-linear methods
on the results.
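A minimal sketch of this per-genre evaluation, assuming the learned movie vectors are stacked in a matrix X and y holds the binary labels for one genre, could use scikit-learn's LinearSVC (in place of the LIBSVM package cited above) with 5-fold cross-validation:

# Sketch of the per-genre evaluation: a linear SVM on the learned movie
# vectors with 5-fold cross-validation. X (n_movies x dim) and y (0/1 labels
# for one genre) are assumed to be prepared beforehand.
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def genre_accuracy(X, y):
    clf = LinearSVC(C=1.0)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    return scores.mean()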
The classification results with 5-fold cross validation are shown in Table 5.1.
We report results on eight binary classification tasks for the eight most frequent movie
genres in the data set. We can see that the neural language models achieved higher
accuracy than LDA on average, although LDA achieved very competitive results on
the last six tasks. It is interesting to observe that Skip-gram (modeling document
sequences only, without word content) obtained higher accuracy than PV (with
word content but without document sequences), despite the fact that the latter
was specifically designed for document representation; this indicates that
users have strong genre preferences that were exploited through the document sequences.
We can see that the proposed method achieved higher accuracy than the competing
methods, obtaining on average 5.62% better performance than the state-of-the-art
PV and 1.52% better than Skip-gram. This can be explained by the fact that the
method successfully exploited both the document content and the relationships
between documents in a stream.
5.2.3 Applications on Yahoo News
In this section we show the wide potential of the proposed method for online applications,
using a large-scale data set collected at the servers of Yahoo News. The data
consists of about 200,000 distinct news stories, viewed and clicked through by 80
million users from March to June 2014. We consider the news articles temporally
clicked through by each user as one long document sequence. Finer-granularity
processing, such as clipping the long document sequence into short ones based on each
user's activities within a time session (e.g., half an hour), can improve modeling
performance, but at the cost of information loss, since the connections between some
documents at the session boundaries might be cut off. We did not observe improved
performance using finer-grained document sequences, and hence keep them long (one
per user). After preprocessing, where we removed the stopwords, we trained our
model on 80 million document sequences generated by users, containing a total of
100 million words and with a vocabulary size of 161 thousand. In all experiments
we used cosine distance to measure the similarity of two vectors (either document
or word) in the common embedding space.
Keyword Suggestion
Given an input word as a query, we aim to find the nearest words in the vector space.
This is useful in the setting of, for example, search retargeting, where advertisers
bid on search keywords related to or describing their product or service, and may
use the proposed model to expand the list of targeted keywords. Table 5.2 shows
example keywords from the vocabulary, together with their nearest word neighbors
in the embedding space. Interpretable semantic relationships and associations can
be observed among the closest neighbors of the input keywords.
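A minimal nearest-neighbor lookup of this kind, assuming a word-embedding matrix W and a vocab list mapping row indices to words, might look as follows:

# Sketch of keyword suggestion by cosine nearest neighbors in the embedding
# space. W is an assumed (K x dim) word-embedding matrix and vocab maps row
# indices to word strings.
import numpy as np

def nearest_words(query, W, vocab, topn=10):
    idx = vocab.index(query)
    W_norm = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
    sims = W_norm @ W_norm[idx]          # cosine similarity to every word
    order = np.argsort(-sims)            # most similar first
    return [vocab[i] for i in order if i != idx][:topn]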
Table 5.2: Nearest neighbors of selected keywords

university  movies      batman     woods    hijack         tennis       messi      boxing
school      characters  superman   tiger    hijacked       singles      neymar     welterweight
college     films       superhero  masters  whereabouts    masters      ronaldo    knockouts
california  studio      gotham     holes    transponders   djokovic     barca      fights
center      audiences   comics     golf     autopilot      nadal        iniesta    middleweight
students    actors      trilogy    hole     radars         federer      libel      ufc
national    feature     avenger    pga      hijackers      celebration  atletico   heavyweight
medical     pictures    scifi      classic  turnback       sharapova    cristiano  bantamweight
american    drama       sequel     par      hijacking      atp          benzema    greats
institute   comedy      marvel     doral    decompression  slam         argentine  wrestling
professor   audience    prequel    mcilroy  baffling       roger        barcelona  amateur
Document Retrieval
Given a query word, one may be interested in finding the most relevant documents,
which is a typical task an online search engine performs. We consider the same
keywords used in the previous section, and find the titles of the closest document
vectors. As can be seen in Table 5.3, the retrieved documents are semantically
related to the input keyword. The proposed approach differs from traditional
information retrieval in that the retrieved document does not need to
contain the query word, as seen in the example of the keyword "boxing".
Document Recommendation
In this task, we search for the nearest news articles for a given news story. The
returned articles can be provided as reading recommendations for users viewing
the query news story. We give examples in Table 5.4, where we can see that
relevant and semantically related documents are located nearby in the latent vector
space. For example, for the article focusing on the Galaxy S5, all nearest documents
are related to the smartphone industry, while for the foods for healthier skin, the
closest articles are about women's fashion and finance.
Table 5.3: Most similar news stories for a given keyword
movies
MTV Awards: `American Hustle,' `Wolf of Wall Street' Lead Nominations
3 Reasons `Jurassic World' Is Headed in the Right Direction
Irish Film and Television Academy Lines Up Stellar Guest List for Awards
10 things the Academy Awards won't say
I saw Veronica Mars, thanks to a $35 donation, 2 apps and an $8 movie ticket
tennis
Tennis-Venus through to third round, Li handed walkover
Nadal rips Hewitt, Serena and Sharapova survive at Miami
Williams battles on at Sony Open in front of empty seats
Serena, Sharapova again on Miami collision course
Wawrinka survives bumpy start to Sony Open
hijack
Thai radar might have tracked missing plane
Criminal probe under way in Malaysia plane drama
Live: Malaysia asks India to join the expanding search
Malaysia dramatically expands search for missing jet
Malaysia widening search for missing plane, says minister
university
The 20 Public Colleges With The Smartest Students
Spring storm brings blizzard warning for Cape Cod
U.S. News Releases 2015 Best Graduate Schools Rankings
No Friday Night Lights at $60 Million Texas Stadium: Muni Credit
New Orleans goes all in on charter schools. Is it showing the way?
boxing
World Series of Fighting: Yushin Okami's Debut for WSOF 9 in Las Vegas
UFC Champ Jon Jones Denies Daniel Cormier Title Shot Request
UFC contender Alexander Gustafsson staring at a no-win situation
Alvarez back as Molina, Santa Cruz defend boxing titles
Anthony Birchak Creates MFC Championship Ring, Promotion to Follow Suit
entertainment
MTV Movie Awards: `American Hustle,' `Wolf of Wall Street' Lead Nominations
Madison Square Garden Company Buys 50 Percent Stake in Tribeca Enterprises
Renaissance Technologies sells its Walt Disney Company position
News From the Advertising Industry
10 things the Academy Awards won't say
Document Tag Recommendation
In this task, we find the nearest words given a news story as input. The retrieved
keywords can act as tags for a news article, or can be further used to match display
ads shown alongside the article. In Table 5.5, we show the titles of example news
stories, together with the list of nearest words. We can see that the retrieved keywords
often summarize and further explain the documents. For example, in the first
example, related to Individual Savings Accounts (ISAs), the keywords include "pensioners"
and "taxfree", while in the mortgage-related example the keywords include
several financial companies and advisors (e.g., Nationstar, Moelis, Berkowitz).
Table 5.4: Titles of retrieved news articles for given news examples
This year's best buy ISAs
Furniture sales might finally, actually be ending
How to use an Isa to invest in property
Pensions: now you can have a blank canvas - not an annuity
Ed Balls' Budget Response Long on Jokes, a Bit Short on Analysis
Half a million borrowers to be repaid interest and charges
Galaxy S5 will get off to a slow start in Samsung's home market
New specs revealed for one of 2014's most intriguing Android phones
LG G Pro 2 review: the evolutionary process
[video] HTC wins smartphone of the year
Samsung apparently still has a major role in Apple's iPhone 6
Samsung's new launch, the Galaxy S5, lacks innovative features
Western U.S. Republicans to urge appeals court to back gay marriage
Ky. to use outside counsel in gay-marriage case
Disputed study's author testifies on gay marriage
Texas' Primary Color Battle Begins
Eyes on GOP as Texas holds nation's 1st primary
Michigan stumbles in court defending same-sex marriage ban
5 Foods for Healthier Skin
17 Ways to Skinny Up Your Fridge
Ways to Save Money on Your Wedding
How to Protect Your Finances in a Divorce
Lululemon for Less: How to Buy Workout Clothes on a Budget
Melissa Gilbert, Natalie Grant Come Forward as Hoax Victims
Uncle Sam buying mortgages? Who knew?
Putin's leash yanked by restless Russia investors
Puerto Rico's travails hit muni bond firm that bet big
EBay worst-run company I have ever seen: Carl Icahn
Statements From People Who Lost Money On Mt. Gox Are Seriously Sad
Bernanke says Fed could have done more during crisis
Dwyane Wade on pace to lead all guards in shooting
Bobcats beat Bucks 101-92, win 4th straight
Thunder-Bulls Preview
Griffin leads Clips past Cavs to 11th straight win
Shawn Marion, Mavericks easily get by Thunder, 109-86
Clippers win 11th straight, crush Cavaliers
News Categorization

Lastly, we used the learned representations to label news documents with the 19
first-level topic tags from the company's internal hierarchy (e.g., "home & garden",
"science"). We used a linear SVM to predict each topic separately, and the average
improvement over LDA after 5-fold cross-validation is given in Table 5.6. We
see that the proposed method outperformed the competition on this large-scale
problem, strongly confirming the benefits of our HDV approach for contextual
document representation. We also observe that Skip-gram (without words) obtained
Table 5.5: Top related words for news stories
This year's best buy ISAs
isas pensioners savers oft annuity
isa pots taxfree nomakeupselfie allowance
Galaxy S5 will get off to a painfully slow start in Samsung's home market
mwc quadcore snapdragon oneplus ghz
appleinsider samsung lumia handset android
`Star Wars Episode VII': Actors Battle for Lead Role (EXCLUSIVE)
reboot mutants anthology sequels prequel
liv helmer vfx villains terminator
Western U.S. Republicans to urge appeals court to back gay marriage
lesbian primaries rowse beshear legislatures
schuette heterosexual gubernatorial stockman lgbt
Pope marks anniversary with prayer and a tweet
catholics pontiff curia dignitaries papacy
xvi halal theology seminary bishops
Uncle Sam buying mortgages? Who knew?
berkowitz moelis erbey gse cios
lode ocwen nationstar fairholme subsidizing
5 Foods for Healthier Skin
carbs coconut cornstarch bundt vegetarian
butter tablespoons tsp dieters salad
Dwyane Wade on pace to lead all guards in shooting
beverley bynum vogel spoelstra shootaround
dwyane nuggets westbrook bledsoe kobe
Table 5.6: Relative average accuracy improvement (%) over LDA method.
Method Relative Improvement
LDA 0.00
PV 0.27
SG 2.26
HDV 4.39
significantly higher accuracy than PV (without document sequences). This indicates
the informativeness of the context of document streams.
5.3 Conclusion
We described a general unsupervised learning framework to uncover the latent
structure of streaming documents navigated by users on the Web, in which feature
vectors are used to represent documents and words in the same latent space. Our
models exploit the context of documents in streams and learn representations that
capture temporal co-occurrences of documents and statistical patterns, such as
those in users' online news navigation. The approach was validated on a movie genre
classification task. Experiments on large-scale click-through logs of Yahoo News
also demonstrated the effectiveness and wide application area of the proposed neural
language models.
Chapter 6
Discussion and Future Directions
In previous chapters, we introduced novel neural network algorithms to model
generic networks, networks with text attributes and human navigation trails on
networked documents. In this chapter, we compare our models with prior work,
and show the connections between them. We also present a few interesting future
directions.
6.1 Discussion
Our approaches to network embeddings build on the foundations of recent models
[Mikolov et al., 2013b, Mnih and Kavukcuoglu, 2013, Le and Mikolov, 2014],
which mainly focus on modeling text in documents. Recent approaches to learning
network embeddings represent a network as a "document" [Perozzi et al., 2014,
Tang et al., 2015b, Grover and Leskovec, 2016]. These approaches fit a node to its
local neighborhood, which is represented by nearby nodes that can be reached
within a few hops in the network. The basic assumption is that nodes in similar
local neighborhoods will have similar representations. However, sampling neighboring
nodes beyond a few hops in a network is costly in terms of computation. We
extend previous models to consider the neighborhoods of nodes within "infinite"
hops, which "collapse" into the global network structure. A global network vector
is hence used to represent the global network structure. The global network vector
acts as a global memory that updates and collects pieces of the structural information
from all local neighborhoods. This allows our approach to perform better in real-world
applications at a low additional computational cost.
In order to model networks with text attributes, we go further and explore
the document networks and human navigation trails in real data to regularize neural
embeddings. This results in smoother distributed representations. The
difference between our models and previous models such as Paragraph Vector [Le
and Mikolov, 2014] is also analogous to the difference between LDA [Blei et al.,
2003] and PLSA [Hofmann, 1999], the two classic topic models which
are based on bag-of-words representations. LDA assumes a Dirichlet prior on the
topic distributions of documents, while PLSA does not and suffers from overfitting
due to the linear growth of the number of parameters in the number of documents.
Our models take advantage of the contexts of documents (defined by the existing links or
human navigation trails) to overcome the overfitting problem. The information
from document contexts serves as an "empirical prior" which helps to learn smoothed
representations for documents. This is especially useful for modeling short documents
with just a few words, for which [Le and Mikolov, 2014] tends to learn poor
document representations, since each separately learned document vector overfits to
the few words within the document. In contrast, our models capture the inherent
connections between documents and do not depend on the word content as
much.
6.2 Future Directions
The following research directions may improve our proposed algorithms.
Research shows that people [Watts et al., 2002] categorize their social networks
hierarchically into cognitive groups based on their characteristics, such as relation-
ship, geography and occupation. People also tend to be good at selecting clues to
recall social network information through compression heuristics [Brashears, 2013]
such as triadic closure and kin labels. An interesting future direction is to leverage
these human strategies to learn distributed representations of social networks.
We can also extend our approaches to address network dynamics and learn
dynamic representations of networks. Our approaches can also be applied
to identifying the correspondence of nodes in two different networks, for example,
aligning the accounts of the same users across different social networking services.
When modeling the local neighborhoods of nodes in networks, first-order or second-order
random walks are used in our models. In order to improve performance,
we may adopt higher-order random walks with longer-term memory,
although it is an open question whether this is practical in real-world applications.
In our models, a sliding window with a fixed length is used to sample local contexts.
However, it may capture only a limited dependency between objects in sequences.
Another interesting direction is to use recurrent neural networks [Mikolov et al.,
2010, Hochreiter and Schmidhuber, 1997] to model the long-term dependency
between words, nodes or documents in sequences, although it is often not clear
whether this will result in a performance gain on particular tasks.
The availability of human navigation data on knowledge concepts in Wikipedia,
such as Wikispeedia (http://cs.mcgill.ca/~rwest/wikispeedia/), provides an opportunity
to learn distributed representations of concepts by respecting human inference of
semantic links between concepts. Comparing the learned representations to those
learned from the existing hyperlinks between concepts can shed light on the discovery
of missing links and the reorganization of Wikipedia pages. We may also investigate
this direction as future work.
Reference List
Edoardo M Airoldi, David M Blei, Stephen E Fienberg, and Eric P Xing. Mixed membership stochastic blockmodels. In Advances in Neural Information Processing Systems, pages 33–40, 2009.

Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep speech 2: End-to-end speech recognition in English and Mandarin. arXiv preprint arXiv:1512.02595, 2015.

Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, volume 14, pages 585–591, 2001.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Michele Berlingerio, Danai Koutra, Tina Eliassi-Rad, and Christos Faloutsos. NetSimile: a scalable approach to size-independent network similarity. arXiv preprint arXiv:1209.2684, 2012.

Mikhail Bilenko and Ryen W White. Mining the search trails of surfing crowds: identifying relevant websites from user activity. In Proceedings of the 17th International Conference on World Wide Web, pages 51–60. ACM, 2008.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
Matthew E Brashears. Humans use compression heuristics to improve the recall of social networks. Scientific Reports, 3, 2013.

Bobby-Joe Breitkreutz, Chris Stark, Teresa Reguly, Lorrie Boucher, Ashton Breitkreutz, Michael Livstone, Rose Oughtred, Daniel H Lackner, Jürg Bähler, Valerie Wood, et al. The BioGRID interaction database: 2008 update. Nucleic Acids Research, 36(suppl 1):D637–D640, 2008.

Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh World Wide Web Conference, 1998.

Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management, pages 891–900. ACM, 2015.

Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011.

Jonathan Chang and David M Blei. Relational topic models for document networks. In International Conference on Artificial Intelligence and Statistics, pages 81–88, 2009.

Jonathan Chang, Jordan Boyd-Graber, and David M Blei. Connections between the lines: augmenting social networks with text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178. ACM, 2009.

Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C Aggarwal, and Thomas S Huang. Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 119–128. ACM, 2015.

Charalampos Chelmis, Hao Wu, Vikram Sorathia, and Viktor K Prasanna. Semantic social network analysis for the enterprise. Computing and Informatics – Special Issue on Computational Intelligence for Business Collaboration, 2013.

Ed H Chi, Peter Pirolli, Kim Chen, and James Pitkow. Using information scent to model user information needs and actions and the web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 490–497. ACM, 2001.
Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 87–94. ACM, 2008.

Mukund Deshpande and George Karypis. Selective Markov models for predicting web page accesses. ACM Transactions on Internet Technology (TOIT), 4(2):163–184, 2004.

Laura Dietz, Steffen Bickel, and Tobias Scheffer. Unsupervised prediction of citation influences. In Proceedings of the 24th International Conference on Machine Learning, pages 233–240. ACM, 2007.

Nemanja Djuric, Hao Wu, Vladan Radosavljevic, Mihajlo Grbovic, and Narayan Bhamidipati. Hierarchical neural language models for joint representation of streaming documents and their content. In Proceedings of the 24th International Conference on World Wide Web, pages 248–255. International World Wide Web Conferences Steering Committee, 2015.

Francois Fouss, Alain Pirotte, Jean-Michel Renders, and Marco Saerens. Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation. IEEE Transactions on Knowledge and Data Engineering, 19(3):355–369, 2007.

Brian Gallagher and Tina Eliassi-Rad. Leveraging label-independent features for classification in sparsely labeled networks: An empirical study. In Advances in Social Network Mining and Analysis, pages 1–19. Springer, 2010.

Florian Geigl, Kristina Lerman, Simon Walk, Markus Strohmaier, and Denis Helic. Assessing the navigational effects of click biases and link insertion on the web. In Proceedings of the 27th ACM Conference on Hypertext and Social Media, pages 37–47. ACM, 2016.

Lise Getoor. Introduction to statistical relational learning. MIT Press, 2007.

Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864, 2016.

Keith Henderson, Brian Gallagher, Lei Li, Leman Akoglu, Tina Eliassi-Rad, Hanghang Tong, and Christos Faloutsos. It's who you know: graph mining using recursive structural features. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 663–671. ACM, 2011.
Geoffrey E Hinton. Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, volume 1, page 12. Amherst, MA, 1986.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Nathan Oken Hodas and Kristina Lerman. How limited visibility and divided attention constrain social contagion. In SocialCom. Citeseer, 2012.

Peter D Hoff, Adrian E Raftery, and Mark S Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.

Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.

Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM, 1999.

Bernardo A Huberman, Peter LT Pirolli, James E Pitkow, and Rajan M Lukose. Strong regularities in world wide web surfing. Science, 280(5360):95–97, 1998.

Charles Kemp, Thomas L Griffiths, and Joshua B Tenenbaum. Discovering latent classes in relational data. 2004.

Ryan Kiros, R Zemel, and Ruslan Salakhutdinov. Multimodal neural language models. In Proceedings of the 31st International Conference on Machine Learning, 2014a.

Ryan Kiros, Richard S Zemel, and Ruslan Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. arXiv preprint arXiv:1406.2710, 2014b.

Jon Kleinberg. Small-world phenomena and the dynamics of information. Advances in Neural Information Processing Systems, 1:431–438, 2002.

Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

Jon M Kleinberg. Navigation in a small world. Nature, 406(6798):845–845, 2000.

Joseph B Kruskal and Myron Wish. Multidimensional scaling, volume 11. Sage, 1978.
Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. In NIPS, pages 2717–2725, 2012.

Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1188–1196, 2014.

Ronny Lempel and Shlomo Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1):387–401, 2000.

Kristina Lerman, Nathan Hodas, and Hao Wu. Bounded rationality in scholarly knowledge discovery. arXiv preprint arXiv:1710.00269, 2017.

David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007.

Arthur Liberzon, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdóttir, Pablo Tamayo, and Jill P Mesirov. Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12):1739–1740, 2011.

Matt Mahoney. Large text compression benchmark. http://www.mattmahoney.net/text/text.html, 2009.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Julian J McAuley and Jure Leskovec. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pages 539–547, 2012.

Andrew McCallum, Andrés Corrada-Emmanuel, and Xuerui Wang. Topic and role discovery in social networks. 2005.

Qiaozhu Mei, Deng Cai, Duo Zhang, and ChengXiang Zhai. Topic modeling with network regularization. In Proceedings of the 17th International Conference on World Wide Web, pages 101–110. ACM, 2008.

Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26–30, 2010, pages 1045–1048, 2010.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013b.

Stanley Milgram. The small world problem. Psychology Today, 2(1):60–67, 1967.

Andriy Mnih and Geoffrey Hinton. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648. ACM, 2007.

Andriy Mnih and Koray Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems, pages 2265–2273, 2013.

Andriy Mnih and Yee Whye Teh. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426, 2012.

Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pages 246–252, 2005.

Ramesh M Nallapati, Amr Ahmed, Eric P Xing, and William W Cohen. Joint latent topic models for text and citations. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 542–550. ACM, 2008.

Christopher Olston and Ed H Chi. ScentTrails: Integrating browsing and searching on the web. ACM Transactions on Computer-Human Interaction (TOCHI), 10(3):177–197, 2003.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.

Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

Ruslan Salakhutdinov and Geoffrey E Hinton. Replicated softmax: an undirected topic model. In NIPS, volume 22, pages 1607–1614, 2009.

Gerard Salton and Michael J McGill. Introduction to modern information retrieval. 1983.

Aju Thalappillil Scaria, Rose Marie Philip, Robert West, and Jure Leskovec. The last click: Why users give up information network navigation. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 213–222. ACM, 2014.
Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.

Philipp Singer, Denis Helic, Andreas Hotho, and Markus Strohmaier. HypTrails: A Bayesian approach for comparing hypotheses about human trails on the web. In Proceedings of the 24th International Conference on World Wide Web, pages 1003–1013. International World Wide Web Conferences Steering Committee, 2015.

Adish Singla, Ryen White, and Jeff Huang. Studying trailfinding algorithms for enhanced web search. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 443–450. ACM, 2010.

Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, pages 926–934, 2013.

Nitish Srivastava, Ruslan R Salakhutdinov, and Geoffrey E Hinton. Modeling documents with deep Boltzmann machines. In UAI, 2013.

Greg Ver Steeg, Rumi Ghosh, and Kristina Lerman. What stops social epidemics? arXiv preprint arXiv:1102.1985, 2011.

Jian Tang, Meng Qu, and Qiaozhu Mei. PTE: Predictive text embedding through large-scale heterogeneous text networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1165–1174. ACM, 2015a.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee, 2015b.

Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 990–998. ACM, 2008.

Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 817–826. ACM, 2009a.
Lei Tang and Huan Liu. Scalable learning of collective behavior based on sparse social dimensions. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1107–1116. ACM, 2009b.

Lei Tang and Huan Liu. Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3):447–478, 2011.

Jaime Teevan, Christine Alvarado, Mark S Ackerman, and David R Karger. The perfect search engine is not enough: a study of orienteering behavior in directed search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 415–422. ACM, 2004.

Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 173–180. Association for Computational Linguistics, 2003.

Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.

Alastair J Walker. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS), 3(3):253–256, 1977.

Chong Wang and David M Blei. Collaborative topic modeling for recommending scientific articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 448–456. ACM, 2011.

Daixin Wang, Peng Cui, and Wenwu Zhu. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1225–1234. ACM, 2016.

Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.

Duncan J Watts, Peter Sheridan Dodds, and Mark EJ Newman. Identity and search in social networks. Science, 296(5571):1302–1305, 2002.

Robert West and Jure Leskovec. Automatic versus human navigation in information networks. In ICWSM, 2012a.
Robert West and Jure Leskovec. Human wayfinding in information networks. In Proceedings of the 21st International Conference on World Wide Web, pages 619–628. ACM, 2012b.

Robert West, Ashwin Paranjape, and Jure Leskovec. Mining missing hyperlinks from human navigation traces: A case study of Wikipedia. In Proceedings of the 24th International Conference on World Wide Web, pages 1242–1252. International World Wide Web Conferences Steering Committee, 2015.

Ryen W White and Jeff Huang. Assessing the scenic route: measuring the value of search trails in web logs. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 587–594. ACM, 2010.

Hao Wu and Kristina Lerman. Deep context: a neural language model for large-scale networked documents. IJCAI, 2017a.

Hao Wu and Kristina Lerman. Network vector: Distributed representations of networks with global context. arXiv preprint arXiv:1709.02448, 2017b.

Hao Wu, Charalampos Chelmis, Vikram Sorathia, Yinuo Zhang, Om Prasad Patri, and Viktor K Prasanna. Enriching employee ontology for enterprises with knowledge discovery from social networks. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 1315–1322. ACM, 2013.

Hao Wu, Martin Renqiang Min, and Bing Bai. Deep semantic embedding. In Proceedings of the Workshop on Semantic Matching in Information Retrieval, co-located with the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SMIR@SIGIR), pages 46–52, 2014.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

Wayne W Zachary. An information flow model for conflict and fission in small groups. Journal of Anthropological Research, pages 452–473, 1977.

Reza Zafarani and Huan Liu. Social computing data repository at ASU, 2009.