SCALABLE MACHINE LEARNING ALGORITHMS FOR ITEM RECOMMENDATION
by
Kuan Liu
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)
August 2018
Copyright 2018 Kuan Liu
Dedication
To Jinyuan and Xinyue
Acknowledgments
As this thesis concludes my Ph.D. journey, a memorable academic undertaking mixed with joy and excitement as well as occasional stress and frustration, I feel great relief, satisfaction, and wholehearted gratitude. I would like to take this chance to express my sincere thanks to my dear mentors and friends.
It has been my great fortune to work under the supervision of an excellent advisor, Prem Natarajan. Prem is a visionary researcher, an acute analytical thinker, an inspiring leader, and a good friend. This thesis could not have been completed without his full support, cheerful guidance, and insightful discussions. As a researcher, I am constantly inspired by Prem to work on important research challenges and to be crystal clear in thinking through problems and delivering results. As a friend, Prem is always enthusiastic to offer me help whenever needed, my job search being one example of many. Finally, as a big fan of Prem's storytelling and technical writing, I greatly enjoyed our numerous conversations and paper co-edits, from which I benefited both inside and outside my research.
I want to thank Aram Galstyan, Craig Knoblock, Kevin Knight, and Shri Narayanan for graciously serving as my committee members and providing feedback and comments on my proposal and defense, which guided my research and helped me improve the quality of this thesis.
I am very grateful to my undergraduate thesis advisor, Jun Zhu, for preparing me for my Ph.D. study and introducing me to the fascinating world of statistical machine learning. I remain deeply inspired by Jun's research passion, talent, and self-discipline.
During my Ph.D. I have been fortunate to work with and alongside many incredible individuals: Ayush Jaiswal, Ekraam Sabir, Emily Sheng, Karishma Sharma, Linhong Zhu, Stephen Rawls, Rex Yue Wu, Shuyang Gao, Wenzhe Li, Yuan Shi, and Zekun Li from ISI; Aurélien Bellet, Boqing Gong, Dingchao Lu, Franziska Meier, Soravit Changpinyo, Wei-Lun Chao, and Zhiyun Lu from USC. I want to particularly thank my early labmate, TA colleague, and dear friend, Tomer Levinboim, for his long-term help with my research and Ph.D. life. I want to thank my close collaborator, housemate, and karate partner, Xing Shi, for wonderful collaboration and help after my transition to ISI. Additionally, my warm thanks go to Lizsl De Leon (USC), Karen Rawlins (ISI), and Janice Wheeler (ISI) for their tremendous help throughout my Ph.D. years and our countless entertaining conversations.
Finally, I could not have set out on this path without the confidence instilled in me by my dear parents, Jinhui Xue and Yi Liu. It is their endless love, support, and encouragement at every stage of my life that allow me to keep exploring new frontiers. Likewise, I would not have had the fortitude to survive this journey without the love, companionship, and unwavering support of my beloved wife, Jennifer Jinyuan Guo, whose care, understanding, and joyfulness make this journey vivid and enjoyable.
Table of Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract
1: Introduction
    1.1 Thesis statement
        1.1.1 Learn to rank large item sets in a batch learning framework
        1.1.2 Sequence modeling on implicit recommendation
        1.1.3 Temporal learning approaches to a job recommender system
        1.1.4 Incorporating heterogeneous attributes
        1.1.5 Efficiently fuse signals from multiple modalities
    1.2 Thesis contributions
2: Background
    2.1 Recommender systems
    2.2 Recommendation with implicit feedback
    2.3 Domains and datasets
        2.3.1 Jobs
        2.3.2 Online businesses
        2.3.3 Movies
3: A Batch Learning Framework for Scalable Personalized Ranking
    3.1 Introduction
    3.2 Notations
    3.3 Position-dependent personalized ranking
        3.3.1 Pointwise rank approximation
        3.3.2 Pairwise rank approximation
    3.4 Approach
        3.4.1 Batch rank approximations
        3.4.2 Smooth rank-sensitive loss functions
        3.4.3 Algorithm
    3.5 Related work
    3.6 Experiments
        3.6.1 Setup
        3.6.2 Results
    3.7 Summary
4: Sequence Modeling on Recommendation with Implicit Feedback
    4.1 Motivation
    4.2 Related work
    4.3 Sequence modeling approaches
        4.3.1 Skip-gram-rec
        4.3.2 CBOW-rec
        4.3.3 LSTMs-rec
    4.4 Comparison with non-sequence models
    4.5 How sequence modeling helps
        4.5.1 Should we sample feedback or not?
        4.5.2 What if data scale changes?
    4.6 Summary
5: Temporal Learning Approaches to a Job Recommender System
    5.1 Motivation
        5.1.1 Job recommendation
        5.1.2 Task
        5.1.3 Evaluation metric
    5.2 Approaches
        5.2.1 Temporal observations
        5.2.2 Linear ranking model
    5.3 Experiments
    5.4 Summary
6: Incorporating Heterogeneous Attributes in Recommender Systems
    6.1 Motivation
    6.2 Approaches
        6.2.1 Problem formulation and notation
        6.2.2 Identity embedding via sequential recommendation
        6.2.3 Joint attribute embedding
        6.2.4 A hierarchical attribute combination
        6.2.5 Shared attribute embedding in output layer
    6.3 Related work
    6.4 Experiments
        6.4.1 Setup
        6.4.2 Recommendation accuracy
        6.4.3 A qualitative analysis of learned embedding
    6.5 Summary
7: Learn to Combine Modalities in Multimodal Deep Learning
    7.1 Introduction
    7.2 Background
        7.2.1 Traditional approaches
        7.2.2 Multimodal deep learning
    7.3 A multiplicative combination layer
        7.3.1 Combine in a multiplicative way
        7.3.2 Boosted multiplicative training
    7.4 Select modal mixtures
        7.4.1 Modality mixture candidates
        7.4.2 Mixture selections
    7.5 Related work
    7.6 Experiments
        7.6.1 Setup
        7.6.2 Results
    7.7 Summary
8: Conclusions
    8.1 Summary of contributions
        8.1.1 Efficient rankers
        8.1.2 Content-based methods
    8.2 Conclusions and perspectives
    8.3 Supporting publications
Bibliography
List of Tables

Table 2.1: Dataset statistics of RS16.
Table 2.2: Dataset statistics of XING.
Table 2.3: Dataset statistics of Yelp.
Table 2.4: Dataset statistics of movies.
Table 3.1: Dataset statistics. U: users; I: items; S: interactions.
Table 3.2: Recommendation accuracy comparisons (in %). Results are averaged over 5 experiments with different random seeds. Best and second best numbers are in bold and italic, respectively.
Table 3.3: Dataset/model complexity comparisons.
Table 3.4: Comparisons of objective values (obj) and recommendation accuracies (NDCG) on the development set among full-batch and sampled-batch algorithms. q = |Z|/|Y|; q = 1.0 means full batch.
Table 4.1: Statistics of data sets.
Table 4.2: Recommendation accuracy comparisons (in %). Best and second-best single model results are in bold (e.g., 8.73) and italic (e.g., 7.41), respectively. Scores on Yelp, ML-1m, and ML-20m are calculated after removing history items from recommendations for all models. (P: P@5; R: R@30; N: NDCG@30)
Table 4.3: LSTMs-rec relative scores on original and "manipulated" XING datasets. "Orig" denotes the complete sequence. xN denotes the manipulated dataset obtained by randomly sampling sub-sequences proportional to N times.
Table 5.1: Scores in thousands (K) based on history interactions.
Table 5.2: Scores (K) and training time (h: hours) by both hybrid matrix factorization and temporal reweighted matrix factorization models.
Table 5.3: Scores (K) by THMF with some "impression"s as additional observation inputs.
Table 6.1: An attribute division example. How attributes in Figure 6.1 are divided into three kinds.
Table 6.2: Recommendation accuracy and training time comparisons between MIX and HET on different models.
Table 6.3: Recommendation accuracy comparisons of different attribute embedding integrations.
Table 6.4: Recommendation accuracy comparisons with other state-of-the-art models. Best and second best single model scores are in bold and italic, respectively.
Table 6.5: Examples of attributes and their nearest neighbors in the embedding space (from Yelp and XING).
Table 7.1: Test error rates/AUC comparisons on CIFAR100, HIGGS, and gender tasks. MulMix uses the default value 0.5; MulMix* tunes between 0 and 1. Experimental results are from 5 random runs. The best and second best results in each row are in bold and italic, respectively.
Table 7.2: Error rate results of boosted training (MulMix) on HIGGS-full and gender-22.
Table 7.3: Gender-22 error analysis: mistakes that multimodal methods make where individual modals do not (we call this "over-learn"); Mul and MulMix* improve on Add in that respect. The overall improvement is very close to the improvement from "over-learn".
Table 7.4: Test errors of attention models on gender tasks.
Table 7.5: Examples of MulMix* prediction results from single modals and modality mixtures in the gender prediction task. A, B, C denote single modals. On every sample, modal mixtures are sorted by their prediction probabilities of the correct class (numbers in the table). Blue (red) color indicates the probability leads to a correct (incorrect) prediction.
List of Figures

Figure 2.1: Attribute description in RS16 (U = user; I = item).
Figure 3.1: Illustrations of rank approximations and smooth rank-sensitive loss functions. 3.2(a) shows different approximations of indicator functions. 3.2(b) shows smooth loss functions used to generalize the loss (3.4).
Figure 3.2: Relative standard deviations of two types of rank estimators at different item ranks. Simulation is done with item set size N = 100,000. Pairwise sample approximation uses estimator (3.3) and our mini-batch approximation uses (3.8), where 0.001, 0.01, 0.1 refer to the sample ratio |Z|/|Y|.
Figure 3.3: Approximated rank values compared to the true rank values. 3.4(b) is a zoomed-in version with error bars.
Figure 3.4: Training time comparisons between WARP and BARS. Fig. 3.5(a) plots how training time changes across datasets with different scales; Fig. 3.5(b) plots how epoch time changes as the training progresses.
Figure 4.1: Results on XING when an increasing proportion of data is used (proportion {0.2, 0.44, 0.76, 1.0}). LSTMs-rec (red) performance improves steadily while that of NHMF (blue) does not.
Figure 5.1: Weights learned in Model 5.2.2 for interactions 5.2(a) and impressions 5.2(b). K = 4 for INTS and K = 1 for IMPS, where type 1 denotes user-item impression pairs, and types 2, 3, 4 denote click, bookmark, and reply, respectively.
Figure 6.1: An illustrating example where a job seeker interacts with a sequence of job posts. Rich heterogeneous attributes are associated with both the job seeker and job posts. The system is asked to recommend new posts to this user.
Figure 6.2: The overview of the proposed HA-RNN model. The base recommendation model is a recurrent neural network where the item sequence (with the user) is fed as input and the model is trained to predict the next item. The representations for the input (Q_U and Q_I) and for computing item scores in the output layer (Q_I) are combinations of identity and attributes. To compute Q_I, embeddings of the multi-hot attributes (e.g., M_1) are first averaged before being combined with categorical ones (e.g., C_1, C_2; identity is regarded as a categorical attribute in the figure) and numerical ones (e.g., N_1). The same item representation is shared in both input and output layers for enhanced signals and model regularization. Computation of Q_U is omitted in the figure.
Figure 6.3: 2-D visualization of embedded attributes of XING's categorical attributes by t-SNE.
Figure 7.1: Illustration of different deep neural network based multimodal methods. (a) A gender prediction example with text (a fake userid) and dense feature (fake profile information) modality inputs. (b) Additive combination methods train neural networks on top of aggregate signals from different modalities; equal errors are back-propagated to the different modality models. (c) Multiplicative combination selects a decision from a more reliable modality; errors back-propagated to the weaker modality are suppressed. (d) Multiplicative modality mixture combination first additively creates mixture candidates and then selects useful modality mixtures with a multiplicative combination procedure.
Figure 7.2: Comparisons to results from deeper networks. Error rates and standard deviations from fusion networks with hidden layer structures are reported and compared to our models (i.e., MulMix and MulMix*). Simply going deep in networks does not necessarily help improve generalization. Experimental results are from 5 random runs.
Figure 7.3: Error rates and standard deviations under different values. Optimal results do not appear at either 0 or 1. Experimental results are from 5 random runs.
Abstract
Recommendation with implicit feedback aims to propose items to users that are useful and relevant. It has enormous applications in fields like e-commerce, social networks, music, television, and so on. Existing recommendation methods face scalability challenges from increasingly large data volumes and rich-format signals. In particular, the challenges include how to efficiently train ranking models on a large user set or item set, how to capture complex user behavior patterns from sparse input signals, and how to incorporate rich side information that is of different formats and from different domains.

This thesis investigates scalable machine learning algorithms to improve recommendation accuracy as well as the models' ability to handle large-scale data. It is innovative in designing novel ranking algorithms to deal with recommendations from large item sets and to model sequential and temporal properties of user feedback. It also advances state-of-the-art content-based recommendation approaches by modeling heterogeneous attributes and efficiently fusing signals from multiple modalities. Empirical studies on data from different domains show that scalable and flexible learning approaches can efficiently help to extract useful information from sparse feedback and enhance recommendation performance.
Chapter 1
Introduction
In recent years, research on recommendation with implicit feedback (also known as implicit recommendation or item recommendation) has been increasingly active [Hu et al., 2008, Lee et al., 2009, Rendle et al., 2009, Johnson, 2014, Bayer et al., 2016, He et al., 2016b] and is applied in various applications like video [Covington et al., 2016, Hidasi et al., 2015], e-commerce [Linden et al., 2003, Grbovic et al., 2015], social networks [Chen et al., 2009, Abel et al., 2011], location recommendation [Liu et al., 2014], etc. After observing user activities such as clicks, page views, and purchases (often called implicit feedback), the goal is to recommend a ranked list of items that the user might prefer. Compared to explicit feedback (e.g., ratings, satisfaction scores), implicit feedback is not as obvious in terms of preferences. However, such implicit feedback data can usually be collected in a passive way, and thus faster and at a larger scale than explicit feedback. The implicit recommendation setup also fits a broader range of real-life recommendation scenarios by allowing the input to be simply binary. Both of these properties make it a practical way for modern systems to analyze user sentiment and preferences.
Scalability is one key challenge that different implicit recommendation methods have been trying to address. First, the item set from which personalized recommendations should be made is often huge; it is common to see millions of items. This large pool size and the fine granularity in prediction pose significant algorithmic and computational challenges to automated systems. Second, the system is supposed to model and serve a very large set of users, while many of them provide very little feedback. The sparse (and noisy) data structure requires the system to be flexible enough to capture personalized preferences, and at the same time to be carefully regularized to generalize predictions to different users, such as those in cold-start situations. Moreover, as additional side information becomes prevalent, it is always tempting but nontrivial to include data such as profiles, timestamps, text, and images to enhance system prediction performance. Active research is being conducted in scalable model and algorithm design.
From matrix factorization to more flexible and scalable machine learning algorithms.
The most prominent method in the recommendation community is matrix factorization [Weimer et al., 2007, Koren et al., 2009]. It is a family of collaborative filtering methods that fit latent factors to users and items to preserve the user-item similarities derived from the observed interactions (also called events). As it gained enormous popularity in explicit recommendation (predicting ratings between user-item pairs) [Zhang et al., 2006, Koren et al., 2009, Jamali and Ester, 2010, Baltrunas et al., 2011], it also became a popular base model for implicit recommendation [Hu et al., 2008, Rendle et al., 2009, Usunier et al., 2009, Weston et al., 2010, Shi et al., 2012a]. Matrix factorization models are simple and intuitive to interpret; they also provide relatively easy extensions to include other side information [Koren, 2010, Rendle, 2010].
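As a concrete illustration of the latent-factor idea, the toy sketch below fits user and item factors on implicit feedback with pairwise (BPR-style) updates that contrast each observed item against a randomly sampled one. It is only an illustrative sketch under hypothetical dimensions and hyperparameters, not any of the exact algorithms cited above.

```python
import numpy as np

def mf_implicit(interactions, n_users, n_items, dim=32, lr=0.05, reg=0.01,
                epochs=10, seed=0):
    """Toy matrix factorization for implicit feedback.

    Each observed (user, item) pair is treated as positive; a randomly
    sampled item serves as the negative contrast (pairwise update).
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, dim))  # user latent factors
    V = 0.1 * rng.standard_normal((n_items, dim))  # item latent factors
    for _ in range(epochs):
        for u, i in interactions:
            j = rng.integers(n_items)          # sampled negative item
            x = U[u] @ (V[i] - V[j])           # preference margin
            g = 1.0 / (1.0 + np.exp(x))        # gradient of -log(sigmoid(x))
            du = g * (V[i] - V[j]) - reg * U[u]
            di = g * U[u] - reg * V[i]
            dj = -g * U[u] - reg * V[j]
            U[u] += lr * du
            V[i] += lr * di
            V[j] += lr * dj
    return U, V
```

The learned factors score an item for a user as the inner product `U[u] @ V[i]`, and recommendation amounts to ranking items by that score.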
To increase model flexibility and scalability, different algorithms have been developed based on matrix factorization. To address the fact that there is usually no negative feedback, confidence scores [Hu et al., 2008] and sampled negative items [Rendle et al., 2009, Usunier et al., 2009] are used as contrast examples to train models. [Hidasi and Tikk, 2012, Yu et al., 2012, Bayer et al., 2016] accelerate model training with fast optimization algorithms. Tensor factorization [Chi and Kolda, 2012, Bhargava et al., 2015, Karatzoglou et al., 2010, Xiong et al., 2010] extends matrix factorization to include additional information. Objective functions are modified [Usunier et al., 2009, Weston et al., 2010, Shi et al., 2012a] to make models focus on the quality of the top part of a recommendation list.

More recently, the recommendation community has seen an increased interest in flexible and scalable modeling and algorithms. Classification techniques are explored in model training [Li et al., 2016, Covington et al., 2016]. Deep neural networks are used to extract feature representations from items to improve recommendation [Van den Oord et al., 2013, Wang et al., 2015b]. To go beyond matrix factorization, algorithms and models from the natural language processing community have also been borrowed and applied [Grbovic et al., 2015, Liang et al., 2016, Vasile et al., 2016].
1.1 Thesis statement
The purpose of this thesis is to provide scalable machine learning solutions for item recommendation. These include ranking algorithms for large item sets, deep learning methods that account for sequential and temporal properties of user feedback, and models of rich data signals such as heterogeneous attributes and input features from multiple modalities.
1.1.1 Learn to rank large item sets in a batch learning framework
Item recommendation is challenging, partly due to its extremely large candidate pool. It is not uncommon to see a recommendation task involve an item candidate set with a size of up to a million or even larger. This large scale makes it hard for algorithms to predict accurately and makes computation very expensive. First, the prediction problem is very unbalanced. Only a very few items out of the entire set need to be selected for presentation to a user; this poses a difficult task for existing classification or ranking algorithms. Second, the large sizes of user and item sets make algorithms slow or impractical. Efforts to overcome these difficulties include tailoring factorization models for implicit recommendation [Hu et al., 2008], designing sampling-based training algorithms for ranking [Rendle et al., 2009, Rendle and Freudenthaler, 2014], and improving ranking algorithms to focus on top recommendations [Usunier et al., 2009, Weston et al., 2010, Shi et al., 2012a]. While these methods tackle the problem and alleviate the challenge, it remains difficult to train an accurate model in large-scale scenarios.

We design a batch learning framework to address the scalability challenge. We argue that a batch learning approach, rather than the pointwise or pairwise approaches adopted in popular algorithms, is more efficient in estimating the ranks of target items, which is a critical step in improving recommendation quality. Meanwhile, it uses modern parallel computation infrastructures in a natural way, thus offering opportunities to accelerate model training. In Chapter 3, we propose a set of personalized ranking algorithms to deal with the large item set challenge. We demonstrate that the method improves the system's top-N recommendation performance and shows time efficiency advantages as data scale increases. The batch training idea has recently been explored in the recommendation community [Hidasi et al., 2015, Covington et al., 2016]. Our algorithms are built on (more complex) ranking objective functions and show empirical advantages in top-N recommendation accuracy.
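To give a flavor of batch rank estimation, the sketch below estimates a target item's rank from the scores of a uniformly sampled subset of items, scaling the violator count back to the full item set. It is an illustrative simplification of the Chapter 3 estimators, with hypothetical parameter names.

```python
import numpy as np

def batch_rank_estimate(scores, pos_idx, sample_ratio=0.1, rng=None):
    """Estimate the rank of a positive item from one batch of scores.

    `scores` holds model scores over the full item set for one user.
    The rank is estimated as the number of higher-scoring items in a
    uniform sample, scaled up by the inverse sampling ratio.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(scores)
    m = max(1, int(sample_ratio * n))
    sample = rng.choice(n, size=m, replace=False)      # sampled item indices
    violators = np.sum(scores[sample] > scores[pos_idx])
    return violators * (n / m)  # scale sample count to a full-set estimate
```

Because one matrix-vector product yields scores for a whole batch of items, this style of estimator maps naturally onto parallel hardware, in contrast to one-item-at-a-time sampling.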
1.1.2 Sequence modeling on implicit recommendation
Item recommendation suffers from sparse and noisy observations in user feedback. Most users have very limited interactions with extremely small portions of the items, and often there is no negative feedback [Hu et al., 2008]. Moreover, user-item interactions are implicit and hard to interpret [Hu et al., 2008, Liu et al., 2015]. It is challenging to build flexible models on this feedback input. To this end, the recommendation community has extensively exploited traditional matrix factorization models to perform collaborative filtering [Hu et al., 2008, Rendle and Freudenthaler, 2014, Usunier et al., 2009, Shi et al., 2012a]. Despite their simplicity and great success, these models suffer from limited expressiveness. Additionally, they do not consider the correlations within the feedback.

However, we argue that the correlation within the feedback contains useful information for capturing user preferences. In Chapter 4 we empirically study three sequence models. For the first time, we study recurrent neural networks in the general item recommendation domain. We also combine them with our attribute techniques proposed in Chapter 6 to significantly improve on state-of-the-art model performance.

Furthermore, exploratory studies are conducted to answer the questions of how sequence feedback helps recommender systems and how sequence models behave with changes of data scale. Our results indicate that the feedback sequences do contain additional discriminative information beyond just item frequency, and that sequence models tend to perform better as data scale increases, while non-sequence models do not.
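As an illustration of treating feedback as sequences, the toy sketch below applies a word2vec-style skip-gram with negative sampling to item interaction sequences, in the spirit of the Skip-gram-rec model studied in Chapter 4. All dimensions and hyperparameters are hypothetical, and this is a sketch rather than the evaluated implementation.

```python
import numpy as np

def skipgram_rec(sequences, n_items, dim=16, window=2, lr=0.05, epochs=5, seed=0):
    """Toy skip-gram over item interaction sequences.

    Items that co-occur within `window` positions of a user's feedback
    sequence are pulled together in embedding space; one random item
    per pair serves as a negative sample.
    """
    rng = np.random.default_rng(seed)
    W_in = 0.1 * rng.standard_normal((n_items, dim))   # "center" item vectors
    W_out = 0.1 * rng.standard_normal((n_items, dim))  # "context" item vectors
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    for _ in range(epochs):
        for seq in sequences:
            for t, center in enumerate(seq):
                lo, hi = max(0, t - window), min(len(seq), t + window + 1)
                for ctx in seq[lo:t] + seq[t + 1:hi]:
                    # one positive (label 1) and one sampled negative (label 0)
                    for target, label in ((ctx, 1.0), (rng.integers(n_items), 0.0)):
                        p = sigmoid(W_in[center] @ W_out[target])
                        g = lr * (label - p)
                        W_in[center], W_out[target] = (
                            W_in[center] + g * W_out[target],
                            W_out[target] + g * W_in[center],
                        )
    return W_in, W_out
```

Unlike matrix factorization, the training signal here comes from the order and co-occurrence of items within each user's sequence, which is exactly the correlation structure that non-sequence models discard.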
1.1.3 Temporal learning approaches to a job recommender system
It is important to model temporal dynamics in recommender systems [Rendle, 2010, Koren, 2010, Koenigstein et al., 2011, Lee et al., 2009]. Temporal learning helps capture important factors such as changes in item popularity, evolution of user preferences, users' periodic interests, etc. Time-aware models [Koren, 2010, Rendle, 2010, Koenigstein et al., 2011] and tensor factorization-based models [Chi and Kolda, 2012, Bhargava et al., 2015, Karatzoglou et al., 2010, Xiong et al., 2010] explicitly model temporal dynamics with additional time-associated parameters. To recommend recurrent services or items, sequence models like the semi-Markov model [Kapoor et al., 2015] and Hawkes processes [Du et al., 2015] have been introduced. In designing our system for a job recommendation task, one drawback of the above approaches is their computational cost. A traditional temporal reweighting technique is faster, as it does not introduce many additional parameters. However, the weights in those techniques are often predefined based on domain-specific knowledge or tuned on a separate data split, which can quickly become impractical.

In Chapter 5 we develop a simple yet effective approach to learn temporal weight coefficients from the data. The approach learns the weights by carefully constructing a set of positive-negative item pairs and fitting a linear ranking model that takes the historical temporal counts as input. The learned weights are applied in two models. First, they are immediately used to recommend recurrent items. Second, the weights are used in training a hybrid matrix factorization model to place uneven penalties on model prediction errors at different times. In an empirical study on a job recommendation task, this approach helps both models achieve significant improvements. It also shows much better time efficiency for the factorization model.
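The pairwise weight-learning idea can be sketched in a few lines: given (positive, negative) pairs of per-time-bucket interaction counts, a linear logistic ranker learns how predictive each bucket is. This is a simplified illustration with hypothetical features, not the exact Chapter 5 model.

```python
import numpy as np

def learn_temporal_weights(pairs, lr=0.1, epochs=50):
    """Fit per-time-bucket weights from (positive, negative) count pairs.

    Each element of `pairs` is (x_pos, x_neg): historical interaction
    counts per time bucket for an item the user interacted with next
    versus one they did not. A linear pairwise logistic loss is
    minimized, so larger weights mark more predictive buckets.
    """
    k = len(pairs[0][0])
    w = np.zeros(k)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            d = np.asarray(x_pos, float) - np.asarray(x_neg, float)
            p = 1.0 / (1.0 + np.exp(-(w @ d)))   # P(pos ranked above neg)
            w += lr * (1.0 - p) * d              # ascend the log-likelihood
    return w

# Hypothetical example: bucket 0 = recent counts, bucket 1 = old counts.
pairs = [([3, 1], [0, 1]), ([2, 0], [0, 2]), ([4, 1], [1, 1])]
w = learn_temporal_weights(pairs)
```

Because only one weight per time bucket is fitted, the learned reweighting adds almost no parameters, which is the efficiency argument made above.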
1.1.4 Incorporating heterogeneous attributes
Recommender systems always try to incorporate additional information on user feedback—
e.g., attributes (or metadata and profiles) of users and items as they become increasingly
prevalent. For instance, LinkedIn user profiles contain information that includes user loca-
tion, education, expertise, work experience, etc. Yelp business attributes contain business
6
name, hours, categories, services, etc. These attributes are often helpful and indicative of
when users choose to interact with items. To this end, different methods have been developed. Content-based approaches [Pazzani and Billsus, 2007] perform recommendation based
on similarities derived from these attributes. Factorization models, including matrix factor-
ization [Fang and Si, 2011, Saveski and Mantrach, 2014, Shmueli et al., 2012, Kula, 2015] and
tensor factorization [Chi and Kolda, 2012, Bhargava et al., 2015, Karatzoglou et al., 2010], are
extensively studied to model attributes together with user activities. Recently, more flexi-
ble models (e.g., topic models and pre-trained neural networks) have been used to model
attributes of text [Bansal et al., 2016, Kim et al., 2016], vision [He and McAuley, 2015], and
music [Van den Oord et al., 2013].
However, these methods do not address the challenge of attribute heterogeneity very well. Attributes often come in different domains and data types—for example, a user may have attributes describing his location and work experience. Attribute data types include real numbers, categorical tokens, text tokens, etc., and different users or items may have different lengths of attributes. Matrix co-factorization models [Fang and Si, 2011,
Saveski and Mantrach, 2014] simultaneously model multiple relations, but some relations are
noisy and not highly relevant to the ultimate task. It is also hard to scale when there are many
types of attributes. Hybrid matrix factorization [Shmueli et al., 2012, Kula, 2015] learns task-
driven embedding, but does not deal well with variable lengths of attributes. It is also specif-
ically designed for matrix factorization models which, despite their great success, are limited
in their expressiveness and flexibility. Topic model and pre-trained neural network approaches typically focus on a single type of attribute, and they may also suffer from domain adaptation issues because they are not learned end-to-end.
We address the challenge and design a generic way to incorporate heterogeneous attributes
in Chapter 6. The technique automatically infers embedding of attributes in the context of the
given recommendation task, and it efficiently deals with variable lengths of attributes. Moreover,
the technique can be employed in simple matrix factorization as well as in more flexible mod-
els. In flexible models such as recurrent neural networks, the novel output layer helps enhance
and regularize attribute embeddings. Empirical evaluation validates its efficiency in helping to improve recommendation accuracy.
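One common way to handle variable-length attribute lists, shown here as a minimal sketch (the embedding table and mean-pooling choice are illustrative assumptions, not the exact technique of Chapter 6), is to pool per-attribute embeddings into one fixed-size vector regardless of how many attributes a user or item has:

```python
import random

def embed_attributes(attr_ids, table, dim):
    """Mean-pool the embeddings of a variable-length attribute list into a
    single fixed-size vector; an empty list maps to the zero vector."""
    if not attr_ids:
        return [0.0] * dim
    vecs = [table[a] for a in attr_ids]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

rng = random.Random(0)
dim = 4
table = {a: [rng.gauss(0, 1) for _ in range(dim)] for a in range(10)}

u1 = embed_attributes([1, 3, 5], table, dim)  # entity with three attributes
u2 = embed_attributes([2], table, dim)        # entity with one attribute
```

Both entities end up with a vector of the same dimension, which can then be fed into a matrix factorization or neural scoring model.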
1.1.5 Efficiently fusing signals from multiple modalities
Signals from multiple modalities contain complementary information, which makes it appealing to combine information from different modalities in recommender systems [Mei et al., 2011, Wang et al., 2015c, Oramas et al., 2017]. However, in practice, combining different modalities requires overcoming challenges such as varying levels of noise and conflicts between modalities. Existing methods do not adopt a joint approach to capturing synergies
between the modalities while simultaneously filtering noise and resolving conflicts on a per-
sample basis.
We address the challenge by fusing signals in general as well as on a per-sample basis. This
helps the model figure out the relative emphasis to be placed on each modality for each sam-
ple. In Chapter 7 we propose and investigate a novel deep neural network based technique that multiplicatively combines information from different source modalities so that the model training process automatically focuses on information from more reliable modalities while reducing
emphasis on the less reliable modalities. Furthermore, we propose an extension that multiplica-
tively combines not only the single-source modalities, but a set of mixed-source modalities to
better capture cross-modal signal correlations. We demonstrate the effectiveness of our proposed technique by presenting empirical results on three multimodal classification tasks from different
domains. The results show consistent accuracy improvements on all three tasks.
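As a rough illustration of multiplicative combination, the sketch below (a product-of-experts style toy with made-up logits, not the exact architecture of Chapter 7) softmaxes each modality's class scores and multiplies them class-wise, so a near-uniform, unreliable modality barely moves the fused decision:

```python
import math

def fuse_multiplicative(modality_logits):
    """Softmax each modality's class logits, multiply the resulting
    probabilities class-wise, and renormalize."""
    def softmax(z):
        m = max(z)
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        return [v / s for v in e]

    probs = [softmax(z) for z in modality_logits]
    prod = [math.prod(p[c] for p in probs) for c in range(len(probs[0]))]
    total = sum(prod)
    return [v / total for v in prod]

confident = [2.0, -2.0, -2.0]    # modality that strongly prefers class 0
uninformative = [0.0, 0.0, 0.0]  # near-uniform, unreliable modality
fused = fuse_multiplicative([confident, uninformative])
```

The fused distribution follows the confident modality, while the uninformative one contributes only a constant factor that cancels in normalization.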
1.2 Thesis contributions
The contributions of this thesis can be summarized as follows:
We develop a personalized ranking algorithm to address the challenge of recommendation
from large item sets. It reduces item rank estimation error in stochastic optimization and
easily uses parallel computation infrastructures to accelerate model training. It consis-
tently outperforms state-of-the-art methods in recommendation accuracy. In addition, it
shows time efficiency advantages as data scale increases: a two-fold speed-up on the largest of the three datasets we evaluated.
We propose to model sequential properties of user feedback via sequence models in item
recommendation. For the first time, we study recurrent neural networks in the general item
recommendation domain. Our models empirically improve state-of-the-art model perfor-
mances significantly on four public datasets, giving 34% to 94% relative improvements on
NDCG scores.
We conduct exploratory studies to answer the question of how sequence feedback helps
recommender systems, and how sequence models behave as data scale changes. Our
results indicate that the feedback sequences do contain additional discriminative informa-
tion rather than just item frequency, and that sequence models tend to have better perfor-
mance when data scale increases, while non-sequence models do not.
We develop a data-driven approach to learning temporal weight coefficients in item rec-
ommendation. The learned weights are applied in recommending recurrent items and in
a hybrid matrix factorization model (HMF) training. The approach helps both models
achieve significant accuracy improvements on a job recommendation task (a 20% score increase for HMF models in the ACM RecSys Challenge 2016). It also shows much better time efficiency on the factorization model (a 2 to 5 times training speed-up).
We develop novel techniques to incorporate heterogeneous attributes in implicit recom-
mendation. The techniques infer attribute embedding in the context of the given recom-
mendation task, and efficiently deal with variable lengths of attributes and attribute spar-
sity. The techniques show considerable recommendation accuracy improvement in simple
matrix factorization, as well as in more flexible models. Our techniques bring 15% and
30% NDCG improvements over state-of-the-art methods on two public datasets.
We propose novel deep neural network based techniques to combine information from different source modalities. The methods automatically focus on information from more
reliable modalities (or modality mixtures) while reducing emphasis on the less reliable
modalities (or modality mixtures). The methods demonstrate better generalization perfor-
mance on three classification tasks.
Chapter 2
Background
2.1 Recommender systems
Recommender systems help users find interesting items (or services, products, etc.) by collecting information on the preferences of those users for a set of items and by using demographic user features, object content (e.g., text, images), social information, etc. As important
information filtering tools, recommender systems are currently widely used to facilitate
user online searching in diverse domains which range from movies [Bell and Koren, 2007],
music [Lee et al., 2010, Nanopoulos et al., 2010, Tan et al., 2011], television [Yu et al., 2006,
Barragáns-Martínez et al., 2010], books [Crespo et al., 2011, Núñez-Valdéz et al., 2012], documents [Serrano-Guerrero et al., 2011, Porcel et al., 2009, Porcel and Herrera-Viedma, 2010, Porcel et al., 2012], to e-learning [Zaïane, 2002, Bobadilla et al., 2009], e-commerce
[Huang et al., 2007, Castro-Schez et al., 2011], social recommendations [Chen et al., 2009] and
so on.
The core of recommendation algorithms is modeling user preferences over a large set of items based on different sources of information. The item set can include movies, hotels,
services, websites, online courses, etc. The information sources include user explicit feedback
(e.g., ratings, satisfaction), user activities (e.g., web page views, clicks, music selection and
degree of completion), user demographic information, item metadata, and even information
from the Internet of things (e.g., GPS location [Zheng et al., 2010], and real-time health sig-
nals [Liu et al., 2009]). It is a challenging problem from many perspectives. We list two major
problems below.
User-item interaction data exhibits a sparse structure. There are often a large number of users and items; it is not uncommon for a typical recommender system to deal with millions of each. However, a user generally interacts with only a few items, and there is often limited item overlap between different users. Meanwhile, activities between users and items are not evenly distributed. Instead, they often follow a power-law distribution [Newman, 2005]: a small portion of users account for the majority of events, and the same applies to the items.
Scalability is another essential consideration. Recommendation algorithms need to take web-scale observation data as input, covering large user and item sets, and the computation can be exceedingly costly. Model complexity is therefore often limited (e.g., to bilinear models) to reduce computation, and incremental algorithms are frequently used to train models when data scale is large and when additional data arrives as a stream. Combining different sources of feedback proves useful in recommendation but adds to the scalability challenges: more sophisticated models and additional parameters lead to increased computation.
Recommendation algorithms can be classified into two major categories. The most traditional approach, applied to movie ratings, is collaborative filtering [Sarwar et al., 2001], where missing item ratings are predicted by combining the ratings of this item from similar users. There
are dierent ways of defining similarity and combining ratings. Memory-based approaches
[Delgado and Ishii, 1999, Yu et al., 2004] use predefined metrics such as Pearson correlation and
vector cosine to compute similarities between users or items. The final ratings are aggregated
ratings from similar users or items. Such approaches are easy to implement and widely used in
early systems.
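A minimal sketch of such a memory-based predictor (the data layout and helper names here are illustrative) computes cosine similarities over co-rated items and returns a similarity-weighted average of the neighbors' ratings:

```python
import math

def cosine(u, v):
    """Cosine similarity over co-rated items; None marks a missing rating."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    num = sum(a * b for a, b in pairs)
    den = (math.sqrt(sum(a * a for a, _ in pairs)) *
           math.sqrt(sum(b * b for _, b in pairs)))
    return num / den if den else 0.0

def predict_rating(target, others, item):
    """Predict the target user's rating of `item` as the similarity-weighted
    average of other users' ratings of that item."""
    num = den = 0.0
    for ratings in others:
        if ratings[item] is None:
            continue
        sim = cosine(target, ratings)
        num += sim * ratings[item]
        den += abs(sim)
    return num / den if den else 0.0

target = [5.0, 4.0, None]                     # rating for item 2 is unknown
others = [[5.0, 4.0, 5.0], [1.0, 2.0, 1.0]]
pred = predict_rating(target, others, item=2)
```

The prediction lands between the neighbors' ratings, pulled toward the more similar user; Pearson correlation is often substituted for cosine to correct for per-user rating bias.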
Model-based collaborative filtering uses data-driven methods, such as singular value decomposition [Paterek, 2007] and Bayesian networks [Ono et al., 2007, Huang and Bian, 2009], to find similarities and make predictions. A well-known example is matrix factorization
[Koren et al., 2009], where ratings or preferences are organized into a matrix whose dimensions
are users and items. Latent factors of users and items are inferred such that their inner products
preserve the ratings in the matrix. Model-based approaches show the advantages of being more
accurate and scalable than memory-based approaches.
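A toy version of this factor inference can be written as plain stochastic gradient descent (a minimal sketch; the hyperparameters and tiny rating matrix are made up for illustration):

```python
import random

def train_mf(ratings, n_users, n_items, k=4, lr=0.05, reg=0.02,
             epochs=200, seed=0):
    """SGD matrix factorization: learn user factors P and item factors Q so
    that the inner product P[u] . Q[i] approximates each observed rating."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][d] * Q[i][d] for d in range(k))
            for d in range(k):
                pu, qi = P[u][d], Q[i][d]
                P[u][d] += lr * (err * qi - reg * pu)  # gradient step on P
                Q[i][d] += lr * (err * pu - reg * qi)  # gradient step on Q
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 2, 2.0)]
P, Q = train_mf(ratings, n_users=2, n_items=3)
pred = sum(P[0][d] * Q[0][d] for d in range(4))  # approaches the observed 5.0
```

After training, the inner products reproduce the observed ratings closely while the regularization term keeps the factors small.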
Content-based methods [Pazzani and Billsus, 2007] make predictions based on user profiles, item metadata, and other information, and are complementary to collaborative filtering methods. Models such as simple matching, SVM classifiers, latent Dirichlet allocation [Blei et al., 2003], and neural networks have been used to model content similarities. Content-based approaches are especially valuable when there are very few interactions and when discriminative content is available.
Evaluation of recommender systems includes online and offline evaluation. Online evaluation often uses A/B testing, where user satisfaction is measured by metrics such as click-through rates and page views. Offline evaluation uses accuracy measures including mean square error, precision, mean average precision, discounted cumulative gain, etc. Beyond recommendation accuracy, people also care about recommendation diversity, the serendipity of the system, and recommendation interfaces.
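As a concrete example of one such offline measure, the snippet below computes NDCG@k for binary relevance (a minimal sketch; the helper name is ours):

```python
import math

def ndcg_at_k(ranked_items, relevant, k):
    """NDCG@k with binary relevance: the DCG of the top-k ranked list
    divided by the DCG of an ideal ordering of the relevant items."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

perfect = ndcg_at_k(["a", "b", "c"], {"a"}, k=3)  # relevant item at rank 1
worse = ndcg_at_k(["b", "a", "c"], {"a"}, k=3)    # relevant item at rank 2
```

The logarithmic discount is what makes the measure position-dependent: the same hit is worth less the lower it appears in the list.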
Recently, interest in recommender research has shifted from the traditional explicit feedback
towards implicit feedback data. While explicit recommendation is designed to predict ratings,
implicit recommendation aims for item prediction. In implicit recommendation, systems col-
lect user preferences passively by monitoring user activities instead of asking for explicit rating
feedback. Unique challenges are posed due to different problem formulations.
In the next two sections, we describe recommendation with implicit feedback and example
domains and datasets on which we conduct our empirical studies.
2.2 Recommendation with implicit feedback
Depending on the nature of user-item interactions, recommendation can be classified into explicit
and implicit feedback problems. Explicit feedback is actively provided by users. It is usually in
the form of ratings, satisfaction, etc., which explicitly describe user preferences and satisfaction.
A well-known example, the Netflix challenge [Bell and Koren, 2007], aims to accurately predict
missing ratings between certain users and items.
In contrast, implicit feedback is often passively collected. The system observes user
activities such as clicks, page views, and purchases. Its goal is to recommend to each user
a ranked list of items that he might prefer. Compared to explicit feedback, implicit feedback
data can usually be collected faster and on a larger scale, thus making it a plausible way for
modern systems to analyze user sentiment and preferences. Meanwhile, the ultimate goal of a
recommender system is to present a small number of interesting or useful items to users instead
of predicting ratings between certain user-item pairs. Recommendation with implicit feedback
is also called implicit recommendation or item recommendation.
Many modern recommender systems are examples of implicit recommendation. Amazon
recommends relevant products to users [Linden et al., 2003]. Audiences receive video recom-
mendations from YouTube [Covington et al., 2016]. Veterans are automatically matched with appropriate commercial jobs based on their skills (http://www.military.com/veteran-jobs/skills-translator/).
Item recommendation is most commonly addressed as a ranking problem. Given a user,
algorithms compute the relevance scores of all items and return those with the highest scores.
Algorithms need to capture user preferences observed from the limited feedback and generalize
to the entire item set. Offline evaluation of an implicit recommendation algorithm commonly
uses top-k recommendation accuracy with measures including precision, recall, MAP, NDCG,
etc. The problem poses several unique challenges compared to explicit recommendation.
First, item recommendation faces a ranking problem instead of a regression problem as does
explicit recommendation, and the large size of an item set makes it algorithmically and compu-
tationally challenging. It requires flexible models to differentiate preferences over all items. It
is algorithmically challenging to train models properly, and it takes a long time for training to
converge.
Second, implicit feedback data is inherently noisy. While a large amount of user activity can be collected, we can only guess at users' preferences and true motives. For example, the observation that a TV show is being watched does not necessarily indicate the viewer's interest: he might simply have stayed on that channel, or he could be asleep. Data in explicit feedback recommendation is instead more reliable.
Third, there is usually no negative feedback. For example, we observe that a customer pur-
chases a product. This does not necessarily mean that he dislikes all the other products. Such
imbalance and asymmetry make it hard to infer and to truly understand the customer's prefer-
ences. In explicit recommendation, both positive and negative feedback is collected. The model
can infer the preferences more easily by contrasting positive and negative feedback.
As in explicit recommendation, collaborative filtering and content-based methods are
extensively exploited in implicit recommendation. For example, low rank factorization
models still serve as the basic tools for many implicit algorithms [Rendle et al., 2009,
Rendle and Freudenthaler, 2014, Usunier et al., 2009, Weston et al., 2010, Shi et al., 2012a];
more content-based algorithms are developed to extract meaningful information in order to bet-
ter compute similarities based on content [Van den Oord et al., 2013, Wang et al., 2015b]. To
address the challenges mentioned above, new techniques are also being developed.
Item sampling has become an important tool in model training for implicit recommendation. It creates negative feedback by sampling items with which a user had no interaction, and it enables training with rank-based objective functions [Rendle et al., 2009]. Advanced sampling methods are used to reduce bias [Rendle and Freudenthaler, 2014] and to improve training efficiency [Zhong et al., 2014]. Boosting algorithms, which provide more adaptive sampling, are also used [Liu et al., 2015].
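In its simplest uniform form (sketched below with hypothetical names), the sampler draws, for each observed interaction, an item the user never interacted with, producing (user, positive, negative) training triples:

```python
import random

def sample_triples(observed, n_items, n_neg=1, seed=0):
    """For each observed (user, item) interaction, sample `n_neg` items the
    user never interacted with, yielding (user, positive, negative) triples."""
    rng = random.Random(seed)
    triples = []
    for user, pos_items in observed.items():
        for pos in pos_items:
            for _ in range(n_neg):
                neg = rng.randrange(n_items)
                while neg in pos_items:       # reject the user's own items
                    neg = rng.randrange(n_items)
                triples.append((user, pos, neg))
    return triples

observed = {0: {1, 2}, 1: {0}}                # user -> interacted item ids
triples = sample_triples(observed, n_items=5)
```

Each triple can then feed a pairwise objective such as BPR, which pushes the positive item's score above the sampled negative's.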
Recent studies consider training criteria that are more suitable for practical concerns. Mean
square error was used in the pioneering implicit recommendation work [Hu et al., 2008] to
implicitly capture user preferences. [Rendle et al., 2009] designed a loss as a proxy of AUC
to preserve the item order. In an attempt to focus on top-k recommendation performance, newer
loss functions [Usunier et al., 2009, Weston et al., 2010, Shi et al., 2012a] have been developed
so that the quality of the top part of a recommendation list is emphasized.
Deep neural network techniques are being introduced to the recommendation community. The Restricted Boltzmann Machine showed efficiency in explicit recommendation tasks such as the early Netflix Challenge. In content-based implicit algorithms, convolutional neural networks and stacked denoising autoencoders have been used to extract feature representations from items to improve recommendation [Van den Oord et al., 2013, Wang et al., 2015b]. Sequence embedding models have recently been explored to move beyond matrix factorization in item recommendation for products [Grbovic et al., 2015], music [Vasile et al., 2016], and videos [Hidasi et al., 2015].
2.3 Domains and datasets
In this section we describe several important datasets on which we validate our proposed
approaches.
2.3.1 Jobs
The problem of matching job seekers to postings has attracted great attention from both academia [Malinowski et al., 2006] and industry (e.g., XING, www.xing.com, and LinkedIn, www.linkedin.com) in recent years.
Given a user, the goal of the job recommendation system is to predict those job postings that are
likely to be relevant to the user. In order to fulfill this task, various data sources can be exploited.
We experiment on two datasets from the ACM RecSys Challenge 2016 [Abel et al., 2016].
RS16
RS16 is the dataset used in the RecSys Challenge. It provides around 12 weeks of interaction
data for a subset of users and job items from the social networking and job search website,
Table 2.1: Dataset statistics of RS16.
Target/all users: 150K / 1.5M; Active/all items: 327K / 1.3M; Interactions: 8.8M; Impressions: 202M
xing.com. The task is to predict the items that a set of target users will positively interact with in
weeks 13 and 14.
The training dataset includes user profiles, job postings, and interactions that users performed
on job posts. Interactions include click, bookmark, reply, and delete. There are 1,367,057 users
and 1,358,098 items. Users and items are described by a rich set of attributes such as job cate-
gories, career level, industry, location, experience, (anonymized) description tokens, etc. Cate-
gorical attributes, such as career level and industry, take several to dozens of values, and descrip-
tor features have a vocabulary size around 100K. The detailed attribute information is shown in
Fig. 2.1.
In addition, the dataset also contains user-item impressions, i.e., information about job post-
ings that were shown to users by the XING system. These impressions help indicate user interests
and predict their future activities. Timestamps of interactions and impressions are given. The
detailed quantitative information of this dataset is shown in Table 2.1.
Figure 2.1: Attribute description in RS16 (U = user; I = item).
Categorical (U): career level, discipline id, industry id, id, country, region, exp years, exp in entries class, exp in current
Descriptors (U): job roles, field of studies
Categorical (I): id, career level, discipline id, country, region, employment
Numerical (I): latitude, longitude, created at
Descriptor (I): title, tags
Solutions are submitted online (five times maximum per day) and scores are computed and
displayed in a leaderboard. The evaluation measure provided by the challenge organizer is based
on the key performance indicators that XING is using to monitor the quality of the job recom-
mender system.
Table 2.2: Dataset statistics of XING.
Target/all users: 150K / 1.5M; Items: 327K; Training interactions: 2.3M; Test interactions: 484K
XING
XING is a dataset we created from RS16. The ground truth interactions that are used in the online leaderboard were acquired from XING as test data to allow offline evaluation.
To simplify the task and focus on an implicit recommendation problem, we remove impression data and merge click, bookmark, and reply into a single event type, which is also what we need to predict. Non-active items are removed. The data statistics of XING are in Table 2.2. User profiles and item metadata stay the same as in RS16.
2.3.2 Online businesses
Online business plays a transformative role in modern life. Users actively interact with businesses, and massive data is generated every day. This leads to fruitful research and applications for studying online user behavior.
Yelp, a large business review platform, collects user activities like check-in, review, tip, reser-
vation, etc., at millions of local businesses. It also has rich user profiles and business metadata
information. These factors make it a good dataset for studying user preferences and recommen-
dations.
Yelp
The Yelp dataset comes from the Yelp Challenge Dataset (https://www.yelp.com/dataset_challenge). We downloaded the dataset in February 2017. The original Challenge dataset contains information about 4.1M reviews and 947K tips by 1M users for 144K local businesses in 11 cities across 4 countries. It has 1.1M business attributes, e.g., hours, parking availability, ambience. It has aggregated check-ins over time for the businesses. 200,000 pictures from the included businesses are also provided.
Table 2.3: Dataset statistics of Yelp.
Users: 1.0M; Items: 144K; Training reviews: 1.9M; Testing reviews: 213K
We convert the Challenge dataset into a recommendation task: predicting which business a user might want to review. There are 1,029,433 users and 144,073 businesses. Interactions here are defined as reviews. Following the online protocol of [He et al., 2016b], we sort all interactions in chronological order and take the last 10% for testing and the rest for training. There are 1,917,218 reviews in the training split and 213,460 in the testing split. The statistics are listed in Table 2.3.
2.3.3 Movies
Movie recommendation is a classical task studied in the recommendation community. Numerous algorithms are benchmarked on movie recommendation tasks to demonstrate their ability to model user preferences and make recommendations.
MovieLens is a web-based recommender system and virtual community that recommends
movies for its users to watch, based on their film preferences using collaborative filtering of
members’ movie ratings and movie reviews.
GroupLens Research provides the rating datasets collected from the MovieLens website (www.movielens.org) for research use. Several versions of the datasets have been released at different times, and many types of research have been conducted based on the MovieLens datasets.
Movielens-1m
ML-1m [Harper and Konstan, 2016] was released in February 2003. It contains 1,000,209
anonymous ratings of approximately 3,900 movies. Each user has at least 20 movie ratings.
Ratings are made on a 5-star scale. Timestamps of ratings are recorded and represented in sec-
onds.
Table 2.4: Dataset statistics of movies.
ML-1m: 6,040 users; 3,883 items; 457,322 train; 117,297 test
ML-20m: 138,493 users; 27,278 items; 7,899,901 train; 2,039,972 test
In addition to rating records, user demographic information and movie information are pro-
vided. User demographic information includes gender, age, occupation, and zip-code. Movie
information includes a title and genre (e.g., action, adventure, drama, musical, sci-fi, etc.).
To conduct implicit recommendation studies, we transform the data into binary labels indicating whether a user rated a movie above 4.0. Users with fewer than 10 movie ratings are discarded. For each user, we use the first 70% of the movies he rated as training data and the remaining 30% as test data.
Movielens-20m
ML-20m [Harper and Konstan, 2016] was generated in October 2016. It describes 5-star rating
and free-text tagging activity from MovieLens. It contains 20,000,263 ratings and 465,564 tag
applications across 27,278 movies. These data were created by 138,493 users between January
09, 1995 and March 31, 2015.
Similar to ML-1m, movie titles and genres are provided. In addition, user tagging activities
provide movies with a set of text tags and their relevance scores. We filter tags so that each movie keeps at most 20 tags (those with the highest relevance scores). There is no user demographic information in ML-20m.
The same train/test splitting is performed as in ML-1m. The dataset statistics of ML-1m and ML-20m are in Table 2.4.
Chapter 3
A Batch Learning Framework for Scalable Personalized
Ranking
In this chapter we introduce our approach to scaling up item recommendation. We argue that the current state-of-the-art algorithm does not scale to large item sets due to its intrinsically online learning style. In contrast, our new batch-training-based algorithm can explicitly use parallel computation to accelerate training and to estimate item ranks more accurately. Through empirical studies on three item recommendation tasks, we demonstrate that our approach achieves significant accuracy improvements. Moreover, it shows clear time efficiency advantages as data scale increases.
3.1 Introduction
The task of personalized ranking is to provide each user with a ranked list of items that he
might prefer. It has received considerable attention in academic research [Hu et al., 2008,
Rendle et al., 2009, He et al., 2016b], and algorithms developed are applied in various
applications in e-commerce [Linden et al., 2003], social networks [Chen et al., 2009], loca-
tion [Liu et al., 2014], etc. However, personalized ranking remains a very challenging task: 1)
The learning objectives of ranking models are hard to directly optimize. For example, the qual-
ity of the model output is commonly evaluated by ranking measures such as NDCG, MAP, and
MRR, which are position-dependent (or rank-dependent) and non-smooth. This makes gradient-based optimization infeasible and computationally expensive. 2) The size of the item set that
a ranking task uses can be very large. It is not uncommon to see an online recommender system
with millions of items. As a consequence, it increases the difficulty of capturing user preferences
over the entire set of items. It also makes it harder to compute or estimate the rank of a particular
item.
Traditional approaches model user preferences with rank-independent algorithms. Pairwise
algorithms convert the learning task into many binary classification problems and optimize the
average classification accuracy. For example, BPR [Rendle et al., 2009] maximizes the proba-
bility of correct prediction of each pairwise item comparison. MMMF [Srebro et al., 2005] min-
imizes a margin loss for each pair in a matrix factorization setting. Listwise algorithms such
as those recently explored by [Hidasi et al., 2015, Covington et al., 2016] treat the problem as a
multi-class classification and use cross-entropy as the loss function.
Despite the simplicity and wide application, these rank-independent methods are not satis-
factory because the quality of results from a ranking system is highly position-dependent. A high
accuracy at the top of a list is more important than that at a low position on the list. However, the
average accuracy targeted by the pairwise algorithms discussed above places equal weights at all
the positions. This mismatch therefore leads to under-exploitation in the prediction accuracy at
the top part. Listwise algorithms, on the other hand, do make an attempt to push items to the top
using a classification scheme. However, their classification criterion also does not match well with
ranking.
Position-dependent approaches are explored to address the above limitations. One
critical question is how to approximate item ranks to perform rank-dependent training.
TFMAP [Shi et al., 2012a] and CLiMF [Shi et al., 2012b] approximate an item rank purely
based on the model score of this item, i.e., a pointwise estimate. Particularly, it models the
reciprocal rank of an item with a sigmoid transformed from the score returned by the model.
TFMAP then optimizes a smoothed modification of MAP, while ClifMF optimizes MRR. This
pointwise estimation is simple, but it is only loosely connected to the true ranks. The estimation
becomes even more unreliable as the itemset size increases.
An arguably more direct approach is to optimize the ranks of relevant items returned
by the model to encourage top accuracy. It requires the computation or estimation of
item ranks and modification of the updating strategy. This idea is explored in traditional
learning to rank methods LambdaNet [Burges et al., 2005], LambdaRank [Burges et al., 2007],
etc., where the learning rate is adjusted based on item ranks. In personalized ranking,
WARP [Weston et al., 2010] proposes to use a sampling procedure to estimate item ranks. It
repeatedly samples negative items until it finds one that has a higher score. Then the number of
sampling trials is used to estimate item ranks. This stochastic pairwise estimation is intuitive.
WARP is also found to be more competitive than BPR [Hong et al., 2013]. A more recent matrix
factorization model LambdaFM [Yuan et al., 2016] adopts the same rank estimation. However,
this pairwise rank estimator becomes very noisy and unstable as the item set size increases (as we will demonstrate), and each estimation takes a large number of sampling iterations. Moreover, its intrinsically online learning style prevents full utilization of available parallel computation (e.g., GPUs), making it hard to train large or flexible models which rely on such computation.
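The WARP-style estimator can be sketched as follows (an illustrative simplification with precomputed scores; the real algorithm interleaves sampling with model updates). Averaging many runs makes the noisiness of a single estimate visible:

```python
import random

def warp_rank_estimate(score_pos, neg_scores, rng):
    """Sample negative items uniformly until one outscores the positive item,
    then estimate the positive item's rank as |negatives| // #trials."""
    n = len(neg_scores)
    for trials in range(1, n + 1):
        if neg_scores[rng.randrange(n)] > score_pos:  # violating item found
            return n // trials
    return 0                                          # no violator sampled

rng = random.Random(0)
neg_scores = [0.9] * 50 + [0.1] * 50   # true rank of the positive item is 50
estimates = [warp_rank_estimate(0.5, neg_scores, rng) for _ in range(1000)]
avg = sum(estimates) / len(estimates)  # individual estimates vary wildly
```

Even in this easy case, where half the negatives violate, single estimates swing between 0 and the full item count, which is the instability the batch approach of this chapter is designed to avoid.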
The limitations of these approaches largely come from the stochastic pairwise esti-
mation component. As a comparison, training with batches or mini-batches together with parallel computation has recently offered opportunities to tackle scalability challenges.
[Hidasi et al., 2015] use a sampled classification setting to train a RNNs-based recommender
system. [Covington et al., 2016] deploy a two-stage classification system in YouTube video rec-
ommendation. Parallel computation (e.g., GPUs) is extensively used to accelerate training and
support flexible models.
In this work we propose a novel framework to do personalized ranking. Our aim is to have
better rank approximations and to promote top accuracy in large-scale settings. The contribu-
tions of this work are:
We propose rank estimations that are based on batch computation. They are shown to be
more accurate and stable in approximating the true item rank.
We propose smooth loss functions that are “rank-sensitive.” This promotes top accuracy when the loss function values are optimized. Being differentiable, the functions are easily implemented and integrated with back-propagation updates.
Based on batch computation, the algorithms explicitly utilize parallel computation to speed
up training. They lend themselves to flexible models which rely on more extensive com-
putation.
Our experiments on three item recommendation tasks show consistent accuracy improve-
ments over state-of-the-art algorithms. They also show time eciency advantages when
data scale increases.
The remainder of the chapter is organized as follows. We first introduce notation and preliminary
methods. We next detail our new methods, followed by a discussion of related work.
Experimental results are then presented. We conclude with discussions and future work.
3.2 Notations
In this chapter we use the letter x for users and the letter y for items. Unbolded
letters denote single users or items; bolded letters denote sets of items.
In particular, a single user and a single item are denoted by x and y, respectively; Y denotes
the entire item set; \mathbf{y}_x denotes the positive items of user x, that is, the subset of items that
user x has interacted with; \bar{\mathbf{y}}_x \triangleq Y \setminus \mathbf{y}_x is the irrelevant item set of user x. We omit the subscript x when
there is no ambiguity. S = \{(x, y)\} is the set of observations. The indicator function is denoted
by I. f denotes a model (or model parameters), and f_y(x) denotes the model score for user x and item
y; for example, in a matrix factorization model, f_y(x) is computed as the inner product of the latent
factors of x and y. Given a model f, a user x, and its positive items \mathbf{y}, the rank of an item y is
defined as

    r_y \triangleq rank_y(f, x, \mathbf{y}) = \sum_{y' \in \bar{\mathbf{y}}} I[f_y(x) \le f_{y'}(x)],    (3.1)

where we use the same definition as in [Usunier et al., 2009] and ignore the order within positive
items. The indicator function is sometimes approximated by the continuous margin loss |1 - f_y(x) + f_{y'}(x)|_+, where |t|_+ \triangleq \max(t, 0) for all t \in \mathbb{R}.
3.3 Position-dependent personalized ranking
Position-dependent algorithms take the ranks of predicted items into account during model training.
A practical challenge is how to estimate item ranks efficiently. As seen in (3.1), the ranks
depend on the model parameters and change dynamically. The definition is non-smooth in
the model parameters due to the indicator function, and computing ranks requires comparisons
with all the irrelevant items, which can be costly.
Existing position-dependent algorithms address this challenge with different rank approximation
methods, which can be categorized into pointwise and pairwise approximations. We describe
these approaches in the following.
3.3.1 Pointwise rank approximation
Item ranks are approximated in TFMAP [Shi et al., 2012a] and CLiMF [Shi et al., 2012b] using
a pointwise approach. Particularly,

    r_y \approx rank^{point}_y(f, x) = 1/\sigma(f_y(x)),    (3.2)

where \sigma(z) = 1/(1 + e^{-z}) for all z \in \mathbb{R}. The approximated rank r_y is then plugged into an evaluation
metric, MAP (as in TFMAP) or MRR (as in CLiMF), to make the objective smooth. The
algorithms then use gradient-based methods for optimization.
In (3.2), rank^{point}_y is close to 1 when the model score f_y(x) is high and becomes large when
f_y(x) is low. This connection between model scores and item ranks is intuitive and implicitly
encourages good accuracy at the top. However, rank^{point}_y is only loosely connected to the rank
definition in (3.1). In practice, it does not capture the non-smooth characteristics of ranks; for
example, small differences in model scores can lead to dramatic rank differences when the item
set is large.
3.3.2 Pairwise rank approximation
An alternative approach, used by WARP [Weston et al., 2010] and
LambdaFM [Yuan et al., 2016], estimates item ranks based on comparisons between pairs of
items. The critical component is an iterative sampling approximation procedure: given a
user x and a positive item y, keep sampling a negative item y' \in \bar{\mathbf{y}} uniformly with replacement
until the condition 1 + f_{y'}(x) < f_y(x) is violated. With this sampling procedure the item
rank is estimated by

    r_y \approx rank^{pair}_y(f, x, y) = \lfloor (|\bar{\mathbf{y}}| - 1) / N \rfloor,    (3.3)

where N is the number of sampling trials needed to find the first violating example and \lfloor z \rfloor is the
largest integer that is no greater than z.
The intuition behind this estimation is that the number of sampling trials follows a geometric
distribution: if the item's true rank is r_y, the probability of drawing a violating item
is p = r_y / (|\bar{\mathbf{y}}| - 1), and the expected number of trials is E(N) = 1/p = (|\bar{\mathbf{y}}| - 1)/r_y.
To promote top accuracy, the estimated item ranks are used to modify the updating procedure.
For example, in WARP, they are plugged into a loss function defined as

    L^{owa}(x, y) = \Phi^{owa}[r_y] = \sum_{j=1}^{r_y} \alpha_j,  with \alpha_1 \ge \alpha_2 \ge \ldots \ge 0,    (3.4)

where the \alpha_j, j = 1, 2, \ldots, are predefined non-increasing scalars. The function \Phi^{owa} is derived from
the ordered weighted average (OWA) of classification losses [Yager, 1988] in [Usunier et al., 2009].
It defines the penalty incurred by placing r_y irrelevant elements before a relevant one. Choosing
equal \alpha_j means optimizing the mean rank, while choosing \alpha_{j>1} = 0, \alpha_1 = 1 means optimizing
the proportion of top-ranked correct items. With strictly decreasing \alpha's, it optimizes the top part
of a ranked list.
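As an illustration, the OWA loss (3.4) is just a cumulative sum of the weights; the harmonic choice alpha_j = 1/j below is one common decreasing schedule, not the only one:

```python
def owa_loss(rank, alphas):
    """OWA loss, Eq. (3.4): cumulative sum of the first `rank` weights,
    with alpha_1 >= alpha_2 >= ... >= 0."""
    return sum(alphas[:rank])

# A common choice: harmonic weights alpha_j = 1/j, which emphasize
# errors at the top of the ranked list.
harmonic = [1.0 / j for j in range(1, 101)]
```

With equal weights the loss grows linearly in the rank (mean-rank optimization); with harmonic weights, pushing an item from rank 2 to rank 1 saves more loss than pushing one from rank 50 to 49.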
3.4 Approach
We begin by pointing out several limitations of approaches based on pairwise rank estimation.
First, the rank estimator in (3.3) is not only biased but also has a large variance. The expectation
of the estimator in (3.3), for the parameter p of a geometric distribution, is approximately
p(1 + \sum_{k=2}^{N} \frac{1}{k}(1 - p)^{k-1}) > p. In a ranking example where p = r/N (with N the population size), it seriously
overestimates the rank when r is small. Moreover, we will demonstrate later that the estimator has high
estimation variance. We believe this poor estimation may lead to training inefficiency. Second, it
can take a large number of sampling iterations before finding a violating item in each step,
especially after the beginning stage of training. This results in a low frequency of
model updates; in practice, prolonged training time is observed. Finally, the intrinsically sequential
learning fashion of pairwise estimation prevents full utilization of available parallel
computation (e.g., GPUs), making it hard to train large or flexible models that rely heavily
on such parallel computation.
We address these limitations by combining the ideas of batch computation and rank-dependent
training losses. Particularly, we propose batch rank approximations and generalize (3.4) to smooth
rank-sensitive loss functions. The resulting algorithm gives more accurate rank approximations
and allows back-propagation updates.
3.4.1 Batch rank approximations
To obtain a stable and accurate rank approximation that leads to efficient algorithms, we
exploit the idea of batch training, which has recently been actively explored or revisited in areas
such as model design [Covington et al., 2016] and optimization [Chen et al., 2016].
To begin, we define the margin rank (mr) as follows:

    rank^{mr}_y(f, x, \mathbf{y}) = \sum_{y' \in \bar{\mathbf{y}}} |1 - f_y(x) + f_{y'}(x)|_+,    (3.5)

where the margin loss replaces the indicator function in (3.1), and the target item y is
compared to a batch of negative items \bar{\mathbf{y}}. As illustrated in Figure 3.1(a), the margin loss (green
Figure 3.1: Illustrations of rank approximations and smooth rank-sensitive loss functions.
(a) Indicator approximations: the step loss y = 1(x < 0) and its smooth approximations
y = max(1 - x, 0), y = 2 sigm(max(1 - x, 0)) - 1, and y = sigm(-x).
(b) Rank-sensitive loss functions used to generalize the loss (3.4): (r+1)^p - 1, log(r+1),
1 - exp(-r), OWA, and BPR.
curve) is a smooth convex upper bound of the original step loss (indicator) function.
The margin rank sums up the margin errors and characterizes the overall violation by irrelevant items.
The margin rank can be sensitive to "bad" items that have significantly higher scores than the
target item. As seen in Eq. (3.5) or Figure 3.1(a), such a bad item contributes much more than
one to the rank estimate. To suppress that effect, we add a sigmoid transformation, i.e.,

    rank^{smr}_y(f, x, \mathbf{y}) = \sum_{y' \in \bar{\mathbf{y}}} 2\sigma(|1 - f_y(x) + f_{y'}(x)|_+) - 1,    (3.6)

where \sigma(z) = 1/(1 + e^{-z}) for all z \in \mathbb{R}. We call this the suppressed margin rank (smr). Additionally,
we study a smoother version without the margin formulation, i.e., rank^{sr}_y(f, x, \mathbf{y}) = \sum_{y' \in \bar{\mathbf{y}}} \sigma(f_{y'}(x) - f_y(x)). Therefore, our rank approximations can be written as

    rank^{batch}_y(f, x, \mathbf{y}) = \sum_{y' \in \bar{\mathbf{y}}} \tilde{r}(x, y, y'),    (3.7)

where \tilde{r}(x, y, y') takes one of the following three forms:

    \tilde{r}(x, y, y') = |1 - f_y(x) + f_{y'}(x)|_+                (mr)
    \tilde{r}(x, y, y') = 2\sigma(|1 - f_y(x) + f_{y'}(x)|_+) - 1   (smr)
    \tilde{r}(x, y, y') = \sigma(f_{y'}(x) - f_y(x))                (sr)
Note that the rank approximations in (3.7) are computed in a batch manner: model scores
between a user and a batch of items are first computed in parallel; the terms \tilde{r} are then computed
accordingly and summed up. This batch computation allows model parameters to be updated more
frequently, and the parallel computation speeds up model training.
The full batch treatment may become infeasible or inefficient when the item set gets overly
large. A mini-batch approach is used in that case. Instead of computing the rank based on
the full set Y as in (3.7), the mini-batch version samples Z, a random subset of Y
(without replacement), and computes

    rank^{mb}_y(f, x, \mathbf{y}) = \frac{|Y|}{|Z|} \sum_{y' \in Z} \tilde{r}(x, y, y') \, I(y' \in \bar{\mathbf{y}}).    (3.8)

Although a sampling step is involved, we argue below that (3.8) does not lead to large variances
the way the sampling in pairwise approaches does. First, (3.8) is an unbiased estimator of the
(suppressed) margin rank. Second, the sampling schemes of the two approaches are different. To
gain a better idea, we simulate the two types of sampling and plot standard deviations.
Figure 3.2 shows that (3.8) has much smaller variances than (3.3) as long as |Z|/|Y| is not
too small.
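A simplified simulation (our own, drawing indicator terms i.i.d. with replacement for brevity rather than subsampling without replacement) contrasts the behavior of the two estimators:

```python
import random

def pairwise_estimate(true_rank, n_items):
    """Simulate the geometric-sampling estimator, Eq. (3.3), for an
    item of known true rank."""
    p = true_rank / (n_items - 1)
    n = 1
    while random.random() >= p:
        n += 1
    return (n_items - 1) // n

def minibatch_estimate(true_rank, n_items, ratio):
    """Simulate the mini-batch estimator, Eq. (3.8), with indicator
    terms: count violators in a subset of size |Z|, rescale by |Y|/|Z|."""
    z = max(1, int(n_items * ratio))
    hits = sum(random.random() < true_rank / n_items for _ in range(z))
    return hits * n_items / z
```

Averaging many draws of `minibatch_estimate` recovers the true rank with a small spread, while single draws of `pairwise_estimate` fluctuate wildly, mirroring Figure 3.2.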
3.4.2 Smooth rank-sensitive loss functions
The rank-based loss function (3.4) operates on item ranks and provides a mechanism to advocate
top accuracy. However, its input must be an integer, and it is non-smooth. Consequently, it is
not applicable to our batch rank estimators and does not support gradient-based optimization. In
the following, we generalize (3.4) to smooth rank-sensitive loss functions that similarly advocate
top accuracy.
We first observe that a loss function \ell encourages top accuracy when the following
three conditions are satisfied; we then call it rank sensitive (rs):
1. \ell is a smooth function of the rank r.
Figure 3.2: Relative standard deviations of two types of rank estimators at different item ranks.
Simulation is done with item set size N = 100,000. The pairwise sample approximation uses
estimator (3.3); our mini-batch approximation uses (3.8), where 0.001, 0.01, 0.1 refer to the
sample ratio |Z|/|Y|.
2. \ell is increasing, i.e., \ell' > 0.
3. \ell is concave, i.e., \ell'' < 0.
The second condition indicates that the loss increases when the rank of a target item
increases; thus, given a single item, minimizing the objective pushes the item toward
the top. The third condition means the increase is fast when the rank is small and slow when the rank
is large, so the loss is more sensitive to items with small ranks. Given more than one item, an algorithm
that minimizes this objective prioritizes items with small estimated rank values.
Based on this observation, we study several families of functions \ell^{rs} that satisfy these
conditions: polynomial, logarithmic, and exponential functions, i.e.,
    \ell^{rs}_1(r) = (1 + r)^p - 1,    0 < p < 1
    \ell^{rs}_2(r) = \log(r + 1)
    \ell^{rs}_3(r) = 1 - \gamma^{-r},    \gamma > 1
It follows from standard calculus that these functions satisfy the above conditions. Thus,
they all incur a (smoothly) weighted loss based on estimated rank values and advocate top accuracy.
We plot several of these functions in Figure 3.1(b) and compare them to BPR and OWA (with
\alpha's = 1, 1/2, 1/3, \ldots). BPR is equivalent to a linear function, which places a uniformly increasing
penalty on the estimated rank. Polynomial and logarithmic functions have a diminishing return
as the estimated rank increases and are unbounded. Exponential functions are bounded by 1,
and the penalty on high rank values quickly saturates.
3.4.3 Algorithm
The algorithm can then be formulated as minimizing an objective based on the (mini-)batch
approximated rank values and the rank-sensitive loss functions. It sums over all observed
user-item pairs. Particularly,

    L = \sum_{(x,y) \in S} \ell^{rs}(rank^{mb}_y(f, x, \mathbf{y})) + \Omega(f),    (3.9)

where rank^{mb}_y is given by (3.8), \ell^{rs} takes one of the forms \ell^{rs}_i, i = 1, 2, 3, and \Omega(f) is a model regularization term.
Gradient-based methods are used to solve the optimization problem. The gradient with
respect to the model can be written as

    \partial L / \partial f = \sum \ell'(r) \, \partial r / \partial f,    (3.10)

where \ell'(r) takes the form p(1 + r)^{p-1} (or 1/(r + 1), or \gamma^{-r} \ln \gamma), and we ignore the regularization term
for the moment. We call the framework defined in (3.9) Batch-Approximated-Rank-Sensitive
loss (BARS). The details are described in Algorithm 1.
Algorithm 1: BARS training.
Input: training data S; mini-batch size m; sample rate q; a learning rate \eta.
Output: the model parameters f.
initialize parameters of model f randomly;
while objective (3.9) is not converged do
    sample a mini-batch of observations \{(x, y)_i\}_{i=1}^{m};
    sample an item subset Z from Y, with q = |Z|/|Y|;
    compute approximated ranks by (3.8);
    update model parameters: f \leftarrow f - \eta \, \partial L / \partial f based on (3.10);
end
Note that in Algorithm 1 computation is conducted in a batch manner. Particularly, it computes
model scores between a mini-batch of users x and the sampled items Z in parallel. In every
step it updates the parameters of all users in x and all items in Z.
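A minimal sketch of one step of Algorithm 1, assuming a plain matrix factorization with user factors U and item factors V, margin rank, and the logarithmic loss; the update follows (3.10) by the chain rule. This is our own illustration, not the thesis implementation:

```python
import numpy as np

def bars_step(U, V, batch, item_ids, q=0.5, lr=0.1, rng=None):
    """One step of Algorithm 1 for f_y(x) = <U[x], V[y]>, using the
    mini-batch margin rank (Eq. 3.8) and the loss l(r) = log(r + 1)."""
    rng = rng or np.random.default_rng(0)
    Z = rng.choice(item_ids, size=max(1, int(q * len(item_ids))), replace=False)
    scale = len(item_ids) / len(Z)
    for x, y in batch:
        u, v = U[x].copy(), V[y].copy()
        margins = np.maximum(1.0 - u @ v + V[Z] @ u, 0.0) * (Z != y)
        r = scale * margins.sum()              # mini-batch margin rank (3.8)
        w = lr / (r + 1.0)                     # lr * l'(r) for l(r) = log(r + 1)
        active = margins > 0
        n_act = int(active.sum())
        # chain rule (Eq. 3.10): l'(r) times the gradient of r w.r.t. each factor
        U[x] -= w * scale * (V[Z][active].sum(axis=0) - n_act * v)
        V[y] += w * scale * n_act * u
        V[Z[active]] -= w * scale * u
```

The positive item is pulled toward the user factor and the violating negatives are pushed away, all scaled by the loss derivative, so items already near the top (small r) receive the largest effective step.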
Comparisons to Lambda-based methods. Lambda-based methods such as LambdaNet and
LambdaRank use an updating strategy of the following form:

    \Delta f_{ij} = \lambda_{ij} \, |\Delta NDCG_{ij}|,    (3.11)

where \lambda_{ij} is a regular gradient term and |\Delta NDCG_{ij}| is the NDCG difference of the ranking results
for a specific user if the positions (ranks) of items i and j are switched; it acts as a scaling term
motivated by directly optimizing the ranking measure.
Comparing (3.10) and (3.11): \ell'(r) replaces |\Delta NDCG_{ij}| and plays a similar role. In our
case, \ell'(r) is decreasing in r, so it gives a higher incentive to fix errors on items with small rank values.
However, instead of directly computing the NDCG difference, which can be computationally
expensive in large-scale settings, the proposed algorithm first approximates the rank and then
computes the scaling value through the derivative of the loss function.
3.5 Related work
Top accuracy in traditional ranking. Top accuracy is a classical challenge for ranking
problems. One typical approach is to develop smooth surrogates of the non-smooth ranking
Table 3.1: Dataset statistics. U: users; I: items; S: interactions.

Data      |U|        |I|       |S_train|   |S_test|
ML-20m    138,493    27,278    7,899,901   2,039,972
Yelp      1,029,433  144,073   1,917,218   213,460
XING      1,500,000  327,002   2,338,766   484,237
metric. [Weimer et al., 2008] use structured learning and propose smooth convex upper bounds
of a ranking metric. [Taylor et al., 2008] develop a smooth approximation based on connections
between score distributions and rank distributions. Similarly, in a kernel SVM setting,
[Agarwal, 2011] proposes a formulation based on the infinity norm. [Boyd et al., 2012] generalize
the notion of top-k to top-quantile and optimize a convex surrogate of the corresponding ranking loss.
Alternatively, [Burges et al., 2005] start with the average precision loss and modify model
updating steps to promote top accuracy. Similar ideas are developed in [Burges et al., 2007,
Wu et al., 2010]: [Burges et al., 2007] write down the gradient directly rather than deriving it
from a loss, and [Wu et al., 2010] work with a boosted tree formulation.
Personalized ranking. Traditional work on personalized ranking does not necessarily focus
on top accuracy. [Hu et al., 2008] first study the task and convert it into a regression problem.
[Rendle et al., 2009] introduce ranking-based optimization and optimize a criterion similar to
AUC. [He et al., 2016b] improve matrix factorization by giving missing items non-uniform
weights and devising an ALS-based solver.
To promote top accuracy and handle large-scale settings, [Shi et al., 2012a] develop a
rank approximation based on model scores and propose a smooth approximation of MAP;
[Shi et al., 2012b] adopt the same idea and target MRR.
Pairwise algorithms [Weston et al., 2010, Yuan et al., 2016] were then proposed to estimate
ranks through sampling methods. [Weston et al., 2010] update the model based on an ordered
weighted average operator. [Yuan et al., 2016] use a similar idea as [Burges et al., 2005].
Table 3.2: Recommendation accuracy comparisons (in %). Results are averaged over 5 experiments
with different random seeds. Best and second-best numbers are in bold and italic, respectively.

Datasets |       ML-20m      |        Yelp       |        XING
Metrics  | P@5   R@30  N@30  | P@5   R@30  N@30  | P@5   R@30  N@30
POP      | 6.2   10.0  8.5   | 0.3   0.9   0.5   | 0.5   2.7   1.3
BPR      | 6.1   10.2  8.3   | 0.1   0.4   0.2   | 0.3   2.2   0.9
b-BPR    | 9.3   14.3  12.9  | 0.9   3.4   1.9   | 1.3   9.2   4.2
A-WARP   | 10.1  13.3  13.5  | 1.3   4.3   2.5   | 2.6   11.6  6.7
CE       | 9.6   14.3  13.2  | 1.4   4.5   2.6   | 2.5   12.3  6.5
SR-log   | 9.9   14.5  13.6  | 1.4   5.2   2.9   | 2.8   12.3  6.9
MR-poly  | 10.2  14.8  13.9  | 1.5   5.2   2.9   | 2.8   12.5  6.9
MR-log   | 10.2  14.6  13.9  | 1.5   5.1   2.9   | 2.9   12.5  7.1
SMR-log  | 10.2  14.6  13.9  | 1.5   5.4   3.0   | 2.9   12.5  7.1
3.6 Experiments
In this section we conduct experiments on three large-scale real-world datasets to verify the
effectiveness of the proposed methods.
3.6.1 Setup
Dataset
We validate our approach on three public datasets from different domains: 1) movie recommendations;
2) business reviews at Yelp; 3) job recommendations from XING (www.xing.com). We describe the
datasets in detail below.
MovieLens-20m contains anonymous ratings made by MovieLens users (www.movielens.org). We transform
the data into binary labels indicating whether a user rated a movie above 4.0. We discard users with
fewer than 10 movie ratings and use a 70%/30% train/test split. Attributes include movie genres
and movie title text.
Yelp comes from the Yelp Challenge (https://www.yelp.com/dataset_challenge, downloaded Feb. 2017). We work on recommending which
business a user might want to review. Following the online protocol in [He et al., 2016b], we
sort all interactions in chronological order, take the last 10% for testing, and use the rest for
training. Business items have attributes including city, state, categories, hours, and amenities
(e.g., "valet parking," "good for kids").
XING contains about 12 weeks of interaction data between users and items on XING. The train/test
split follows the RecSys Challenge 2016 [Abel et al., 2016] setup, where the last two weeks
of interactions for a set of 150,000 target users are used as test data. Rich attributes such as
career levels, disciplines, locations, and job descriptions are associated with the data. Our task is to recommend
to users a list of job posts with which they are likely to interact.
We report dataset statistics in Table 3.1.
Methods
We study multiple algorithms under our learning framework and compare them to various baseline
methods. Particularly, we study the following algorithms:
POP. A naive baseline that recommends items by popularity.
BPR [Rendle et al., 2009]. BPR optimizes AUC and is a widely used baseline.
b-BPR. A batch version of BPR. It uses the same logistic loss but updates a target item and a batch
of negative items in every step.
A-WARP [Weston et al., 2010, Hong et al., 2013]. A state-of-the-art pairwise personalized ranking
method.
CE. Cross-entropy loss, recently used for item recommendation in [Hidasi et al., 2015,
Covington et al., 2016].
SR-log. The proposed algorithm with the smoothed rank approximation without margin formulation
(sr) and the logarithmic loss function (log).
MR-poly. The proposed algorithm with the margin rank approximation (mr) and the polynomial loss
function (poly).
MR-log. The proposed algorithm with the margin rank approximation (mr) and the logarithmic loss
function (log).
SMR-log. The proposed algorithm with the suppressed margin rank approximation (smr) and the
logarithmic loss function (log).
BPR and A-WARP are implemented with LightFM [Kula, 2015]; we implemented the other algorithms.
We apply the algorithms to hybrid matrix factorization [Shmueli et al., 2012], a factorization model
that represents users and items as linear combinations of their attribute embeddings. Model
parameters f therefore include the factors of users, items, and their attributes.
Early stopping on a development set split from the training data is used for all models. The
model size is tuned in {10, 16, 32, 48, 64}; the learning rate is tuned in {0.5, 1, 5, 10}; when applicable,
the dropout rate is 0.5. Batch-based approaches are implemented in TensorFlow 1.2 on a single GPU
(NVIDIA Tesla P100). LightFM runs on Cython with a 5-core CPU (Intel Xeon 3.30GHz).
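The hybrid scoring can be sketched as follows, assuming a single shared attribute-embedding table E (our simplification for illustration; the actual model keeps separate user, item, and attribute factors):

```python
import numpy as np

def hybrid_score(E, user_attrs, item_attrs):
    """Hybrid MF score f_y(x): user and item factors are the sums of
    the embeddings of their attributes, rows of a shared table E."""
    u = E[list(user_attrs)].sum(axis=0)   # user = sum of its attribute rows
    v = E[list(item_attrs)].sum(axis=0)   # item = sum of its attribute rows
    return float(u @ v)
```

Because factors are sums of attribute embeddings, a new item with known attributes gets a sensible score even with no interaction history.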
Metrics
We assess the quality of recommendation results by comparing a model's recommendations to ground-truth
interactions, and report Precision (P), Recall (R), and Normalized Discounted Cumulative Gain
(NDCG) scores. On the Yelp and ML-20m datasets we remove historical items from each user's recommendation
list before scoring, because users seldom re-interact with items in these scenarios (Yelp
reviews, movie ratings). This improves performance for all models but does not change the relative
comparison results.
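For reference, the per-user metrics can be computed as follows (a standard formulation; the thesis does not spell out its exact NDCG variant, so this binary-relevance version is one common choice):

```python
import math

def precision_recall_ndcg(ranked, relevant, k):
    """P@k, R@k, and NDCG@k for one user; `ranked` is the recommended
    item list, `relevant` the set of ground-truth items."""
    hits = [1.0 if y in relevant else 0.0 for y in ranked[:k]]
    p = sum(hits) / k
    r = sum(hits) / max(1, len(relevant))
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return p, r, dcg / idcg if idcg > 0 else 0.0
```

Averaging each metric over all test users yields the numbers of the form reported in Table 3.2.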
3.6.2 Results
Quality of rank approximations
We first study how well the proposed methods approximate the true item ranks. To do so, we run one
epoch of training on the XING dataset and compute both the values in (3.5) and the true item ranks. We plot the
value of (3.5) as a function of the true rank in Figure 3.3.
Figure 3.3: Approximated rank values compared to the true rank values: (a) the range 0-50,000;
(b) a zoomed-in version (0-200) with error bars.
Figure 3.3(a) shows that over a very large range (0-50,000) the estimator in (3.5) is linearly aligned
with the true item rank, especially when true item ranks are small; note that those regions are what we
care most about. We further zoom into the top region (0-200) and plot error bars in addition to
mean values. In Figure 3.3(b), we see limited variance; for example, the relative standard deviation is
smaller than 0.1, which indicates a stable rank approximation. As a comparison, the simulation in Figure
3.2 suggests that stochastic pairwise estimation leads to a relative standard deviation well above 6.
Recommendation accuracy
Recommendation scores are reported in Table 3.2. We have the following observations.
First, vanilla pairwise BPR performs poorly due to the large itemset sizes. In contrast, the batch version b-BPR
bypasses the difficulty of sampling negative items and updates model parameters smoothly, achieving
decent accuracy.
Second, A-WARP and CE outperform b-BPR; both methods target more than average precision.
Compared with A-WARP, CE penalizes more heavily a correct item that ranks behind incorrect items, and thus
shows consistently better recall.
Table 3.3: Dataset/model complexity comparisons.

Dataset      ML-20m   Yelp     XING
Density      2.6e-3   1.4e-5   5.8e-6
# of param.  4.6M     9.3M     12.1M
# of Attr.   11       19       33
|I|          27K      144K     327K
Complexity   small    medium   large
Third, the proposed methods consistently outperform A-WARP and CE. Compared to CE, the
improvements suggest the effectiveness of the rank-sensitive loss, which works better than the
classification loss. Compared to A-WARP, we attribute the improvements to better rank approximations and
possibly other factors such as smoother parameter updates.
Finally, we compare the different variants of the proposed methods. SR-log underperforms, which
suggests the benefit of the margin formulation. Polynomial loss functions give results similar to
logarithmic functions but require some tuning of the hyper-parameter p. Suppressed margin rank (SMR)
performs slightly better than MR, probably due to its better rank approximation.
Time efficiency
We study the time efficiency of the proposed methods and compare them to the pairwise algorithm
implementations. Note that we are not interested in comparing absolute numbers alone, because they
involve multiple factors. Rather, we focus on how time efficiency changes as the data scale
increases.
We characterize dataset complexity in Table 3.3 by the density (computed as #observations / (#users x #items)), the
total number of parameters, the average number of attributes per user-item pair, and the itemset size. From
Table 3.3, ML-20m has the densest observations, the smallest numbers of total parameters and attributes
per user-item pair, and the smallest itemset size; we therefore call it "small" in complexity. Conversely, we
call XING "large" in complexity, and Yelp, which falls between the two, "medium."
Two results are reported in Figure 3.4. Figure 3.4(a) shows, across the datasets, the time
needed to converge to the best model for the two systems, WARP and BARS (ours). WARP takes a
shorter time on both "small" and "medium," but its running time increases very quickly; BARS's training
time grows more slowly, and it wins at "large."
Figure 3.4: Training time comparisons between WARP and BARS. (a) Training time across datasets of
different scales (hours to converge on ML-20m/Yelp/XING: WARP 1/5/31 vs. BARS 3/10/14);
(b) averaged epoch time (minutes) as training progresses (WARP grows from 121 to 427 across
epoch groups 1-5 through 21-25, while BARS stays constant at 85).
Table 3.4: Comparisons of objective values (obj) and recommendation accuracies (NDCG) on the
development set for full-batch and sampled-batch algorithms. q = |Z|/|Y|; q = 1.0 means full batch.

 q     |   ML-20m    |    Yelp     |    XING
       | obj   NDCG  | obj   NDCG  | obj   NDCG
 1.0   | 6.49  16.0  | 7.94  3.0   | 6.42  9.9
 0.1   | 6.47  15.9  | 7.87  3.0   | 6.40  10.0
 0.05  | 6.47  15.9  | 7.90  2.9   | 6.42  9.8
Figure 3.4(b) depicts the averaged epoch training time of the two systems. BARS has a constant
epoch time, whereas WARP's time per epoch keeps increasing. This is expected: as the model
improves, many more sampling iterations are needed at each step.
Robustness to mini-batch size
The full-batch algorithm is used in the experiments above. We are also interested in how the method performs
with a sampled batch loss. In Table 3.4 we report loss values and NDCG@30 scores on the development
split and compare them to the full-batch versions. With a sampling proportion of 0.1 or 0.05, the sampled
algorithm gives almost identical results to the full-batch version on all datasets, which suggests the
robustness of the algorithm to the mini-batch size.
3.7 Summary
In this work we address the personalized ranking task and propose a learning framework that exploits the
ideas of batch training and rank-dependent loss functions. The proposed methods allow more accurate
rank approximations and empirically give competitive results.
In designing the framework, we purposely tailored our approach to the use of parallel computation and
support for back-propagation updates. This readily lends itself to flexible models such as deep feedforward
networks, recurrent neural networks, etc. In the future, we are interested in exploring the algorithm in the
training of deep neural networks.
Chapter 4
Sequence Modeling on Recommendation with Implicit
Feedback
In this chapter, we introduce our study of implicit recommendation with sequence modeling. We conjecture
that discriminative information is encoded in the sequence of user feedback and can be used to
improve recommendation accuracy. Three sequence models are designed and investigated on four
tasks. Experiments show significant improvements compared to state-of-the-art non-sequence recommendation
models. We also conduct exploratory studies that shed light on how feedback sequences help.
4.1 Motivation
Recommendation with implicit feedback faces the challenges of sparse and noisy observations. Most users
have very limited interactions with extremely small portions of the item set, and often give no negative
feedback [Hu et al., 2008]. Moreover, user-item interactions are implicit and hard to interpret [Hu et al., 2008,
Liu et al., 2015].
To address these challenges, traditional matrix factorization (MF) modeling has been extensively
exploited to perform collaborative filtering. [Hu et al., 2008] fit the problem into the MF framework
by converting implicit feedback data into preference scores and minimizing a least-squares pointwise
loss. [Rendle et al., 2009, Rendle and Freudenthaler, 2014] developed pairwise rank-based algorithms
together with sampling techniques that directly optimize the returned item ranking. [Usunier et al., 2009,
Weston et al., 2010, Shi et al., 2012a] further adjusted loss functions and sampling techniques to focus on
optimizing the quality of the top positions in a ranked list.
During model training, these methods usually rely on different assumptions about a user's unobserved
items (e.g., assigning zero scores and treating them as negative samples), which can be restrictive. Moreover,
they treat user-item interactions as i.i.d. events, thus failing to explicitly model correlations within
implicit feedback. To address these limitations, sequential classification has been explored as a new modeling
paradigm. [Grbovic et al., 2015] explored sequence classification in e-commerce product recommendation,
tailoring Word2Vec [Mikolov et al., 2013] techniques to predict the next item given a user's activity, and
showed good scalability and efficiency.
Despite their simplicity and great success, bilinear models (e.g., MF and Word2Vec) are limited
in flexibility and expressiveness. [Hidasi et al., 2015] first applied recurrent neural
networks [Hochreiter and Schmidhuber, 1997] (RNNs) to session-based recommendation. They
devised GRU-based RNNs and demonstrated good performance with one-hot item
encodings. [Hidasi et al., 2016] extends [Hidasi et al., 2015] by building a parallel structure to take in
extracted visual features in the input layer.
However, [Hidasi et al., 2015, Hidasi et al., 2016] investigated session-based recommendation, a narrower
domain of implicit recommendation where users are anonymous and user behavior has strong sequence
properties. It is unknown whether sequence models are suitable for general implicit
recommendation. Meanwhile, [Hidasi et al., 2015, Hidasi et al., 2016] and [Grbovic et al., 2015] do not
take profiles or metadata into account in modeling.
We bridge the gap by empirically studying sequence models on item recommendation. We combine
sequence models with the attribute embedding described in Chapter 6. Extensive experiments show that
sequence models are more efficient than non-sequence models, and that both sequence modeling and attribute
embedding are essential to improving performance. Exploratory studies are conducted to shed light on
how feedback sequences help recommendation.
4.2 Related work
Sequence modeling is explored in e-commerce product recommendation by [Grbovic et al., 2015]. Word2Vec techniques are tailored to product recommendation and show good scalability and efficiency. [Vasile et al., 2016] extends [Grbovic et al., 2015] to incorporate metadata in a multitask learning fashion and shows improved performance on a music dataset.
[Hidasi et al., 2015] first applies recurrent neural networks (RNNs) to session-based recommendation. They devise GRU-based RNNs and demonstrate good performance with one-hot-encoded item embeddings. [Hidasi et al., 2016] extends [Hidasi et al., 2015] by building a parallel structure to take in extracted visual features in the input layer.
Time-sensitive recommendation algorithms also study sequence modeling. [Kapoor et al., 2015] builds a semi-Markov model to predict return items from the history.
[Du et al., 2015] designs a low-rank Hawkes-process-based model to predict user recurrent activities. These algorithms focus on users' recurrent behavior patterns instead of feedback correlations. Meanwhile, they usually operate on a much smaller item pool.
4.3 Sequence modeling approaches
The most popular recipe for implicit recommendation is to choose ℓ to be a pairwise loss (e.g., [Rendle et al., 2009, Weston et al., 2010]). It takes each observed user-item pair and samples or selects one or more "negative items" to compute a loss value before summing up the final loss. In addition to issues like sampling bias, we believe there are two other major limitations. First, pairwise losses ignore correlations among items interacted with by the same user, because they treat user-item pairs as independent. Second, a pairwise loss does not directly target next-step recommendation accuracy, which might make learning less efficient.
To this end, we choose sequence approaches to directly model recommendation. First, the items seen by a user u are sorted chronologically as a sequence {u : i_1, i_2, ..., i_{T_u}}. In what follows, we describe our three sequence models for recommendation tasks and attribute embedding. The first two models are inspired by the Word2Vec [Mikolov et al., 2013] algorithms, and we call them Skip-gram-rec and CBOW-rec. Finally, we explore a more flexible sequence recommendation model based on LSTMs (LSTMs-rec).
4.3.1 Skip-gram-rec
To capture item/attribute patterns across the sequence, we first choose f to be

h_n = f(q_u, q_{i,1}, ..., q_{i,n}, ..., q_{i,T_u}) = q_u + q_{i,n},   ∀ n = 1, ..., t−1    (4.1)

for each user interaction sequence (length t), and minimize the following objective,

L = −Σ_{u∈U} Σ_{n=1}^{t−1} Σ_{l=1}^{w} log σ(s_n^{i_{n+l}})    (4.2)

where σ(s_n^i) = exp(s_n^i) / Σ_{j∈I} exp(s_n^j) is the softmax function and s_n^i is computed from h_n as in Eq. 6.7.
The model is trained such that the representation of a user-item pair (with their attribute information encoded) should be able to predict the items appearing in the future up to a fixed time window (w).
Note that while it is inspired by the skip-gram model [Mikolov et al., 2013], l in Eq. 4.2 takes only positive values, meaning each user-item pair is trained to predict only future items (instead of both future and past items). This better aligns training and testing and empirically gives a performance advantage.
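As a concrete illustration, the objective of Eq. 4.2 for a single user sequence can be sketched in plain Python. The dot-product scoring of candidate items (s_n^j as an inner product with a per-item output embedding) is an assumption standing in for Eq. 6.7, and the toy embeddings are made up:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def skipgram_rec_loss(q_u, item_in, item_out, seq, w):
    """Negative log-likelihood of Eq. 4.2 for one user sequence.
    q_u: user embedding; item_in/item_out: input/output item embeddings
    (indexed by item id); seq: chronologically ordered item ids."""
    t = len(seq)
    loss = 0.0
    for n in range(t - 1):
        # Eq. 4.1: h_n = q_u + q_{i,n}
        h = [a + b for a, b in zip(q_u, item_in[seq[n]])]
        # Score every candidate item j: s_n^j = <h, item_out[j]>
        scores = [sum(hi * oj for hi, oj in zip(h, out)) for out in item_out]
        probs = softmax(scores)
        # Predict each *future* item within the window (l > 0 only)
        for l in range(1, w + 1):
            if n + l < t:
                loss -= math.log(probs[seq[n + l]])
    return loss
```

In practice the full softmax over |I| items would be approximated (e.g., with negative sampling, as in Word2Vec); the exhaustive version above is only for exposition.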
4.3.2 CBOW-rec
Skip-gram-rec is able to connect different items across a sequence during training. However, it relies on only one item (together with the user) to predict the future, which may be too restrictive an assumption. Moreover, Skip-gram-rec struggles to predict several items in a future time window, while what we care about most is "next-step" item recommendation. This mismatch could make training less efficient as well.
We propose CBOW-rec to address these issues. f is chosen to be

h_n = f(q_u, q_{i,1}, ..., q_{i,n}, ..., q_{i,T_u}) = q_u + (1/m) Σ_{l=1}^{m} q_{i,r_l},   r_l ∈ {n−w+1, ..., n},  ∀ n = 1, ..., t−1    (4.3)

where m items are sampled from a history window of size w. The objective function is

L = −Σ_{u∈U} Σ_{n=1}^{t−1} log σ(s_n^{i_{n+1}})    (4.4)
Compared to Skip-gram-rec, CBOW-rec relies on several sampled items that a user has interacted with so far (together with the user) to predict the next item. Empirically, we also find that it gives more accurate recommendations.
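A sketch of the context construction in Eq. 4.3, with toy list-based embeddings; sampling with replacement from the window is our assumption, as the exact sampling scheme is not pinned down above:

```python
import random

def cbow_rec_context(q_u, item_in, seq, n, w, m, rng=random):
    """Eq. 4.3: h_n = q_u + (1/m) * sum of m items sampled from the
    window {n-w+1, ..., n} of the user's item sequence."""
    window = seq[max(0, n - w + 1):n + 1]
    sampled = [rng.choice(window) for _ in range(m)]
    d = len(q_u)
    avg = [sum(item_in[i][k] for i in sampled) / m for k in range(d)]
    return [q_u[k] + avg[k] for k in range(d)]
```

The returned h_n is then scored against candidate items exactly as in Skip-gram-rec, but against only the next item i_{n+1}.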
4.3.3 LSTMs-rec
The empirical advantages of CBOW-rec over Skip-gram-rec also motivate us to explore even longer-term sequence properties and more flexible modeling. Thus, we build another sequence recommendation model based on LSTMs. Specifically, we choose

h_n, c_n = LSTM(q_u + q_{i,n−1}, h_{n−1}, c_{n−1}),   ∀ n = 1, ..., t    (4.5)

where a special "START" symbol is used as the very first item, i.e., q_{i,0}, and h_0 = c_0 = 0. Similarly as in CBOW-rec (Eq. 4.4), it is trained by optimizing prediction accuracy on every item in the sequence.
With the hidden states c and h in Eq. 4.5 carrying sequence information, LSTMs-rec is able to utilize longer-term observations of users, items, and attributes for future prediction, compared to Skip-gram-rec and CBOW-rec. Meanwhile, the highly non-linear function mappings inside LSTM cells provide more modeling flexibility in attribute embedding.
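To make Eq. 4.5 concrete, here is a minimal pure-Python LSTM step and the recurrence over one user sequence. The dense toy weights are made up, and in the full model each hidden state h_n would then score items as in Eq. 6.7:

```python
import math

def lstm_step(x, h, c, W):
    """One standard LSTM update. W maps each gate name ('i','f','o','g')
    to (W_x, W_h, b): input weights, recurrent weights, bias."""
    def lin(name):
        Wx, Wh, b = W[name]
        return [sum(w * v for w, v in zip(rx, x)) +
                sum(w * v for w, v in zip(rh, h)) + bb
                for rx, rh, bb in zip(Wx, Wh, b)]
    sig = lambda v: 1.0 / (1.0 + math.exp(-v))
    i = [sig(v) for v in lin('i')]        # input gate
    f = [sig(v) for v in lin('f')]        # forget gate
    o = [sig(v) for v in lin('o')]        # output gate
    g = [math.tanh(v) for v in lin('g')]  # candidate cell state
    c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
    h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
    return h_new, c_new

def lstms_rec_states(q_u, item_in, seq, W, start):
    """Eq. 4.5: feed x_n = q_u + q_{i,n-1}, using a START embedding as
    q_{i,0} and h_0 = c_0 = 0; returns every hidden state h_n."""
    d = len(q_u)
    h, c = [0.0] * d, [0.0] * d
    states, prev = [], start
    for item in seq:
        x = [a + b for a, b in zip(q_u, prev)]
        h, c = lstm_step(x, h, c, W)
        states.append(h)
        prev = item_in[item]
    return states
```

A production model would of course use a vectorized LSTM implementation; the sketch only fixes the input construction and state recurrence described above.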
4.4 Comparison with non-sequence models
Datasets and Evaluation
We validate our approach on four large-scale datasets from three different domains: 1) job recommendation from XING (www.xing.com); 2) business reviews at Yelp; 3) movie recommendation. We describe the datasets in detail below. Data statistics are listed in Table 4.1. More detailed information about the attributes is described in Sec. 2.3.
XING contains about 12 weeks of interaction data between the users and the items on XING. Train/test splitting follows the RecSys Challenge 2016 [Abel et al., 2016] setup, where the interactions of a set of 150,000 target users in weeks 13 and 14 are used as test data. Rich attributes are associated with users and items, such as career levels, disciplines, locations, and job descriptions. Our task is to recommend to users a list of job posts that they are likely to interact with.
Yelp comes from the Yelp Challenge dataset (https://www.yelp.com/dataset_challenge, downloaded in Feb. 2017). We work on recommending which businesses a user might want to review. Following the online protocol of [He et al., 2016b], we sort all interactions in chronological order and take the last 10% for testing and the rest for training. Business items have attributes including city, state, categories, hours, and attributes (e.g., "valet parking," "good for kids").
MovieLens-1m and MovieLens-20m [Harper and Konstan, 2016] contain anonymous ratings made by MovieLens users (www.movielens.org). We transform the data into binary labels indicating whether a user rated a movie above 4.0. We discard users with fewer than 10 movie ratings and use a 70%/30% train/test split. Attributes include movie genres, movie title text, and user profiles (age, gender, occupation).
Table 4.1: Statistics of data sets.

Data     |U|        |I|      |S_train|  |S_test|
XING     1,500,000  327,002  2,338,766  484,237
Yelp     1,029,433  144,073  1,917,218  213,460
ML-1m    6,040      3,883    457,322    117,297
ML-20m   138,493    27,278   7,899,901  2,039,972
Table 4.2: Recommendation accuracy comparisons (in %). Best and second-best single-model results are in bold (e.g., 8.73) and italic (e.g., 7.41), respectively. Scores on Yelp, ML-1m, and ML-20m are calculated after removing history items from recommendation for all models. (P: P@5; R: R@30; N: NDCG@30)
Dataset XING Yelp ML-1m ML-20m
Metrics P R N P R N P R N P R N
POP 0.49 2.74 1.27 0.27 0.92 0.51 7.75 11.1 9.59 6.15 10.0 8.49
WARP 2.59 8.32 5.57 1.25 4.37 2.47 9.78 18.0 14.0 9.79 14.2 13.4
A-WARP 2.63 11.6 6.74 1.31 4.30 2.49 10.0 18.0 14.2 10.1 13.3 13.5
NHMF 2.78 12.9 7.23 1.40 4.54 2.64 10.2 16.2 13.5 10.2 14.6 13.9
SG 2.50 12.3 6.51 1.40 4.77 2.66 15.2 23.5 19.6 11.3 17.0 15.4
CBOW 2.78 13.5 7.41 1.56 5.40 3.04 16.9 24.7 21.0 12.6 19.0 17.2
LSTMs 3.43 14.5 8.73 1.78 6.38 3.57 19.6 29.4 24.4 17.5 28.2 24.0
LSTMs* 3.57 15.0 9.00 1.90 6.65 3.77 20.5 30.0 25.1 18.5 33.6 26.2
lift (%) 35.7 29.3 33.5 45.0 52.2 51.4 105 66.7 76.8 83.2 137 94.1
Results
The recommendation performance of different models is reported in Table 4.2. We highlight three observations.
First, A-WARP achieves better NDCG@30 scores than WARP on all four datasets, but often with small margins. This suggests that such simple models might not be efficient enough to fully utilize the attributes. While NHMF has the same model formulation, it generally outperforms A-WARP (winning 10 entries out of 12 over WARP and A-WARP). We attribute this to its better regularization via Dropout.
Second, CBOW-rec outperforms all baseline models significantly and consistently on all datasets. Comparing Eq. 4.4 to that of NHMF, where CBOW-rec and NHMF differ only in whether recently interacted items are used, we easily see the crucial benefit brought by sequence modeling. Skip-gram-rec also performs well on the movie datasets but clearly under-performs CBOW-rec. We believe this is due to the issues mentioned earlier.
Table 4.3: LSTMs-rec relative scores on the original and "manipulated" XING datasets. "Original" denotes the complete sequences. xN denotes the manipulated data set obtained by randomly sampling sub-sequences proportional to N times.

Sampling  Original  x1    x2    x4    x8
Score     1.0       0.97  0.94  0.86  0.84
Finally, LSTMs-rec achieves dramatic improvements over CBOW-rec, making it the best single model, before an ensemble of models further boosts accuracy. We highlight the significant improvement with the relative score lifts over the best of the baseline models (i.e., POP, WARP, A-WARP) in the last row of Table 4.2.
4.5 How sequence modeling helps
To investigate how item sequences influence LSTMs-rec’ performance, we conducted additional exper-
iments on XING and report two of our findings below. Scores on development set are used in these
experiments. We find they are strongly correlated with test scores.
4.5.1 Should we sample feedback or not?
When applying sequence modeling to the recommendation problem, we implicitly assume that sequence order provides additional information beyond that provided by item frequency alone. To test the validity of this assumption, we generated new training data by sampling sub-sequences, in which items in a user's item sequence were dropped with a certain probability. At the same time, on average, item appearance frequency remained unchanged in data with more sampled sub-sequences.
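One plausible reading of this manipulation (our assumption; the exact scheme is not spelled out above) is to keep each item with probability 1/N and repeat the sampling N times, so that expected item frequency is preserved while orderings are fragmented:

```python
import random

def sample_subsequences(seq, n_copies, rng):
    """Return n_copies sub-sequences of seq, keeping each item with
    probability 1/n_copies, so each item's expected total count across
    all copies matches its count in the original sequence."""
    keep_p = 1.0 / n_copies
    return [[item for item in seq if rng.random() < keep_p]
            for _ in range(n_copies)]
```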
Experimental results with the newly generated training data are shown in Table 4.3. First, increasing sub-sequence sampling leads to decreasing scores (from x1 to x8); second, the original data set (full sequences) gives the best score. These results suggest that item sequences do indeed provide additional information.
4.5.2 What if data scale changes?
Another way to test our assumption is to increase (or decrease) the number of observed sequences and to
compare performance between non-sequence and sequence models. While both types of models observe
Figure 4.1: Results on XING when an increasing proportion of data is used (proportion ∈ {0.2, 0.44, 0.76, 1.0}). LSTMs-rec (red) performance improves steadily while that of NHMF (blue) does not. Panels: (a) Precision@5, (b) Recall@30, (c) NDCG, (d) SCORE.
the same total number of user-item pairs, sequence models might have a chance to extract more useful information from the data.
We train NHMF and LSTMs-rec on XING and still evaluate on the target users. We gradually increase the training observations from those of the target users to those of a superset of users until all users are included. The percentages of total interactions used in the four experiments are 20%, 44%, 76%, and 100%, respectively.
Results are reported in Figure 4.1. We see that across the scores returned by different metrics ("SCORE" is the metric used in RecSys Challenge 2016 [Abel et al., 2016]), LSTMs-rec improves steadily as the data scale increases (more sequences observed). On the contrary, NHMF in general does not improve. This validates our hypothesis that LSTMs-rec successfully utilizes helpful sequence information and benefits from it. From another perspective, it suggests that sequence models may be even better suited to larger-scale recommendation from implicit feedback.
4.6 Summary
In this chapter, we argue that the correlation within the feedback contains useful information for capturing user preferences. We empirically study three sequence models. For the first time, we study recurrent neural networks in the general item recommendation domain. We demonstrate that our sequence models significantly improve on state-of-the-art model performance.
Furthermore, exploratory studies are conducted to answer the questions of how sequence feedback helps recommender systems and how sequence models behave with changes in data scale. Our results indicate that the feedback sequences contain additional discriminative information beyond item frequency, and that sequence models tend to perform better as the data scale increases, while non-sequence models do not.
Chapter 5
Temporal Learning Approaches to a Job Recommender
System
In this chapter we introduce a novel temporal learning method applied to a job recommender system. The method is designed to learn a set of temporal reweighting coefficients associated with time. The method is first used in a model that recommends historical items. It is then integrated into hybrid matrix factorization (HMF) model training, serving as an alternative, fast temporal learning method. Despite its simplicity, we observe improved performance in both models and better scalability in the HMF model.
5.1 Motivation
It is important to model temporal dynamics in recommender systems. Prior works observe clear improvements from incorporating temporal information into recommendation algorithms [Rendle, 2010, Koren, 2010, Koenigstein et al., 2011, Lee et al., 2009]. Temporal learning helps to capture important factors such as changes in item popularity, the evolution of user preferences, periodic user interests, etc.
Time-aware models [Koren, 2010, Rendle, 2010, Koenigstein et al., 2011] explicitly model temporal dynamics in neighborhood models and factorization models. Biases or latent factors of users and items are modeled as the sum of a static component and a time-variant component. The static component represents the "average" popularity or interest, and the time-variant one captures the change over time.
Tensor factorization based models [Chi and Kolda, 2012] are also widely used to model temporal
dynamics. They incorporate temporal information by generalizing matrix factorization to include other
factors like time, location, and other features [Bhargava et al., 2015, Karatzoglou et al., 2010]. Time fac-
tors, often represented by one separate dimension of the tensors, interact with other latent factors to fit
user feedback [Xiong et al., 2010].
The above approaches focus on predicting new items and do not specifically make an effort to recommend items that have appeared in the history. However, the temporal patterns with which users return to previously interacted items are also very important. To address that, [Kapoor et al., 2015] builds a semi-Markov model to predict return items from the history. [Du et al., 2015] designs a low-rank Hawkes-process-based model to capture user recurrent activities.
In designing our system for a job recommendation task, one drawback of the above approaches is their computational cost. Additional factors in the models increase the computation in both model training and testing, and the increase is often significant. It is desirable to model the temporal dynamics without increasing the running time, or ideally while reducing it. We therefore turn to traditional temporal reweighting techniques, which introduce few additional parameters and add little computational cost. For example, [Abel et al., 2011] analyzes temporal dynamics in Twitter profiles with time-based weighting and observes clear improvements. [Lee et al., 2009] conducts an empirical study of the effect of temporal dynamics in e-commerce recommendation; their reweighting scheme also shows improvements.
However, the weights in those techniques are often pre-defined based on domain knowledge or tuned on a separate data split. Pre-defined weights can be suboptimal even when there is enough data to learn better ones, and tuning becomes computationally impractical once there are more than a few weight coefficients. We need a data-driven approach that finds these weights automatically.
To this end, we develop a simple yet effective approach to learn temporal weight coefficients from the data. It learns the weights by carefully constructing a set of positive-negative item pairs and fitting a linear ranking model that takes historical temporal counts as input. The learned weights are applied in two models. First, they are used directly to recommend returned items. Second, they are incorporated into hybrid matrix factorization model training. On the job recommendation task we study, this approach helps both models achieve significant improvements. It also shows much better time efficiency for the factorization model.
5.1.1 Job recommendation
The problem of matching job seekers to postings [Malinowski et al., 2006] has attracted much attention from both academia and industry (e.g., XING (https://www.xing.com/) and LinkedIn) in recent years. RecSys Challenge 2016 is organized around a particular flavor of this problem. Given the profiles of users, job postings (items), and their interaction history on XING, the goal is to predict a ranked list of items of interest to a user.
5.1.2 Task
RecSys Challenge 2016 provides 12 weeks of interaction data for a subset of users and job items from the social networking and job search website XING.com. The task is to predict the items that a set of target users will positively interact with (click, bookmark, or reply to) in weeks 13 and 14.
Data Set. Users and items are described by a rich set of categorical, numerical, and descriptor features. Categorical features take several to dozens of values, and descriptor features have a vocabulary size of around 100K. Observations, including positive interactions and impressions (items shown to users by XING's existing recommendation system) in different weeks, are also available. The data statistics are in Table 2.1 and the attribute description is in Fig. 2.1.
5.1.3 Evaluation metric
Given a user, the goal of this challenge is to predict a ranked list of items from the active item set. The evaluation score is a sum over per-user scores S(u), defined as follows:

S(u) = 20 · (P@2 + P@4 + R + UserSuccess) + 10 · (P@6 + P@20)

where P@N denotes the precision at N, R is the recall, and UserSuccess equals 1 if there is at least one item correctly predicted for that user.
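The per-user score can be computed directly from a ranked prediction list; the sketch below assumes the conventional definition P@N = (hits in top N) / N:

```python
def challenge_score(ranked, relevant):
    """RecSys Challenge 2016 per-user score S(u) as defined above.
    ranked: predicted item list; relevant: set of ground-truth items."""
    def precision_at(n):
        return sum(1 for i in ranked[:n] if i in relevant) / n
    hits = sum(1 for i in ranked if i in relevant)
    recall = hits / len(relevant) if relevant else 0.0
    success = 1 if hits > 0 else 0
    return (20 * (precision_at(2) + precision_at(4) + recall + success)
            + 10 * (precision_at(6) + precision_at(20)))
```

Note how a single correct prediction is rewarded heavily through the UserSuccess term, which makes covering many users with at least one hit more valuable than ranking a few users perfectly.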
5.2 Approaches
5.2.1 Temporal observations
Our approach is motivated by two observations of the dataset:
- Users have a strong tendency to re-interact with items that they interacted with in the past. Statistically, on average 2 out of 7 items from the first 14 weeks re-appear in the 15th week's interaction list.
- Factorization models give improved accuracy when we apply decaying weights to interactions that happened a long time ago.
These observations indicate the importance of temporal effects and encourage us to find a set of weights associated with time that characterize the appropriate time-decay effects.
5.2.2 Linear ranking model
Given a user u and an item i, the historical interactions between u and i before time t are represented as M_{i,u,t} ∈ N^{K×T}, where T is the number of time stamps from time 1 to time t, K is the number of types of interactions (e.g., click, bookmark, or reply), and M_{i,u,t}(k, τ) is the number of k-type interactions at time τ ∈ {1, ..., t}. Given a user u, a naïve model ranks each item simply based on its aggregate adoption history, which leads to

S(u, i, t) = Σ_k Σ_{τ=1}^{t} M_{i,u,t}(k, τ)

where S(u, i, t) evaluates how likely user u is to re-interact with item i given their historical interactions.
However, not every historical interaction by a user has the same importance. For example, a user may prefer re-clicking an item from the previous day over one clicked 10 weeks ago. As we conjecture that the importance of user-item interactions depends on the time of interaction, we propose a time-reweighted linear ranking model, given a user u, item i, and a particular time t:

S(u, i, t) = w · M_{u,i,t}^T

where w is the coefficient vector associated with time, with w(k, τ) indicating the relative contribution of k-type interactions at time τ.
Caution is needed when inferring the parameter w, as we face an extremely unbalanced problem: for each user, there are far more unobserved items than observed ones. Moreover, factors beyond temporal ones, such as user-item similarities, also play a significant role. To address this, we carefully construct triplet constraints

T = { (u, i_1, i_2, τ) : u prefers re-interacting with i_1 over i_2 at time τ }

where (u, i_1, i_2, τ) ∈ T only when the following two conditions are satisfied:
- u interacted with both i_1 and i_2 before time τ;
- u interacted with i_1 but not i_2 at time τ.
With the first condition, we assume the user has equal prior preference for the two items, which helps exclude the user-item similarity factor. When the second condition is satisfied, we attribute the preference to temporal effects and fit the ranking model to the constraint S(u, i_1, τ) > S(u, i_2, τ). In particular, we obtain the solution for w by minimizing an objective function that incurs a smoothed hinge loss when such a constraint is violated.
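A toy sub-gradient fit of w on such triplets might look as follows. For simplicity we use a plain hinge with unit margin (the thesis uses a smoothed hinge); each triplet is represented by the two K x T count matrices of its positive and negative items:

```python
def fit_temporal_weights(triplets, K, T, lr=0.1, epochs=50):
    """Learn w(k, t) from triplet constraints S(u, i1) > S(u, i2).
    triplets: list of (M_pos, M_neg) pairs of K x T count matrices."""
    w = [[0.0] * T for _ in range(K)]
    def score(M):
        return sum(w[k][t] * M[k][t] for k in range(K) for t in range(T))
    for _ in range(epochs):
        for M_pos, M_neg in triplets:
            if score(M_pos) - score(M_neg) < 1.0:  # margin violated
                for k in range(K):
                    for t in range(T):
                        w[k][t] += lr * (M_pos[k][t] - M_neg[k][t])
    return w
```

With triplets whose positive item was touched more recently, the learned w favors recent time steps, consistent with the decaying weights reported later.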
Recommend items from the history
Given the observation that users have a strong tendency to re-interact with items they have interacted with in the past, it is plausible to recommend old items to users. Similarly, items that appeared in the "impression" list are also preferred. This motivates us to consider the set of items from past interactions and impressions as a candidate set for each user.
To rank items from this candidate set, we apply the model in Sec. 5.2.2. The set T consists of two types of item pairs seen by a user at time τ:
- both items are in the user's interaction list before τ;
- both items are not in the user's interaction list before τ but are in the impression list.
Item pairs in other cases are not used because temporal factors may not be the major influencing factor.
Temporal reweighting based factor models
In order to recommend new items, i.e., items not seen by a user but seen by others, or newly appearing items, we exploit factorization models.
Our approach starts with the hybrid matrix factorization technique [Kula, 2015]. To briefly review, we model each user/item as the sum of the representations of its associated features and learn a d-dimensional representation for each feature value (together with a 1-dimensional bias). Let x^U_j / x^I_j denote the embedding (i.e., vector of factors) of user/item feature j, q_u / q_i denote the embedding of user u / item i, and b^U_j / b^I_j denote the user/item bias for feature j. Then

q_u = Σ_{j ∈ f_u} x^U_j,   q_i = Σ_{j ∈ f_i} x^I_j,   b_u = Σ_{j ∈ f_u} b^U_j,   b_i = Σ_{j ∈ f_i} b^I_j    (5.1)

The model prediction score for a pair {u, i} is then given by

S(u, i) = q_u · q_i + b_u + b_i    (5.2)
The model is trained by minimizing the sum of a loss between S(u, i) and the observed ground truth t(u, i),

L = Σ_{{u,i} ∈ I} ℓ(S(u, i), t(u, i))    (5.3)

where I is the set of interactions between users and items, and ℓ is chosen to be the Weighted Approximately Ranked Pairwise (WARP) loss [Usunier et al., 2009, Weston et al., 2010], which in our case empirically performs better than other loss functions (e.g., Bayesian Personalized Ranking [Rendle et al., 2009]).
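Eqs. 5.1-5.2 amount to summing feature embeddings and biases before a dot product. A dictionary-based sketch (the parameter containers and feature names are illustrative):

```python
def hmf_score(user_feats, item_feats, x_U, x_I, b_U, b_I):
    """Hybrid MF prediction S(u, i) = q_u . q_i + b_u + b_i, where each
    embedding/bias is the sum over the entity's feature values (Eq. 5.1)."""
    d = len(next(iter(x_U.values())))
    q_u = [sum(x_U[j][k] for j in user_feats) for k in range(d)]
    q_i = [sum(x_I[j][k] for j in item_feats) for k in range(d)]
    b_u = sum(b_U[j] for j in user_feats)
    b_i = sum(b_I[j] for j in item_feats)
    return sum(a * b for a, b in zip(q_u, q_i)) + b_u + b_i
```

Because a user's own identity is treated as just another feature, a cold-start user described only by profile features still receives a meaningful q_u.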
Temporal reweighted matrix factorization
Our approach is grounded in the assumption that the time factor plays an important role in determining a user's future preferences. To this end, we place a non-negative weight associated with time on the loss, which leads to the following equation:

L' = Σ_{{u,i,τ} ∈ I} ℓ(S(u, i), t(u, i, τ)) · ω(τ)    (5.4)

Here the reweighting term ω(τ) depends on the time when the user-item interaction happens, which captures the contribution of interactions over time. Additionally, zero weights ω(τ) reduce the training set size, which speeds up training and can help prevent over-fitting.
In general, ω can be learned jointly with the other embedding parameters in the model. However, here we fix ω to the learned weights w from the model in Sec. 5.2.2. In particular, we normalize the weights w and cut them off at small values. The cut-off significantly reduces the amount of training data and thus the training time.
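The normalize-and-cut-off step can be sketched as follows (the cut-off threshold is a hypothetical value; interactions whose weight becomes zero are simply dropped from the training set):

```python
def temporal_weights(w_learned, cutoff=0.05):
    """Normalize learned per-time weights to sum to 1, then zero-out
    weights below the cut-off so the corresponding interactions can be
    removed from training (Eq. 5.4)."""
    total = sum(w_learned)
    return [v / total if v / total >= cutoff else 0.0 for v in w_learned]
```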
5.3 Experiments
We take user-item interaction data from the 26th to the 44th week as training data and validate our model on the 45th week. Submitted results come from models re-trained on data from the 26th to the 45th week under the same hyper-parameters. We observe a very strong correlation between validation and test scores for all our models and thus mostly report validation scores below, due to the submission quota.
Table 5.1: Scores in thousands (K) based on history interactions.
Models Rand TSort TRank
INTS 266 284 299
IMPS 324 375 380
INTS+IMPS 463 509 524
Figure 5.1: Weights learned in the model of Sec. 5.2.2 for interactions (a) and impressions (b). K = 4 for INTS and K = 1 for IMPS, where type 1 denotes user-item impression pairs, and types 2, 3, and 4 denote click, bookmark, and reply, respectively. [Panels: (a) interactions, (b) impressions; x-axis: time (weeks 1-10); y-axis: learned weights.]
Recommend from history
Our model from Section 5.2.2 (TRank) is compared to two baseline models: randomized scores (Rand) and recency-based sorting (TSort), which sorts items by the latest time they appear in the history. The results on historical "interactions" (INTS), "impressions" (IMPS), and their combination (INTS+IMPS) are reported in Table 5.1. TRank clearly outperforms the other two in all three cases. Figure 5.1 shows the learned weights w associated with the designed temporal features. The coefficients decay with time in both Figure 5.1(a) and 5.1(b) across different types of interactions, indicating that more recent interactions have a larger statistical impact on users' future preferences. Furthermore, although recency is important, simply using the latest time performs worse than TRank, which smoothly combines the most recent interactions with historical interactions using the learned weights w.
Table 5.2: Scores (K) and training time (h: hours) for both the hybrid matrix factorization and temporal reweighted matrix factorization models.

                 HMF                  THMF
Fea   d    S_all  S_new  T      S_all  S_new  T
No    16   235    61     8.8    269    65     2.8
      32   301    71     3.4    320    75     1.5
      48   313    78     7.7    326    84     1.7
      64   330    76     3.3    340    86     0.7
Yes   16   311    124    74     361    146    34
      32   326    125    26     381    148    14
      48   354    128    76     378    144    12
Table 5.3: Scores (K) by THMF with some "impressions" as additional observation inputs.

Observations   INTS   INTS+IMPS
S_all          381    438
S_new          148    164
Recommend via matrix factorization
We compare the hybrid matrix factorization model (HMF) and the temporal reweighted model (THMF) with different numbers of latent factors d, without and with features. Two important measures are used: S_all and S_new. S_all is the challenge score and S_new is the score after removing all historical user-item pairs. We found S_new more important in the model ensemble and chose in our experiments to early-stop model training at the best S_new.
As shown in Table 5.2, THMF models achieve significant improvements on S_all and S_new for all d, with and without features. Meanwhile, the time comparison shows that the best models achieved by THMF require significantly less training time.
Finally, we use items in the "impression" list of the last week and treat them as "interactions" (with a 0.01 down-weight). This boosts performance, as seen in Table 5.3.
5.4 Summary
In this chapter, a temporal reweighting approach is developed for a job recommender system. It learns the weights by carefully constructing a set of positive-negative item pairs and fitting a linear ranking model that takes historical temporal counts as input. The method is applied in two models. First, it is used directly to recommend returned items. Second, it is applied in a hybrid matrix factorization model. Experiments on a job recommendation task demonstrate the clear improvements brought by the temporal approach over its non-temporal counterparts in both models. It also shows better time efficiency for the factorization model.
Chapter 6
Incorporating Heterogeneous Attributes in
Recommender Systems
In this chapter, we introduce our techniques for incorporating heterogeneous attributes in recommender systems. We study recommendation tasks where attributes (metadata, profiles) are available. These attributes are often heterogeneous in domain and data type. We propose techniques to embed the attributes in a neural network framework and incorporate them efficiently. The techniques are applied in different models and help achieve significant recommendation accuracy improvements. The learned attribute representations also exhibit the semantic structure of the data.
6.1 Motivation
Attributes (or metadata, profiles) of users and items are becoming increasingly prevalent. For instance, LinkedIn user profiles contain information including user location, education, expertise, work experience, etc. Yelp business attributes contain business names, hours, categories, services, etc. These attributes are often helpful and indicative in analyzing why users choose to interact with items. It is thus important to take these attributes into account in system design.
Content-based approaches [Pazzani and Billsus, 2007] use these attributes to make recommendations. They compute item similarities and recommend similar items based on users' past activities. These approaches are especially attractive when there are very few interactions and when discriminative content is available. While simple similarity measures such as cosine and Pearson similarity sometimes give decent results, it in general remains challenging to find an appropriate similarity metric. Moreover, they do not capture user personal preferences and individual item differences beyond the attributes, which is usually done by collaborative filtering.
To address that, factor models have been adapted to combine content and user activities. [Fang and Si, 2011, Saveski and Mantrach, 2014] incorporate attributes via matrix co-factorization. Instead of only modeling user-item interactions, they jointly model multiple data relations (e.g., interactions and attribute relations) by minimizing a sum of several matrices' losses. Latent factors are associated with both identities and attributes, and are jointly inferred from the data. [Shmueli et al., 2012, Kula, 2015] represent a user or an item as a linear combination of its identity and attributes; the model is trained by factorizing only the user-item interaction matrix so that the latent factors of attributes also help to explain the feedback. Recently, more flexible models (e.g., topic models and pre-trained neural networks) have been used to model attributes of text [Bansal et al., 2016, Kim et al., 2016], vision [He and McAuley, 2015], and music [Van den Oord et al., 2013]. The learned embeddings from these flexible models are then fixed and used in a downstream matrix factorization recommender system.
Figure 6.1: An illustrative example in which a job seeker interacts with a sequence of job posts. Rich heterogeneous attributes are associated with both the job seeker and the job posts. The system is asked to recommend new posts to this user.
However, these methods do not address the challenge of attribute heterogeneity very well.
Attributes are often heterogeneous, coming from different domains and in different data types. For example (as in Fig. 6.1), attributes may include a user's age, location, and education, and a job's industry, title description, employment type, and online tags; their data types include real numbers, categorical tokens, text tokens, etc. To develop efficient approaches to incorporating attributes, we investigate three major challenges in real scenarios: variable lengths, sparseness, and sequential dependency. Variable lengths: due to the nature of different attributes, users or items may have attribute lists of different lengths. A user may have academic degrees in more than one discipline; a job post may vary in its number of tags; missing values are very common. It is inefficient to simply combine the observed attributes. Sparseness: the possible attributes of items include tags from the Internet, text tokens, demographic information, etc. The entire attribute
60
vocabulary can be very large, but each attribute appears only a limited number of times. To add diculty,
a large part of the attributes may be irrelevant to our task of interest. These factors pose further challenges
to model regularization. Sequential dependency: In cases where a user interacts with a sequence of items
(or services), attributes may help encode user sequential behaviors. As evidence, attributes of nearby
items in a sequence may overlap or share common characteristics. For example as in Fig.6.1, job post
attributes collectively suggest the user’s current interest in a junior-level position in an IT-related area. It
is desirable to make use of such attribute dependencies.
Existing methods have difficulty addressing these challenges. For example, matrix co-factorization algorithms [Fang and Si, 2011, Saveski and Mantrach, 2014] simultaneously model multiple relations, but some relations are noisy and not relevant to the final task; it is also hard to scale when there are many types of attributes. Hybrid matrix factorization [Shmueli et al., 2012, Kula, 2015] learns task-driven embeddings but does not deal well with variable attribute lengths. These methods are also specifically designed for matrix factorization models, which are limited in their expressiveness and flexibility. Topic model and pre-trained neural network approaches [Bansal et al., 2016, Kim et al., 2016, He and McAuley, 2015, Van den Oord et al., 2013] typically focus on a single type of attribute. Meanwhile, they suffer from domain adaptation issues as they are not learned end-to-end.
In this chapter we seek to design a generic way to incorporate heterogeneous attributes and address the aforementioned challenges. Particularly, we would like the approach to satisfy the following requirements: 1) it automatically infers attributes in the context of the given recommendation task; 2) it deals with variable attribute lengths; 3) it can be employed in both simple matrix factorization models as well as more flexible models; 4) it helps improve recommendation accuracy.
6.2 Approaches
In this section we introduce our proposed model, HA-RNN. After defining the problem formulation and notation, we first present an existing RNN-based approach to item recommendation. Then we describe our proposed model, which combines sequence modeling and attribute embedding and deals with the heterogeneity challenges.
6.2.1 Problem formulation and notation
We are given a user set U, an item set I, and their interactions S = {(u, i, t) | u ∈ U, i ∈ I}, where t records the time at which interaction (u, i) takes place. Let A^U and A^I denote the attribute sets associated with users and items.
Figure 6.2: Overview of the proposed HA-RNN model. The base recommendation model is a recurrent neural network where the item sequence (with the user) is fed as input and the model is trained to predict the next item. The representations for the input (Q^U and Q^I) and for computing item scores in the output layer (Q^I) are combinations of identity and attributes. To compute Q^I, embeddings of the multi-hot attributes (e.g., Φ^{M1}) are first averaged before being combined with categorical ones (e.g., Φ^{C1}, Φ^{C2}; identity is regarded as a categorical attribute in the figure) and numerical ones (e.g., Φ^{N1}). The same item representation is shared in both the input and output layers for enhanced signals and model regularization. The computation of Q^U is omitted in the figure.
Parameters associated with users, items, user attributes, and item attributes are denoted by e^U ∈ ℝ^{|U|×d}, e^I ∈ ℝ^{|I|×d}, Φ^U ∈ ℝ^{|A^U|×d}, and Φ^I ∈ ℝ^{|A^I|×d}, respectively, given the model dimensionality d. Subscripting means taking one corresponding slice of the matrix. For example, e^I_i ∈ ℝ^d denotes a vector representation of item i; Φ^U_a ∈ ℝ^d (a ∈ A^U) denotes the embedding of one user attribute a. The superscript U or I is omitted when there is no ambiguity. In the context of a sequence, integer subscripts denote sequence positions.
A recommender system needs to return a scoring function g such that s(i) = g(i | u, U, I, S, A^U, A^I) captures user-item preferences, where an item preferred by a user over another receives a higher score. The recommendation then takes the highest-scored item (as shown in Eq. 6.1):

    ŷ = argmax_{i ∈ I} s(i)    (6.1)
In model training the scores are compared to the ground truth observations. Different loss functions have been developed [Hu et al., 2008, Rendle et al., 2009, Weston et al., 2010]. In this work we use the cross-entropy classification loss as in [Grbovic et al., 2015, Hidasi et al., 2015].
6.2.2 Identity embedding via sequential recommendation
We begin with a sequence recommendation approach. First, items seen by a user u are sorted chronologically as a sequence (u: i_1, i_2, ...). Then, as in a sequence generation problem, an RNN-based model learns identity embeddings of users and items by fitting the model to predict the next appearing item given all the observed ones. The model formulation can be written as follows:

    h_n = f(q_{n-1}, h_{n-1})   ∀n = 1, ..., t    (6.2)
    s_n(i_n) = h_n^T w_{i_n}    (6.3)

where f is an RNN cell (e.g., LSTM, GRU); s_n(i_n) is computed from the inner product between the latent state h_n and a column of the item matrix w ∈ ℝ^{d×|I|}. A special "START" symbol is used as the very first item, i.e., q_0, and h_0 = 0.

In this work we focus on q and w. As the input layer and the output layer of the network, they connect sequence models with data observations. In [Hidasi et al., 2015], the input vector q simply takes the item embedding,

    q_n = e_{i_n},    (6.4)

and w is a set of item parameters separate from e^I. While the model is able to capture sequential dependencies between items, it does not use attributes at all, as neither q nor w involves attributes.
6.2.3 Joint attribute embedding
We first extend the model to incorporate attributes in the input layer, so that both item identity and attributes contribute to the input representation. Particularly,

    q_i = e_i + Σ_{j ∈ attr(i)} Φ_j    (6.5)

where attr(i) returns the set of attributes of item i. When user attributes are also available, we similarly define q_u and thus have q_n = q_u + q_{i_n}.

This model, however, is inefficient in practice. It suffers from the aforementioned challenge of variable attribute lengths. For example, when an item has more attributes than others, the second summand in Eq. 6.5 tends to have a larger magnitude, and the estimation becomes harder. As another example, if an item has more attributes of one type but fewer of another, the mismatch is harmful when the model tries to compare the representations of two items.

Table 6.1: An attribute division example: how the attributes in Figure 6.1 are divided into three kinds.

    Divisions     Examples
    Categorical   industry (Online media), location (Munich), ...
    Multi-hot     degrees (math, cs), tags (NLP, Spark, c++), ...
    Numerical     age (30)
6.2.4 A hierarchical attribute combination
To deal with this challenge, we propose a division of heterogeneous attributes into three distinct kinds: categorical (C), multi-hot (M), and numerical (N). We illustrate the division with an example in Table 6.1 and defer the detailed definition to the appendix. We point out that multi-hot attributes, which can have more than one value per attribute, and missing values are the causes of variable attribute lengths. Missing values can be replaced by "unk" as special attribute tokens.

With this division and a slight abuse of the notation C, M, N, we design a hierarchical way to combine the attributes and obtain the representation as follows:

    q_i = e_i + Σ_{j=1}^{n_C} Φ_{C_j(i)} + Σ_{j=1}^{n_M} (1/|M_j(i)|) Σ_{k ∈ M_j(i)} Φ_k + Σ_{j=1}^{n_N} Φ_{N_j(i)}    (6.6)

where C_j(i) (or M_j(i), N_j(i)) returns item i's j-th categorical (or multi-hot, numerical) attribute(s). Numerical attribute values are first clustered and then replaced by their cluster center indices; they are then treated as categorical values. n_C (or n_M, n_N) denotes the total number of attribute types belonging to C (or M, N). q_u is computed similarly.

Compared to (6.5), multi-hot attribute embeddings are no longer simply summed together. Rather, the mean of the embeddings within each multi-hot attribute type is computed before the embeddings across types are combined. In this way, the values of one multi-hot attribute are regarded as a "single attribute", and we have better control of the scale of the input vector q.
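The hierarchical combination of Eq. 6.6 can be sketched as below. The attribute vocabulary, attribute names, and embedding table are made-up toy values for illustration; the point is that categorical and (pre-clustered) numerical embeddings are summed directly, while the values of each multi-hot attribute type are averaged before being combined across types.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # toy embedding dimensionality

# Toy attribute vocabulary; Phi maps each attribute value to its embedding
vocab = ["ind:Internet", "loc:Berlin", "tag:NLP", "tag:Spark", "tag:c++",
         "age_cluster:30s"]
Phi = {a: rng.normal(scale=0.1, size=d) for a in vocab}
e_i = rng.normal(scale=0.1, size=d)          # identity embedding e_i

# One item's attributes, divided into the three kinds of Table 6.1
categorical = ["ind:Internet", "loc:Berlin"]          # C_j(i)
multi_hot   = [["tag:NLP", "tag:Spark", "tag:c++"]]   # M_j(i): one type, 3 values
numerical   = ["age_cluster:30s"]                     # N_j(i), already clustered

def item_representation(e_i, categorical, multi_hot, numerical):
    """q_i per Eq. 6.6: sum categorical/numerical embeddings, but average
    within each multi-hot attribute type before combining across types."""
    q = e_i.copy()
    for a in categorical:
        q += Phi[a]
    for values in multi_hot:      # mean within the type: within-type normalization
        q += np.mean([Phi[a] for a in values], axis=0)
    for a in numerical:
        q += Phi[a]
    return q

q_i = item_representation(e_i, categorical, multi_hot, numerical)
```

Note that q_i keeps the same scale whether an item has one tag or twenty, which is exactly the variable-length control the text argues for.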
6.2.5 Shared attribute embedding in output layer
With (6.2), (6.3), and (6.6), the RNN embeds attributes as part of the model input. However, as in most conventional RNNs, attributes are not involved in the output prediction stage. To address the attribute sparseness challenge, we explore incorporating attributes in the output layer in addition to the input layer. Intuitively, the use of attributes in these two network components enhances attribute signals in model training.

We extend the output layer to involve attribute embeddings in computing item scores. Particularly, we discard w and compute the score for item i by

    s(i) = h^T e'_i + Σ_{j=1}^{n_C} h^T Φ'_{C_j(i)} + Σ_{j=1}^{n_M} pool_{k ∈ M_j(i)} (h^T Φ'_k) + Σ_{j=1}^{n_N} h^T Φ'_{N_j(i)},   ∀i ∈ I    (6.7)

where pool ∈ {mean, max}. First, attribute preference scores are computed by the dot product between the model latent vector h and the attribute vectors. Then the attribute scores within each multi-hot type take their mean or max. Intuitively, mean-pooling suggests that each attribute value contributes equally, while max-pooling suggests that one particularly favorable attribute value dominates. Finally, the scores across different attribute types are summed (or averaged) to produce item scores.

Note that item identity and attributes now appear in both the input and output layers of the model. A natural question is whether or not we should use separate embedding parameters for the two components. Different from common practice such as in Word2Vec, we let the input and output layers share the same embedding parameters, i.e.,

    e' = e,   Φ' = Φ.    (6.8)

This reduces the number of total parameters and, more importantly, adds additional constraints to the embedding parameters. We expect this model regularization to help the model achieve better generalization.
HA-RNN. With (6.2), (6.3), (6.6), (6.7), and (6.8), we have our complete model, HA-RNN. As shown in Figure 6.2, attributes and identities are coupled in the sequence model through both the input and output layers. Given an item sequence, the model first looks up and combines the attribute embeddings and then feeds the representation into the network. Attribute parameters are updated together with network parameters via back-propagation. Compared to the vanilla sequential recommendation model, HA-RNN treats the union of identity and attributes as a sequence element and tries to capture sequential dependencies in both identities and attributes. Compared to simple attribute averaging, HA-RNN carefully designs the input and output layers to deal with the heterogeneity challenges.
6.3 Related work
Attribute incorporation. Low-rank factorization models have been extended to incorporate attributes in recommendation. Matrix co-factorization (MCF) [Fang and Si, 2011, Saveski and Mantrach, 2014] minimizes a sum of several matrices' losses to capture multiple data relations. Hybrid matrix factorization (HMF) [Shmueli et al., 2012, Kula, 2015] uses a linear combination of attribute embeddings to represent a user or item and then factorizes only one interaction matrix. Tensor factorization [Karatzoglou et al., 2010, Bhargava et al., 2015] models attributes as part of tensors. These approaches are based on bi-linear models and are often limited in their expressiveness and flexibility.
Recently, more flexible models (e.g., topic models and pre-trained neural networks) have been used to model attributes of text [Bansal et al., 2016], vision [He and McAuley, 2015], and music [Van den Oord et al., 2013]. The learned embeddings from these flexible models are then fixed and used in a downstream matrix factorization recommender system. In comparison, our approach targets heterogeneous types of attributes; meanwhile, it learns attribute embeddings in a task-driven fashion to avoid domain adaptation issues.
Graph-based algorithms like graph random walks are also used to learn attributes [Christoffel et al., 2015, Levin et al., 2016]. They construct graph nodes as identities and attributes, and graph edges as events and attribute relations. Attributes and identities are then embedded in the same space.
Sequence modeling. Sequence models have proved useful in recommender systems. Low-rank factorization has been effectively combined with Markov chains [Rendle et al., 2010, Cheng et al., 2013], Hidden Semi-Markov models [Kapoor et al., 2015], and Hawkes processes [Du et al., 2015].
Recently, Word2Vec training techniques have been tailored to recommendation models, showing scalability and efficiency in e-commerce product recommendation [Grbovic et al., 2015]. [Vasile et al., 2016] extends [Grbovic et al., 2015] to incorporate metadata in a multitask learning fashion and shows improved performance on a music dataset.
[Hidasi et al., 2015] first applies RNNs to session-based recommendation. It devises GRU-based RNNs and demonstrates good performance with one-hot-encoded item embeddings. [Hidasi et al., 2016] extends [Hidasi et al., 2015] by building a parallel structure to take in extracted visual features in the input layer. [Liu et al., 2016] extends recurrent neural networks to context-aware recommendation by introducing context-dependent network components.
6.4 Experiments
This section discusses how we conduct extensive experiments to validate the effectiveness of our approach. First, we introduce the experimental setup. Then we present recommendation accuracy results comparing different variants of HA-RNN, and compare these with other state-of-the-art models. This is followed by a qualitative analysis of the learned attribute embeddings. Finally, we review our studies on how the sequence component helps recommendation.
6.4.1 Setup
We validate our approach on two large-scale datasets with rich heterogeneous attributes: 1) job recommendation from XING¹; 2) business reviews at Yelp². We describe the datasets in detail below.

XING contains about 12 weeks of interaction data between the users and the items on XING. Train/test splitting follows the RecSys Challenge 2016 [Abel et al., 2016] setup, where the interactions for a set of 150,000 target users in weeks 13 and 14 are used as test data. Rich attributes are associated with users and items, such as career levels, disciplines, locations, and job descriptions. Our task is to recommend to users a list of job posts that they are likely to interact with.

Yelp comes from the Yelp Dataset Challenge. We work on recommending which business a user might want to review. Following the online protocol of [He et al., 2016b], we sort all interactions in chronological order and take the last 10% for testing and the rest for training. Business items have attributes including city, state, categories, hours, and attributes (e.g., "valet parking," "good for kids").

More detailed information about the attributes is listed in Sec. 2.3.
Evaluation

We assess the quality of the recommendation results by comparing the models' recommendations to the ground truth interactions, and report precision, recall, MAP, and normalized discounted cumulative gain (NDCG). For precision and recall, we report results at positions 2, 10, and 30 (P@2: Prec@2,
¹ www.xing.com
² https://www.yelp.com/dataset challenge. Downloaded in Feb 17.
Table 6.2: Recommendation accuracy and training time comparisons between MIX and HET on different models.
Attr. Time P@2 P@10 P@30 R@2 R@10 R@30 MAP NDCG
XING
MIX >144h 0.0452 0.0221 0.0122 0.0396 0.0891 0.1394 0.0486 0.0806
HET 70h 0.0522 0.0241 0.0128 0.0451 0.0962 0.1453 0.0540 0.0873
Yelp
MIX >144h 0.0152 0.0119 0.0090 0.0052 0.0204 0.0469 0.0093 0.0265
HET 82h 0.0205 0.0154 0.0117 0.0080 0.0281 0.0638 0.0131 0.0357
Table 6.3: Recommendation accuracy comparisons of different attribute embedding integrations.
Methods P@2 P@10 P@30 R@2 R@10 R@30 MAP NDCG
XING
no attributes 0.0406 0.0169 0.0091 0.0335 0.0597 0.0908 0.0365 0.0587
only input 0.0499 0.0233 0.0126 0.0431 0.0924 0.1414 0.0516 0.0842
only output 0.0515 0.0236 0.0127 0.0445 0.0936 0.1427 0.0530 0.0858
input+output 0.0522 0.0241 0.0128 0.0451 0.0962 0.1453 0.0540 0.0873
Yelp
no attributes 0.0183 0.0138 0.0104 0.0069 0.0248 0.0551 0.0113 0.0313
only input 0.0175 0.0138 0.0106 0.0063 0.0248 0.0560 0.0111 0.0314
only output 0.0180 0.0138 0.0104 0.0067 0.0250 0.0557 0.0112 0.0313
input+output 0.0205 0.0154 0.0117 0.0080 0.0281 0.0638 0.0131 0.0357
P@10: Prec@10, P@30: Prec@30, R@2: Rec@2, R@10: Rec@10, R@30: Rec@30). We report MAP and NDCG at position 30. Note that we report scores after removing historical items from each user's recommendation list on the Yelp dataset, because in these scenarios (Yelp reviews) users seldom re-interact with items. This improves performance for all models but does not change the relative comparison results.
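For concreteness, the reported metrics can be sketched under binary relevance as below. These are the standard textbook definitions, so the thesis's actual evaluation code may differ in details such as tie-breaking; the ranked list and ground truth are made-up examples.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are in the ground truth."""
    return len(set(ranked[:k]) & set(relevant)) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of ground-truth items recovered in the top-k."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG with 1/log2(rank+1) gains, normalized
    by the ideal DCG (all relevant items ranked first)."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(r + 2)
              for r, item in enumerate(ranked[:k]) if item in rel)
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(len(rel), k)))
    return dcg / ideal

ranked = [5, 2, 9, 1, 7]       # a model's ranked recommendation list (toy)
relevant = [2, 7]              # ground-truth interactions (toy)
p2 = precision_at_k(ranked, relevant, 2)   # one hit (item 2) in top-2 -> 0.5
```

MAP is computed analogously by averaging per-user average precision over all test users.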
Models
We compare our models with models that show varying abilities in incorporating attributes and utilizing sequences in item recommendation. The baseline non-sequence models POP, b-BPR, and WARP [Weston et al., 2010] and the sequence models CBOW-rec [Grbovic et al., 2015] and RNN [Hidasi et al., 2015] do not incorporate attributes. The non-sequence model A-WARP [Kula, 2015] incorporates attributes. We also experimented with different variants of our proposed HA-RNN to investigate the roles of different model components. We detail the models and hyper-parameter tuning below. Hyper-parameter tuning and early stopping are done on a development dataset split from the training data for all models.
POP. A naive baseline model that recommends items in terms of their popularity.

b-BPR. A batch version of BPR [Rendle et al., 2009]. We found that vanilla BPR works very poorly here due to the difficulty of sampling items (very large item set). b-BPR uses the same logistic loss but updates a target item and a batch of negative items at every step.

WARP [Weston et al., 2010, Hong et al., 2013]. One state-of-the-art algorithm for item recommendation.

CBOW-rec [Grbovic et al., 2015]. A sequence recommendation approach based on Word2Vec techniques. We also experimented with its alternative approach based on Skip-gram; CBOW-rec works better, so we omit those results for brevity.

RNN. An RNN-based sequence approach [Hidasi et al., 2015].

A-WARP. A factorization model that represents users and items as linear combinations of their attribute embeddings [Shmueli et al., 2012].

NHMF. Our own implementation of HMF. It uses cross-entropy as the training loss instead of the rank loss used in A-WARP.

We use LightFM [Kula, 2015] for BPR, WARP, and A-WARP, and implement the remaining models ourselves. AdaGrad is used as the optimizer. Parameter tuning: the model dimension d is tuned in {10, 16, 32, 48, 64}, except for RNNs in {64, 128, 256}; the dropout rate is 0.5; the window size is 5 for CBOW-rec and Skip-gram.
6.4.2 Recommendation accuracy
Effectiveness of multi-hot treatment

To begin, we investigate the effectiveness of the multi-hot treatment in our attribute embedding. We call our combination (6.6) HET, and compare it to (6.5) (which we call MIX), which simply takes the average embedding of all attribute values. Recommendation accuracies and training times are reported in Table 6.2.

In terms of recommendation accuracy, our multi-hot treatment HET brings significant improvement to HA-RNN. On both datasets, HET significantly outperforms MIX. Particularly on Yelp, the improvement is dramatic (41% relative gain on MAP and 34% on NDCG). This may be due to the fact that variable attribute lengths are really harmful to flexible models such as RNNs. HET deals with this heterogeneity and provides within-category normalization, which turns out to be critical for model regularization.

In terms of training time, we see that it takes much longer to train HA-RNN with MIX than with HET. This is expected, as the MIX model has to additionally adjust the embedding scales while HET does not. We stopped the training process of HA-RNN-MIX after 6 days.
Effectiveness of the output layer

To evaluate the effectiveness of the output attribute embedding layer, we explore four variants of LSTMs-rec: 1) 'no attributes': no attributes, but identity is used; this degenerates to a recommendation model based on standard RNNs as in [Hidasi et al., 2015]; 2) 'only input': attributes used at the input layer; 3) 'only output': attributes used at the output layer; 4) 'input+output': attributes used at both the input and output layers. Experiment results are reported in Table 6.3, from which we make two observations.

First, LSTMs-rec performs poorly on both datasets when no attributes are used ('no attributes'). On XING, the result is even worse than that of the non-sequence model NHMF (we omit the result here due to space limits). On the contrary, with attribute embedding ('input+output'), the LSTMs-rec scores are significantly boosted. These results suggest that sequence modeling alone is not good enough, and attributes do help to improve accuracy.

Second, the output layer design for attribute embedding turns out to be very important. On XING, incorporating attributes only at the output layer ('only output') gives better results than only at the input layer ('only input'). Moreover, it works best to learn attributes at both layers ('input+output'). On Yelp, embedding attributes at either the input layer or the output layer alone seems hardly helpful; however, by embedding attributes at both layers we manage to significantly improve the scores. This verifies our assumption that the output layer embedding brings additional regularization.

We also empirically verified the effectiveness of sharing attribute parameters across the input and output layers. We omit the results due to space limits.
Performance comparisons
We compare HA-RNN with other state-of-the-art models and report the results in Table 6.4. We interpret the table as follows. First, attribute embedding is vital on XING: the scores of A-WARP and LSTMs-rec are significantly better than those of b-BPR, WARP, CBOW-rec, and RNN. Recommendation does benefit from sequence approaches, as RNN performs better than WARP and LSTMs-rec does better than A-WARP. Second, the sequence approach is critical on Yelp, where CBOW-rec, RNN, and HA-RNN clearly beat WARP and A-WARP. Finally, the best accuracy is achieved when HA-RNN combines both attribute embedding and sequence modeling. The improvements are significant. On the XING dataset, the HA-RNN single model relatively improves Precision@2, Recall@30, MAP, and NDCG by 29%, 25%, 36%, and 30%; on the Yelp dataset, the improvements are 12%, 16%, 16%, and 14%.

Table 6.4: Recommendation accuracy comparisons with other state-of-the-art models. Best and second best single model scores are in bold and italic, respectively.

Methods P@2 P@10 P@30 R@2 R@10 R@30 MAP NDCG
XING
POP 0.0077 0.0034 0.0021 0.0079 0.0154 0.0274 0.0062 0.0127
b-BPR 0.0159 0.0111 0.0083 0.0133 0.0423 0.0920 0.0197 0.0420
WARP 0.0368 0.0179 0.0092 0.0309 0.0608 0.0832 0.0346 0.0557
CBOW 0.0367 0.0164 0.0088 0.0298 0.0571 0.0870 0.0329 0.0543
RNN 0.0406 0.0169 0.0091 0.0335 0.0597 0.0908 0.0365 0.0587
A-WARP 0.0362 0.0194 0.0109 0.0325 0.0749 0.1163 0.0398 0.0674
HA-RNN 0.0522 0.0241 0.0128 0.0451 0.0962 0.1453 0.0540 0.0873
HA-RNN* 0.0537 0.0252 0.0134 0.0459 0.0995 0.1502 0.0555 0.0900
Yelp
POP 0.0023 0.0022 0.0018 0.0008 0.0039 0.0092 0.0017 0.0051
b-BPR 0.0097 0.0082 0.0067 0.0033 0.0139 0.0342 0.0062 0.0188
WARP 0.0139 0.0112 0.0088 0.0047 0.0184 0.0437 0.0084 0.0247
CBOW 0.0165 0.0125 0.0097 0.0059 0.0219 0.0499 0.0100 0.0283
RNN 0.0183 0.0138 0.0104 0.0069 0.0248 0.0551 0.0113 0.0313
A-WARP 0.0142 0.0117 0.0098 0.0046 0.0193 0.0430 0.0087 0.0249
HA-RNN 0.0205 0.0154 0.0117 0.0080 0.0281 0.0638 0.0131 0.0357
HA-RNN* 0.0221 0.0164 0.0123 0.0083 0.0301 0.0665 0.0140 0.0377
6.4.3 A qualitative analysis of learned embedding
We give a preliminary analysis of the embeddings learned by LSTMs-rec by plotting a 2-D visualization of part of the learned attribute embeddings (see Fig. 6.3). On XING, we observed an interesting hierarchical clustering effect among the categorical attribute embeddings. First, embeddings of the same attribute type tend to cluster. For example, user career levels such as student/intern, manager, executive, etc., form a cluster; executive and senior executive understandably stand a bit farther from the rest. Second, embedding clusters across types tend to stay close when they have close semantic meanings. For example, (anonymized) attributes of the types industry and discipline are closely related and thus lie at close distances. This is encouraging, since we started from heterogeneous attributes and managed to infer their semantic structures.

On Yelp, we do not observe a similar clustering structure across attributes. However, we see that the embeddings reflect the correlations of business types in the reviews. The top 30 nearest neighbors of juice bars & smoothies (distances are computed in the embedding space) are highlighted in a PCA 2-D visualization: macarons, cardio classes, meditation centers, cafes, wine tasting room, etc. This suggests that people who review a business of type juice bars & smoothies are likely to also review businesses of type cardio classes or wine tasting room. This information might be used in cases like recommending a new business to a Yelp customer. Some other examples (including from XING) are listed in Table 6.5.
Figure 6.3: 2-D t-SNE visualization of the embeddings of XING's categorical attributes.
Table 6.5: Examples of attributes and their nearest neighbors in the embedding space (from Yelp and XING).

    Attribute            Nearest Neighbors
    korean               vietnamese, taiwanese, japanese, ...
    children's clothing  baby gear & furniture, children's museums, maternity wear, cosmetics & beauty supply, ...
    bachelor             master, phd, student/intern, industry6, ...
    exp<1                exp1-3, exp3-5, exp5-10, exp10-15, ...
6.5 Summary
In this chapter we explore the effectiveness of combining heterogeneous attribute embedding and sequence modeling in recommendation with implicit feedback. To this end, we build a neural network framework to incorporate attributes and apply sequence models to embed attributes and to perform recommendation. Through empirical studies on two large-scale datasets, we find that rich discriminative information is encoded in heterogeneous attributes and item sequences. By combining attribute embedding and flexible sequence models, we are able to capture this information and improve recommendation performance. Qualitative analysis shows that the learned attribute embeddings capture semantic structures in the data.
Chapter 7
Learn to Combine Modalities in Multimodal Deep
Learning
Combining complementary information from multiple modalities is intuitively appealing for improving the performance of learning-based approaches. However, it is challenging to fully leverage different modalities due to practical challenges such as varying levels of noise and conflicts between modalities. Existing methods do not adopt a joint approach to capturing synergies between the modalities while simultaneously filtering noise and resolving conflicts on a per-sample basis. In this chapter we propose a novel deep neural network based technique that multiplicatively combines information from different source modalities. Thus the model training process automatically focuses on information from the more reliable modalities while reducing emphasis on the less reliable ones. Furthermore, we propose an extension that multiplicatively combines not only the single-source modalities but also a set of mixed source modalities, to better capture cross-modal signal correlations. We demonstrate the effectiveness of our proposed technique by presenting empirical results on three multimodal classification tasks from different domains. The results show consistent accuracy improvements on all three tasks.
7.1 Introduction
Signals from different modalities often carry complementary information about different aspects of the object, event, or activity of interest. Therefore, learning-based methods that combine information from multiple modalities are, in principle, capable of more robust inference. For example, a person's visual appearance and the type of language they use both carry information about their age. In the context of user profiling in a social network, it helps to predict users' gender and age by modeling both users' profile pictures and their posts. A natural generalization of this idea is to aggregate signals from all available modalities and build learning models on top of the aggregated information, ideally allowing the learning technique to figure out the relative emphasis to be placed on different modalities for a specific task. This idea is ubiquitous in existing multimodal techniques, including early and late fusion [Snoek et al., 2005, Gunes and Piccardi, 2005], hybrid fusion [Atrey et al., 2010], model ensembles [Dietterich, 2000], and, more recently, joint training methods based on deep neural networks [Ngiam et al., 2011, Wöllmer et al., 2010, Neverova et al., 2016]. In these methods, features (or intermediate features) are put together and jointly modeled to make a decision. We call them additive approaches due to the type of aggregation operation. Intuitively, they are able to gather useful information and make predictions collectively.
However, it is practically challenging to learn to combine different modalities. Given multiple input modalities, artifacts such as noise may be a function of the sample as well as the modality; for example, a clear, high-resolution photo may lead to a more confident estimation of age than a lower-quality photo. Also, either signal noise or classifier vulnerabilities may result in decisions that lead to conflicts between modalities. For instance, in the example of user profiling, some users' gender and age can be accurately predicted from a clear profile photo, while others with a noisy or otherwise unhelpful (e.g., cartoon) profile photo may instead have the most relevant information encoded in their social network engagement, such as posts and friend interactions. In such a scenario, we refer to the affected modality, in this case the image modality, as a weak modality. We emphasize that this weakness can be sample-dependent, and is thus not easily controlled with global bias parameters. An ideal algorithm should be robust to the noise from those weak modalities and pick out the relevant information from the strong modalities on a per-sample basis, while at the same time capturing the possible complementarity among modalities.

We point out that the existing additive approaches do not fully address the challenges mentioned earlier. Their basic underlying assumptions are 1) every modality is always potentially useful and should be aggregated, and 2) the models (e.g., a neural network) on top of aggregated features can be trained well enough to recover the complex function mapping to a desired output. While in theory the second assumption should hold, i.e., the learning models should be able to determine the quality of each modality per sample if given a sufficiently large amount of data, they are, in practice, difficult to train and regularize due to the finiteness of the available data.
In this work, we propose a new multiplicative multimodal method which explicitly models the fact
that on any particular sample not all modalities may be equally useful. The method first makes decisions
on each modality independently. The multimodal combination is then done in a differentiable and
multiplicative fashion. This multiplicative combination suppresses the cost associated with the weak
modalities and encourages the discovery of truly important patterns from the informative modalities. In this
way, on a particular sample, inferences from weak modalities get suppressed in the final output, but,
perhaps even more importantly, they are not forced to generate a correct prediction (from noise!) during
training. This accommodation of weak modalities helps to reduce model overfitting, especially to noise.
As a consequence, the method effectively achieves an automatic selection of the more reliable modalities
and ignores the less reliable ones. The method is also end-to-end and enables jointly training model
components on different modalities.
Furthermore, we extend the multiplicative method with the ideas of additive approaches to increase
model capacity. The motivation is that certain unknown mixtures of modalities may be more useful than
any single modality. The new method first creates different mixtures of modalities as candidates,
each of which makes a decision independently. The multiplicative combination then automatically selects the more
appropriate candidates. In this way, the selection operates on "modality mixtures" instead of just single
modalities. This mixture-based approach enables structured discovery of the possible correlations and
complementarity across modalities and increases the model capacity in the first step. A similar selection
process applied in the second step ignores irrelevant and/or redundant modality mixtures. This again helps
control model complexity and avoid excessive overfitting.
We validate our approach on classification tasks on three datasets from different domains: image
recognition, physical process classification, and user profiling. Each task provides more than one modality
as input. Our methods consistently outperform existing state-of-the-art multimodal methods.
In summary, the key contributions of this chapter are as follows:
- The multimodal classification problem is considered with a focus on addressing the challenge of weak modalities.
- A novel deep learning combination method that automatically selects strong modalities per sample and ignores weak modalities is proposed and experimentally evaluated. The method works with different neural network architectures and is jointly trained in an end-to-end fashion.
- A novel method to automatically select mixtures of modalities is presented and evaluated. This method increases model capacity to capture possible correlations and complementarity across modalities.
- Experimental evaluations on three real-world datasets from different domains show that the new methods consistently outperform existing multimodal methods.
Figure 7.1: Illustration of different deep neural network based multimodal methods. (a) A gender
prediction example with text (a fake userid) and dense feature (fake profile information)
modality inputs. (b) Additive combination methods train neural networks on top of aggregated
signals from different modalities; equal errors are back-propagated to the different modality models.
(c) Multiplicative combination selects a decision from a more reliable modality; errors back-propagated to
the weaker modality are suppressed. (d) Multiplicative modality mixture combination first additively
creates mixture candidates and then selects useful modality mixtures with the multiplicative
combination procedure.
7.2 Background
To set the context for our work, we now describe two popular existing types of multimodal approaches.
We begin with notation, then describe traditional approaches, followed by existing deep learning approaches.
Notation We use $M$ to denote the total number of available modalities. We denote each input
modality/signal as a dense vector $v_m \in \mathbb{R}^{d_m}$, $\forall m = 1, 2, \ldots, M$. For example, given $M = 3$ modalities in the
user profiling task, $v_1$ is the profile image represented as a vector, $v_2$ is the posted text representation, and
$v_3$ encodes the friend network information. We consider a $K$-way classification setting where $y$ denotes
the labels. $p_m^k$ denotes the prediction probability of the $k$th class from the $m$th modality, and $p^k$ denotes
the model's final prediction probability of the $k$th class. Throughout the paper, superscripts are used as
indices for classes and subscripts for modalities.
7.2.1 Traditional approaches
Early Fusion Early fusion methods create a joint representation of input features from multiple modalities.
Next, a single model is trained to learn the correlations and interactions between the low-level features of
each modality. We denote the single model as $h$. The final prediction can be written as
$$p = h([v_1; \ldots; v_M]), \qquad (7.1)$$
where we use concatenation as a commonly seen example of jointly representing modal features.
Early fusion can be seen as an initial attempt to perform multimodal learning. The training pipeline
is simple, as only one model is involved. It usually requires the features from different modalities to be
highly engineered and preprocessed so that they align well or are similar in their semantics. Furthermore,
it uses one single model to make predictions, which assumes that the model is well suited for all the
modalities.
Late Fusion Late fusion uses unimodal decision values and fuses them with a fusion mechanism $F$ (such
as averaging [Shutova et al., 2016], voting [Morvant et al., 2014], or a learned model [Glodek et al., 2011,
Ramirez et al., 2011]). Suppose model $h_i$ is used on modality $i$ ($i = 1, \ldots, M$); the final prediction is
$$p = F(h_1(v_1), \ldots, h_M(v_M)). \qquad (7.2)$$
Late fusion allows the use of different models on different modalities, thus allowing more flexibility.
It is easier to handle a missing modality, as the predictions are made separately. However, because late
fusion operates on inferences and not the raw inputs, it is not effective at modeling signal-level interactions
between modalities.
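As a concrete illustration, the two traditional schemes above can be sketched in a few lines of NumPy. This is a toy sketch with hypothetical feature vectors and two-class probabilities, not the implementation used in this chapter: early fusion concatenates raw feature vectors before a single model sees them (Eq. 7.1), while late fusion combines per-modality decisions, here with averaging as the mechanism $F$ (Eq. 7.2).

```python
import numpy as np

def early_fusion_input(modal_features):
    """Joint representation [v_1; ...; v_M] fed to a single model h (Eq. 7.1)."""
    return np.concatenate(modal_features)

def late_fusion_average(modal_probs):
    """Fuse unimodal decision values, with averaging as F (Eq. 7.2)."""
    return np.mean(modal_probs, axis=0)

# Two toy modalities with different dimensionalities.
v1, v2 = np.array([0.2, 0.8, 0.5]), np.array([1.0, -1.0])
joint = early_fusion_input([v1, v2])      # 5-dimensional joint input for h

# Late fusion instead operates on decisions: a confident modality and a
# conflicting one are blended with no access to signal-level detail.
p1, p2 = np.array([0.9, 0.1]), np.array([0.4, 0.6])
p = late_fusion_average([p1, p2])         # -> [0.65, 0.35]
```

Note how late fusion cannot tell whether the disagreement between `p1` and `p2` stems from a genuinely ambiguous sample or from a weak modality, which is the limitation discussed above.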
7.2.2 Multimodal deep learning
Due to their superior performance and computationally tractable representation capability (in vector spaces)
in multiple domains such as visual, audio, and text, deep neural networks have gained tremendous popularity
in multimodal learning tasks [Ngiam et al., 2011, Ouyang et al., 2014, Wang et al., 2015a]. Typically,
domain-specific neural networks are used on the different modalities to generate their representations,
and the individual representations are merged or aggregated. Finally, the prediction is made on top
of the aggregated representation, usually with another neural network, to capture the interactions between
modalities and learn the complex function mapping between input and output. Addition (or averaging) and
concatenation are two common aggregation methods, i.e.,
$$u = \sum_m f_m(v_m) \qquad (7.3)$$
or
$$u = [f_1(v_1); \ldots; f_M(v_M)], \qquad (7.4)$$
where $f_m$ is a domain-specific neural network, $f_m : \mathbb{R}^{d_m} \to \mathbb{R}^{d}$ ($m = 1, \ldots, M$). Given the
combined vector output $u \in \mathbb{R}^{d}$ (or $\mathbb{R}^{\sum_m d_m}$), another network $g$ computes the final output,
$$p = g(u), \quad \text{where } g : \mathbb{R}^{d} \to \mathbb{R}^{K}. \qquad (7.5)$$
The network structure is illustrated in Figure 7.1(b). The arrows are function mappings or computing
operations. The dotted boxes are representations of single and combined modality features. We call these
additive combinations because their critical step is to add modality hidden vectors (although often in a
nonlinear way).
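To make the additive pipeline concrete, here is a minimal sketch in NumPy, with random linear maps standing in for the domain-specific networks $f_m$ and the head $g$; all dimensions here are hypothetical, not those used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality encoders f_m: R^{d_m} -> R^d (linear maps stand in
# for domain-specific networks), followed by a shared head g: R^d -> R^K.
d, K = 8, 3
dims = [5, 4]                       # two modalities with different input sizes
W_f = [rng.normal(size=(d, dm)) for dm in dims]
W_g = rng.normal(size=(K, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def additive_combine(vs):
    """Additive combination (Eqs. 7.3 and 7.5): sum encoder outputs, then classify."""
    u = sum(W @ v for W, v in zip(W_f, vs))   # u = sum_m f_m(v_m)
    return softmax(W_g @ u)                   # p = g(u)

p = additive_combine([rng.normal(size=5), rng.normal(size=4)])
```

The single head `W_g` sees only the summed representation `u`, so it alone must learn to discount whichever modality happens to be noisy on a given sample, which is exactly the burden discussed in the next section.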
In Section 7.5, we present related work in areas such as learning joint multimodal representations using
a shared semantic space. Those approaches are not directly applicable to our task, where we aim to
predict latent attributes, not merely the observed identities of the sample.
7.3 A multiplicative combination layer
The additive approaches discussed above make no assumptions regarding the reliability of different modal
inputs. As such, their performance critically relies on the single network $g$ to figure out the relative
emphases to be placed on the different modalities. From a modeling perspective, the aim is to recover the
function mapping between the combined representation $u$ and the desired outputs. This function can be
complex in real scenarios. For instance, when the signals are similar or complementary to each other, $g$ is
supposed to merge them to make a strengthened decision; when signals conflict with each other, $g$ should
filter out the unreliable ones and make a decision based primarily on the more reliable modalities. While
in theory $g$, often parameterized as a deep neural network, has the capability to recover an arbitrary
function given a sufficiently large (essentially unlimited) amount of data, it can be, in practice, very
difficult to train and regularize given the data constraints of real applications. As a result, model performance
degrades significantly.
Our aim is to design a more (statistically) efficient method by explicitly assuming that some modalities
are not as informative as others on a particular sample. As a result, they should not all be fed into a single
network for training. Intuitively, it is easier to train a model on the input of a good modality than on a
mix of good ones and bad ones. Here we differentiate modalities into informative modalities (good) and
weak modalities (bad). Note that the labels informative and weak apply with respect to each particular
sample.
To begin, let every modality make its own independent decision with its modality-specific model (e.g.,
$p_i = g_i(v_i)$). Their decisions are combined by taking an average. Specifically, we have the following
objective function,
$$\mathcal{L}_{ce} = \ell^{y}_{ce}, \qquad \ell^{y}_{ce} = -\sum_{i=1}^{M} \log p_i^{y}, \qquad (7.6)$$
where $y$ denotes the true class index, and we call $\ell^{y}$ a class loss as it is the part of the loss function associated
with a particular class. In the testing stage, the model predicts the class with the smallest class loss, i.e.,
$$\hat{y} = \arg\min_y \ell^{y}_{ce}. \qquad (7.7)$$
This relatively standard approach allows us to train one model per modality. However, when weak
modalities exist, the objective (7.6) increases significantly. Minimizing (7.6) forces every model
to perform well on the training data based on its own modality. This can lead to severe overfitting, as a noisy
modality simply does not contain the information required to make a correct prediction, yet the loss function
penalizes it heavily for incorrect predictions.
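The averaged per-modality baseline of (7.6) and its arg-min prediction rule (7.7) can be sketched as follows. This is a toy NumPy sketch with made-up probabilities, not the chapter's implementation.

```python
import numpy as np

def ensemble_class_losses(probs):
    """Class losses l^y_ce = -sum_i log p_i^y for every class y (Eq. 7.6).

    probs: (M, K) array; row i holds modality i's predicted distribution."""
    return -np.log(probs).sum(axis=0)

def predict(class_losses):
    """Predict the class with the smallest class loss (Eq. 7.7)."""
    return int(np.argmin(class_losses))

probs = np.array([[0.7, 0.3],   # modality 1
                  [0.6, 0.4]])  # modality 2
losses = ensemble_class_losses(probs)
y_hat = predict(losses)         # -> 0
```

Note that every modality contributes its full $-\log p_i^y$ penalty on every sample, which is exactly the overfitting pressure discussed above.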
7.3.1 Combine in a multiplicative way
To mitigate the problem of overfitting, we propose a mechanism to suppress the penalty incurred
on noisy signals from certain modalities. The cost on a modality is down-weighted when there exist other
good modalities for the example. Specifically, a modality is good (or bad) when it assigns a high (or
low) probability to the correct class. A higher probability indicates more informative signals and stronger
confidence. With that in mind, we design a down-weighting factor as follows,
$$q_i = \Big[\prod_{j \neq i} (1 - p_j)\Big]^{\beta/(M-1)}, \qquad (7.8)$$
where we omit the class index superscripts on $p$ and $q$ for brevity; $\beta$ is a hyper-parameter that controls the
strength of the down-weighting and is chosen by cross-validation. The new training criterion then becomes
$$\mathcal{L}_{mul} = \ell^{y}_{mul}, \qquad \ell^{y}_{mul} = -\sum_{i=1}^{M} q_i^{y} \log p_i^{y}. \qquad (7.9)$$
The scaling factor $[\prod_{j \neq i} (1 - p_j)]^{\beta/(M-1)}$ represents the average prediction quality of the remaining
modalities. This term is close to 0 when some $p_j$ are close to 1. When those other modalities ($j \neq i$) have
confident predictions on the correct class, the term has a small value, thus suppressing the cost on
the current modality ($p_i$). Intuitively, when other modalities are already good, the current modality ($p_i$)
does not have to be equally good. This down-weighting reduces the training requirement on each modality
and reduces overfitting. [Jin et al., 2016] uses this term to ensemble different layers of a convolutional
network in an image recognition task. We introduce the important hyper-parameter $\beta$ to control the strength of
these factors: larger values give a stronger suppressing effect and vice versa. During testing, we
follow a criterion similar to (7.7) (replacing $\ell_{ce}$ with $\ell_{mul}$).
We call this strategy a multiplicative combination due to the use of multiplicative operations in (7.8).
During training, the process always tries to select some modalities that give the correct prediction and
to tolerate mistakes made by other modalities. This tolerance encourages each modality to work best in its
own areas instead of on all examples.
We emphasize that $\beta$ implements a trade-off between ensembling and non-smoothed multiplicative
combination. When $\beta = 0$, we have $q = 1.0$ and the predictions from different modalities are averaged; when $\beta = 1$,
there is no smoothing at all on the $(1 - p_j)$ terms, so that a good modality strongly down-weights the losses from
the other modalities.
The proposed combination can be implemented as the last layer of a combination neural network, as
it is differentiable. The errors in (7.9) can be back-propagated to the different components of the model, so that
the model can be trained jointly.
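A minimal sketch of the multiplicative layer for a single sample follows (NumPy; it assumes each modality's predicted probability for the true class is given, and uses the smoothing hyper-parameter $\beta$ from (7.8)):

```python
import numpy as np

def multiplicative_loss(p_correct, beta=0.5):
    """Down-weighted multimodal loss (Eqs. 7.8-7.9) for one sample.

    p_correct: array of each modality's predicted probability for the true class.
    Returns the class loss sum_i q_i * (-log p_i)."""
    M = len(p_correct)
    loss = 0.0
    for i in range(M):
        others = np.delete(p_correct, i)
        q_i = np.prod(1.0 - others) ** (beta / (M - 1))  # Eq. 7.8
        loss += -q_i * np.log(p_correct[i])              # Eq. 7.9
    return loss

# A confident modality (0.99) suppresses the penalty on a weak one (0.05):
weak_with_strong = multiplicative_loss(np.array([0.99, 0.05]))
both_weak = multiplicative_loss(np.array([0.50, 0.05]))
# weak_with_strong < both_weak
```

Note how the confident first modality drives the factor $q$ on the weak one toward zero, so the weak modality is not forced to fit what may be noise.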
7.3.2 Boosted multiplicative training
Despite providing a mechanism to selectively combine good and bad modalities, the multiplicative
layer as configured above has a limitation: there is a mismatch between minimizing the
objective function and maximizing the desired accuracy. To illustrate this, we take one step back and
look at the standard cross-entropy objective function (7.6) (with $M = 1$). We have $\exp(-\ell^{1}) + \exp(-\ell^{2}) =
p^{1} + p^{2} = 1$ when $K = 2$. Let us call this normalized. It makes intuitive sense to minimize only $\ell^{y}$ in the
training phase so that we have $\ell^{y} < \ell^{y'}$ ($y' \neq y$), thus maximizing the accuracy.
However, if we look at (7.9), the same normalization no longer applies due to the complication
of multiple modalities ($M > 1$) and the introduction of the down-weighting factors $q_i$. Therefore, there
is no guarantee that minimizing $\ell^{1}$ drives $\ell^{1} < \ell^{2}$ or vice versa. There are two important
consequences of this mismatch. First, the method may stop minimizing the class loss on the correct
class while the prediction is still incorrect. Second, it may keep reducing class losses that already
yield correct predictions.
A tempting naive approach When addressing this issue, a tempting approach is to normalize the class
losses in the same way one normalizes a probability vector. A deeper consideration reveals the pitfall inherent in
that temptation: normalizing class losses does not make sense, because the class losses in an objective
function are error surrogates which usually serve as upper bounds of the training errors. While it makes
sense to minimize the surrogates on the correct classes, it is pointless, perhaps even counterproductive, to
maximize the losses on the wrong classes. What regular normalization techniques do is maximize
the gap between the losses on the correct and wrong classes, effectively minimizing the former and
maximizing the latter. Experimental results validate this analysis.
Boosting extension We propose a modification of the objective function in (7.9) to address the issue.
Rather than always placing a loss on the correct class, we place a penalty only when that class loss value
is not the smallest among all the classes. This creates a connection to the prediction mechanism in
(7.7). If the prediction is correct, there is no need to further reduce the class loss on that instance; if the
prediction is wrong, the class loss should be reduced even if the loss value is already relatively small. To
increase robustness, we add a margin formulation where the loss on the correct class should be smaller
by a margin $\delta$. Thus, the objective we use is as follows,
$$\mathcal{L} = \ell^{y}_{mul} \Big(1 - \prod_{\forall y' \neq y} \mathbb{1}\big(\ell^{y}_{mul} + \delta < \ell^{y'}_{mul}\big)\Big), \qquad (7.10)$$
where the bracketed part on the right-hand side of (7.10) computes whether the loss associated with the
correct class is the smallest (by a margin). The margin $\delta$ is chosen in experiments by cross-validation.
The new objective function aims to minimize only those class losses which still need improvement.
For examples that are already correctly classified, the loss is counted as zero. Therefore, the
objective only adjusts the losses that lead to wrong predictions. This makes model training and the desired
prediction accuracy better aligned.
Boosting connection There is a clear connection between the new objective (7.10) and boosting ideas
if we consider the examples on which (7.7) makes wrong predictions as hard examples and the others as easy
examples. The objective (7.10) looks only at the hard examples and directs effort toward improving their losses.
The hard examples change during the training process, and the algorithm adapts to that. Therefore, we
call the new training criterion boosted multiplicative training.
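The boosted criterion (7.10) for one sample can be sketched as follows (a toy NumPy sketch with a hypothetical margin value; the product of indicators checks whether the correct class beats every other class loss by the margin):

```python
import numpy as np

def boosted_multiplicative_loss(class_losses, y, margin=0.1):
    """Boosted training criterion (Eq. 7.10) for one sample.

    class_losses: multiplicative class losses l^y_mul for every class.
    The correct class incurs a loss only when it does not beat every other
    class loss by the margin, i.e. only 'hard' examples contribute."""
    others = np.delete(class_losses, y)
    wins_by_margin = np.all(class_losses[y] + margin < others)
    return 0.0 if wins_by_margin else class_losses[y]

# Already correct by a margin -> zero loss; otherwise keep l^y_mul.
easy = boosted_multiplicative_loss(np.array([0.2, 1.5, 2.0]), y=0)   # -> 0.0
hard = boosted_multiplicative_loss(np.array([1.4, 1.45, 2.0]), y=0)  # -> 1.4
```

The second call illustrates the boosting behavior: even though the correct class already has the smallest loss, it does not win by the margin, so training pressure remains on that example.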
7.4 Select modal mixtures
The multiplicative combination layer explicitly assumes that modalities are noisy and automatically selects
good ones. One limitation is that the models $g_i$ ($i = 1, \ldots, M$) are trained primarily on a single modality
(although they do receive back-propagated errors from the other modalities through joint training). This
prevents the method from fully capturing synergies across modalities. In a Twitter example, a user's follower
network and followee network are two modalities that are different but closely related. They jointly
contribute to predictions concerning the user's interests, etc. The multiplicative combination in Section
7.3 would not be ideal for capturing such correlations. On the other hand, additive methods are able to
capture modal correlation more easily by design (although they do not explicitly handle modality noise
and conflicts).
7.4.1 Modality mixture candidates
Given the complementary qualities of the additive and multiplicative approaches, it is desirable to harness
the advantages of both. To achieve that goal, we propose a new method. At a high level, we want our
method to first capture all possible interactions between the different modalities and
then to filter out noise and pick out the useful signals.
In order to model interactions of different modalities, we first create different mixtures of
modalities. In particular, we enumerate all possible mixtures from the power set of the set of modality
features. On each mixture, we apply the additive operation to extract higher-level feature representations
as follows,
$$u_c = \sum_{k \in \mathcal{M}_c} f(v_k), \qquad \mathcal{M}_c \subseteq \{1, 2, \ldots, M\}, \qquad (7.11)$$
where $\mathcal{M}_c$ contains one or more modalities. Thus we have $u_c$ as the representation of the mixture of
modalities in the set $\mathcal{M}_c$; it gathers signals from all the modalities in $\mathcal{M}_c$. Since there are $2^M - 1$ different
non-empty sets $\mathcal{M}_c$, there are $2^M - 1$ vectors $u_c$, and each $u_c$ looks into a different modality mixture. We
call each $u_c$ a mixture candidate, as we believe not every mixture is equally useful; some mixtures may
be very helpful to model training while others could even be harmful.
Given the generated candidates, we make predictions based on each of them independently. Concretely,
as in the additive approach, a neural network is used to make a prediction $p_c$ as follows,
$$p_c = g_c(u_c), \qquad (7.12)$$
where $p_c$ is the prediction result from an individual mixture. Different $p_c$ may not agree with each other.
It remains to have a mechanism to select which one to believe, or how to combine the $p_c$.
7.4.2 Mixture selections
Among the combination candidates generated above, it is not clear which mixtures are strong and which
are weak, due to the way the proposals are enumerated. One simple option is to average the predictions from all
candidates. However, this loses the ability to discriminate between different modalities and again treats all
modalities as equally useful. From a modeling perspective, it is similar to simply applying additive approaches to
the modalities in the first place. Our goal is to automatically select strong candidates and ignore weak ones.
To achieve that, we apply the multiplicative combination layer (7.9) from Section 7.3 to the selection of
the mixture candidates in (7.12), i.e.,
$$\ell^{y} = -\sum_{c} q_c^{y} \log p_c^{y}, \qquad (7.13)$$
where the sum runs over all $2^M - 1$ mixture candidates and $q_c$ is defined analogously to (7.8).
Equation (7.13) follows (7.9), except that each model here is based on a mixture candidate instead of
a single modality.
With (7.10), (7.11), (7.12), and (7.13), our method pipeline can be illustrated as in Fig. 7.1(d). It first additively
creates modality mixture candidates. Such candidates can be features from one single modality or
mixed features from multiple modalities. By design, these candidates make it more straightforward to
consider signal correlation and complementarity across modalities. However, it is unknown which candidate
is good for a given example; some candidates can be redundant and noisy. The method then combines the
predictions of the different mixtures multiplicatively. The multiplicative layer enables candidate selection in
an automatic way, where strong candidates are picked while weak ones are ignored without dramatically
increasing the overall objective function. As a whole, the model is able to pick the most useful modalities
and modality mixtures with respect to the prediction task.
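The candidate-generation step (7.11) can be sketched with a power-set enumeration. This is a toy sketch; the shared encoder $f$ mapping each modality to a common dimension is omitted, and the inputs below are hypothetical already-encoded vectors.

```python
import itertools
import numpy as np

def mixture_candidates(features):
    """Enumerate all 2^M - 1 non-empty modality mixtures and sum each (Eq. 7.11).

    features: list of M per-modality vectors, assumed already mapped to a
    shared dimension d by the encoder f (omitted here)."""
    M = len(features)
    candidates = []
    for r in range(1, M + 1):
        for subset in itertools.combinations(range(M), r):
            candidates.append(sum(features[k] for k in subset))
    return candidates  # each u_c would feed its own classifier g_c (Eq. 7.12)

feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
cands = mixture_candidates(feats)
# 2^3 - 1 = 7 candidates; Eq. 7.13 then applies the multiplicative layer to
# the seven candidate predictions p_c instead of single-modality predictions.
```

The exponential number of candidates is manageable for the small $M$ considered in this chapter (two or three modalities), which is the regime where this enumeration is practical.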
7.5 Related work
Multimodal learning Traditional multimodal learning methods include early fusion (i.e., feature
based), late fusion (i.e., decision based), and hybrid fusion [Atrey et al., 2010]. They
also include model-based fusion such as multiple kernel learning [Gönen and Alpaydın, 2011,
Bucak et al., 2014, Gehler and Nowozin, 2009] and graphical model based approaches [Nefian et al., 2002,
Ghahramani and Jordan, 1996, Gurban et al., 2008].
Deep neural networks are very actively explored in multimodal fusion [Ngiam et al., 2011]. They
have been used to fuse information for audio-visual emotion classification [Wöllmer et al., 2010], gesture
recognition [Neverova et al., 2016], affect analysis [Kahou et al., 2016], and video description generation
[Jin and Liang, 2016]. While the modalities used, architectures, and optimization techniques might
differ, the general idea of fusing information in a joint hidden layer of a neural network remains the same.
Multiplicative combination techniques Multiplicative combination is widely explored in machine learning
methods. [Changpinyo et al., 2013] uses an OR graphical model to combine similarity probabilities
across different feature components. The probabilities of dissimilarity between pairs of objects are
multiplied to generate the final probability of being dissimilar, thus picking out the most optimistic
component. [Jin et al., 2016] ensembles multiple layers of a convolutional network with a down-weighting
objective function which is a specialized instance of our (7.8). Our objective is more general and flexible.
Furthermore, we develop a boosted training strategy and modal combination to address the multimodal
classification challenges. [Lin et al., 2017] develops the focal loss to address the class imbalance issue; in its
single-modality setting, it down-weights the loss by one minus its own probability.
Attention techniques [Ba et al., 2014, Chan et al., 2016, Mnih et al., 2014] can also be treated as
multiplicative methods to combine multiple modalities. Features from different modalities are dynamically
weighted before being mixed together. The multiplicative operation is performed at the feature level instead
of the decision level.
Other multimodal tasks There are other multimodal tasks where the ultimate goal is not classification.
These include various image captioning tasks. In [Vinyals et al., 2015], a CNN
image representation is decoded using an LSTM language model. In [Jia et al., 2015], gLSTM incorporates
the image data together with sentence decoding at every time step, fusing visual and sentence
data in a joint representation. Joint multimodal representation learning is also used for
visual and media question answering [Gao et al., 2015, Malinowski et al., 2015, Xu and Saenko, 2016],
visual integrity assessment [Jaiswal et al., 2017], and personalized recommendation [Hidasi et al., 2016,
Liu and Natarajan, 2017].
7.6 Experiments
We validate our methods on three datasets from different domains: image recognition, physical process
classification, and user profiling. In these tasks, we are given more than one input modality and try to
best use these modalities to achieve good generalization performance. Our code is publicly available.¹
7.6.1 Setup
CIFAR-100 Image Recognition
The CIFAR-100 dataset [Krizhevsky and Hinton, 2009] contains 50,000 and 10,000 32×32 color images
from 100 classes for training and testing purposes, respectively. As observed
by [Yang and Ramanan, 2015, Hariharan et al., 2015], different layers of a convolutional neural network
(CNN) contain different signals of an image (at different abstraction levels) that may be useful
for classification on different examples. [Jin et al., 2016] takes three layers of networks in networks
(NINs) [Lin et al., 2013] and demonstrates recognition accuracy improvements. In our experiments, the
features from three different layers of a CNN are regarded as three different modalities.
Network architecture We use Resnet [He et al., 2016a] as the network in our experiments, as it significantly
outperforms NINs on this task; meanwhile, Resnet also has a block structure that makes our choice of
modalities easier and more natural. We experimented with the architectures Resnet-32 and Resnet-110.
In both networks there are three residual units, and we take the hidden states of the three units
as modalities. We follow [Jin et al., 2016] and weight the losses of the different layers by (0.3, 0.3, 1.0). Our
implementations are based on [Gomez et al.].
Methods We experimented with different methods: (1) Vanilla Resnet ("Base") [Gomez et al.] predicts the
image class based only on the last layer output; i.e., there is only one modality. (2) Resnet-Add ("Add")
concatenates the hidden nodes of the three layers and builds fully connected neural networks (FCNNs) on
top of the concatenated features. We tuned the network structure and found that a two-layer network with
256 hidden nodes gives the best result. (3) Resnet-Mul ("Mul") multiplicatively combines the predictions
from the hidden nodes of the three layers. (4) Resnet-MulMix ("MulMix") uses multiplicative modal mixture
combination on the three hidden layers with the default value β = 0.5. (5) Resnet-MulMix* ("MulMix*") is
the same as MulMix, except that β is tuned between 0 and 1.

¹ https://github.com/skywaLKer518/MultiplicativeMultimodal
Training details We strictly follow [Gomez et al.] to train Resnet and all other models. Specifically,
we use SGD with momentum, a fixed learning rate schedule {0.1, 0.01, 0.001}, and terminate at 80,000
iterations. We use a batch size of 100 and a weight decay of 0.0002 on all network weights.
HIGGS Classification
HIGGS [Baldi et al., 2014] is a binary classification problem: distinguishing between a signal process
which produces Higgs bosons and a background process which does not. The data were produced
using Monte Carlo simulations. We have two feature modalities: low-level and high-level features. The low-level
features are 21 kinematic properties measured by the particle detectors in the
accelerator. The high-level features are another 7 features that are functions of the first 21; they were
derived by physicists to help discriminate between the two classes. Details of the feature names are in the
original paper [Baldi et al., 2014]. We follow the setup in [Baldi et al., 2014] and use the last 500,000
examples as a test set.
"HIGGS-small" and "HIGGS-full" To investigate algorithm behavior under different data scales,
we also randomly down-sample 1/3 of the examples from the entire training split. This creates another
subset which we call "HIGGS-small."
Network architecture We use feed-forward deep networks on each modality. We follow
[Baldi et al., 2014] in using 300 hidden nodes in our networks and tried different numbers of layers.
L2 weight decay is used with coefficient 0.00002. Dropout is not used, as it hurt performance in our
experiments. Network weights are randomly initialized, and SGD with momentum 0.9 is used during
training.
Methods We experimented with single-modality prediction, late fusion of the two modalities, and modal
combination methods similar to those described above for CIFAR100.
Gender Classification
Gender The dataset we use contains 7.5 million users of the Snapchat app, with registered userids and user
activity logs (e.g., story posts, friend networks, messages sent, etc.). It also contains inferred user first
names produced by an internal tool. The task is to predict user gender. Users' inputs in the Bitmoji app are
used as the ground truth. We randomly shuffle the data and use 6 million samples for training, 500K for
development, and 1 million for testing.
There are three modalities in this dataset: the userid as a short text, the inferred user first name as a letter
string, and dense features extracted from user activities (e.g., the count of messages sent, the number of
male or female friends, the count of stories posted).
Gender-6 and Gender-22 We experimented with two versions of the dataset. The versions differ in the
richness of the user activity features: the first has 6 highly engineered features (gender-6) and the other
has 22 features (gender-22).
Network architecture We use FCNNs to model the dense features. We tuned the architecture and eventually
used a 2-layer network with 2000 hidden nodes. We use (character-based) Long Short-Term Memory
networks (LSTMs) [Hochreiter and Schmidhuber, 1997] to model the text strings. The text string is fed into the
network one character at a time, and the hidden representation of the last character is connected
to FCNNs to predict the gender. We find that vanilla single-layer LSTMs outperform or match other
variants, including multi-layer, bidirectional [Graves et al., 2013], and attention-based LSTMs. We believe
this is due to the fact that we have a sufficiently large amount of data. We also experimented with character-based
Convolutional Neural Networks (char-CNN) [Kim, 2014, Zhang et al., 2015] and CNNs+LSTMs for text
modeling and found that LSTMs perform slightly better.
Training details Our tuned LSTM has 1 layer with hidden size 512. It is trained with
ADAM [Kingma and Ba, 2014] with learning rate 0.0001 and learning rate decay 0.99. Gradients
are truncated at 5.0. We stop model training when there is no improvement on the development set
for 15 consecutive evaluations.
Methods In addition to the methods described for CIFAR100 and HIGGS, we also experimented with an
attention-based combination method [Moon et al., 2018].
Table 7.1: Test error rates/AUC comparisons on CIFAR100, HIGGS, and gender tasks.
MulMix uses the default value β = 0.5. MulMix* tunes β between 0 and 1. Experimental results
are from 5 random runs. The best and second best results in each row are in bold and italic,
respectively.

|                           | Base                | Fuse | Add [1]             | Mul       | MulMix    | MulMix*   |
|---------------------------|---------------------|------|---------------------|-----------|-----------|-----------|
| cifar100, resnet-32, Err  | 30.3±0.2 (30.0 [2]) | -    | 29.4±0.4            | 29.3±0.2  | 27.8±0.3  | 27.3±0.4  |
| cifar100, resnet-110, Err | 26.5±0.3 (26.4 [2]) | -    | 27.2±0.4            | 25.3±0.4  | 25.1±0.2  | 24.7±0.3  |
| higgs-small, Err          | 23.3                | 22.5 | 22.3±0.1            | 21.8±0.1  | 21.4±0.1  | 21.2±0.1  |
| higgs-small, AUC          | 84.8                | 85.9 | 86.2±0.1            | 86.5±0.1  | 87.1±0.1  | 87.2±0.1  |
| higgs-full, Err           | 21.7                | 20.6 | 20.0±0.1            | 20.1±0.1  | 19.6±0.2  | 19.4±0.1  |
| higgs-full, AUC           | 86.6                | 88.0 | 88.6±0.1 (88.5 [3]) | 88.3±0.1  | 88.8±0.2  | 89.1±0.1  |
| gender-6, Err             | 15.4                | 7.97 | 6.07±0.02           | 6.05±0.02 | 5.90±0.02 | 5.86±0.02 |
| gender-22, Err            | 10.1                | 5.15 | 3.85±0.03           | 3.83±0.03 | 3.70±0.02 | 3.66±0.01 |
7.6.2 Results
Accuracy comparisons
The test accuracy comparisons are reported in Table 7.1.²
CIFAR100 Compared to the vanilla Resnet model (Base), additive modal combination (Add) does not
necessarily help improve test error. Particularly, it helps Resnet-32 but not Resnet-110. It might be due to
overfitting on Resnet-110 as there is already much more parameters.
Multiplicative training (Mul), on the contrary, helps reduce error rates on both models, demonstrating
a better capability of extracting signals from different modalities.
Further, MulMix and MulMix*, which are designed to combine the strengths of additive and multi-
plicative combination, give a significant boost in accuracy on both models.
HIGGS Both the fusion model and additive combination give a significant error-rate reduction compared to
a single modality. This is expected, as it is intuitive to aggregate the low- and high-level feature modalities.
Note: [1] [Ngiam et al., 2011]; [2] [Gomez et al.]; [3] [Baldi et al., 2014].
Figure 7.2: Comparisons to results from deeper networks. Error rates and standard deviations
from fusion networks with deeper hidden-layer structures are reported and compared to our models
(i.e., MulMix and MulMix*). Simply going deeper in the networks does not necessarily improve
generalization. Experimental results are from 5 random runs.
[Figure: three panels plotting error rates against the number of fusion-network layers:
(a) CIFAR100 (resnet110) with hidden sizes h=128 and h=256; (b) HIGGS (full) with h=300 and
h=500; (c) gender (22) with h=300 and h=500. Each panel also shows the MulMix and MulMix*
results for reference.]
Compared to Add, multiplicative combination has clearly better results on higgs-small but slightly
worse results on higgs-full. This can be explained by the fact that models are more prone to overfitting on
smaller datasets, and multiplicative training does reduce that overfitting.
Finally, MulMix and MulMix* give a significant boost on both the small and full datasets.
Gender Combining multiple modalities gives the most dramatic improvements here, due to the high
level of noise in each modality. Add achieves less than half the error rate of the best single modality.
As a comparison, Mul has similar (slightly better) results. This suggests the two methods
might work through similar mechanisms. However, MulMix and MulMix* clearly outperform Add and Mul,
showing the benefit of combining the two types of combination strategies.
Compared to deeper fusion networks
As the new approach, especially MulMix (or MulMix*), introduces additional parameters in the fusion
networks of each mixture, one natural question is whether the improvements simply come from the increased
number of parameters. We answer this question by running additional experiments with additive
combination (Add) models using deeper fusion networks.
The results are plotted in Figure 7.2. We see that on CIFAR100 and gender, networks with increased depth
lead to worse results. This can be due either to increased optimization difficulty or to overfitting. On HIGGS,
increased depth first leads to slight improvements and then the error rates go up again. Even the
results at the optimal network depth are not as good as those of our approach. Overall, the figures show that it is
the design rather than the depth of the fusion networks that holds back their performance. In contrast, our
approaches are explicitly designed to extract signals selectively and collectively across modalities.
Figure 7.3: Error rates and standard deviations under different β values. Optimal results do not
appear at either β=0 or β=1. Experimental results are from 5 random runs.
[Figure: three panels plotting error rates against β for the Mul and MulMix models:
(a) CIFAR100 (resnet110); (b) HIGGS (full); (c) gender (22).]
Table 7.2: Error rate results of boosted training (MulMix) on HIGGS-full and gender-22.

              MulMix*     β=1.0 (vanilla)   β=1.0 (boosted)
HIGGS (full)  19.5±0.1    19.8±0.3          19.5±0.1
gender-22     3.66±0.01   3.72±0.02         3.67±0.01
Multiplicative combination or ensemble
Our loss function in (7.8) implements a trade-off between a modal ensemble and multiplicative combination.
β=0 makes it an ensemble of different modalities (or modal mixtures), while β=1 makes the model a
non-smoothed multiplicative combination. To understand the exact working mechanism and to achieve
the best results, we tune β between 0 and 1 and plot the corresponding error rates on the different tasks in Figure
7.3.
We observe that the optimal results do not appear at either end of the spectrum. On the con-
trary, smoothed multiplicative combinations with optimal β achieve significantly better results than a pure
ensemble or a pure multiplicative combination. On CIFAR100 and HIGGS, we see optimal β values of 0.3 and 0.8,
respectively, and they are consistent across the Mul and MulMix models. On gender, Mul clearly favors β
close to 1, as each single modality is very noisy and it makes less sense to evenly ensemble predictions from
the different modalities.
We do not have a clear theory for how to choose β automatically. Our hypothesis is that a smaller β leads to
stronger regularization (due to smoothed scaling factors) while a larger β gives more flexibility in modeling
(a highly non-linear combination). As a result, we recommend choosing a smaller β when the original models
overfit and a larger β when they underfit.
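The spectrum that β spans can be illustrated with a small self-contained sketch. Note that this geometric interpolation between an averaged ensemble (β=0) and a normalized product of modality probabilities (β=1) is only an illustration of the two endpoints, not the exact loss (7.8); `combine` is a hypothetical helper.

```python
def combine(modal_probs, beta):
    """Schematic interpolation between an ensemble (beta=0) and a
    multiplicative combination (beta=1) of per-modality class probabilities.

    modal_probs: list of per-modality probability vectors over the classes.
    """
    n_classes = len(modal_probs[0])
    # Arithmetic ensemble of the modalities (the beta=0 endpoint).
    avg = [sum(p[c] for p in modal_probs) / len(modal_probs) for c in range(n_classes)]
    # Product of modality probabilities (the beta=1 endpoint).
    prod = [1.0] * n_classes
    for p in modal_probs:
        for c in range(n_classes):
            prod[c] *= p[c]
    # Geometric interpolation between the two, renormalized over classes.
    raw = [(avg[c] ** (1.0 - beta)) * (prod[c] ** beta) for c in range(n_classes)]
    z = sum(raw)
    return [r / z for r in raw]
```

With two disagreeing modalities, e.g. `[[0.9, 0.1], [0.2, 0.8]]`, β=0 reproduces the averaged ensemble and β=1 the normalized product, with intermediate β landing in between.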
Table 7.3: Gender-22 error analysis: mistakes that multimodal methods make where individual
modalities do not (we call this “over-learning”); Mul and MulMix* improve on Add in this respect.
The overall improvement is very close to the improvement on “over-learning” errors.

             Add    Mul    MulMix*   improvement
Overall      3.85   3.83   3.66      0.19
Over-learn   2.90   2.87   2.72      0.18
Boosted training
We also validate the effectiveness of the boosted training technique. We find that when β in (7.8) is not tuned,
boosted training significantly improves the results. Table 7.2 shows MulMix(β=1) test errors on HIGGS
and gender. Boosted training helps MulMix(β=1.0) achieve almost identical results to MulMix*. It is
interesting to see that the second and fourth columns have very close numbers. We conjecture that the smoothing
effect of β makes the “mismatch” issue discussed in Section 7.3.2 less severe.
Additional experiments
Where are the improvements made? We are interested in seeing where the improvements are made on
this prediction task. It is known that ensemble-like methods help correct predictions on examples where
individual classifiers make wrong predictions. However, they also make mistakes on examples where
individual classifiers are correct. This is in general due to overfitting, and we call it “over-learning.”
We expect our methods to reduce “over-learning” errors due to their regularization mechanism: we
tolerate incorrect predictions from a weak modality while preserving its correct predictions. We therefore analyze
the errors only on the examples where individual modalities could make correct predictions; that is,
we evaluate the errors on the examples on which at least one single modality predicts correctly.
The results are reported in Table 7.3. We see that Mul and MulMix* both make fewer “over-learning” mistakes.
Interestingly, the improvement of MulMix* here (0.18) is very close to the improvement on the entire
dataset. This suggests our new methods do prevent individual modalities from “over-learning.”
Compared to attention models We also tried attention methods [Moon et al., 2018], where attention
modules are applied to each modality before they are additively combined. We experimented on gender prediction
because missing modalities are most common on this task. The results are reported in Table 7.4. We
do not observe clear improvements.
Table 7.4: Test errors of attention models on gender tasks.

            Add         Add-Attend
gender-6    6.07±0.02   6.07±0.01
gender-22   3.85±0.03   3.86±0.03
Table 7.5: Examples of MulMix* prediction results from single modalities and modality mixtures
in the gender prediction task. A, B, C denote single modalities. For every sample, modal mixtures are
sorted by their prediction probabilities of the correct class (the numbers in the table). Blue (red)
color indicates the probability leads to a correct (incorrect) prediction.
Truth A B C Modal mixtures (sorted by the probability of the correct class)
M .23 .52 .79 BC .86 C .79 ABC .70 AC .60 B .52 A .23 AB .20
M .65 .75 .26 AB .78 B .75 A .65 ABC .63 BC .57 AC .43 C .26
F .57 .40 .66 AC .70 C .66 A .57 ABC .51 BC .49 B .40 AB .40
F .57 .66 .72 ABC .89 BC .78 AC .77 AB .72 C .72 B .66 A .57
Compared to CLDL [Jin et al., 2016] on CIFAR100 Specific to the image recognition domain,
CLDL [Jin et al., 2016] is one specialization of our “Mul” approach based on NIN [Lin et al., 2013]. We
implemented CLDL on Resnet. The error rates on the two models are 29.6±0.5 and 25.8±0.4, respectively.
Qualitative analysis of MulMix results To understand how the multiplicative modal mixture combination
method works in practice, we present its prediction results on a few gender prediction samples in
Table 7.5. We print the prediction probabilities of the correct class for the different modalities as well as
the different modal mixtures. Blue (red) color indicates a probability that would lead to a correct (incorrect)
prediction. To see the usefulness of modal mixtures, we sort the modal mixtures (including single
modalities) by the prediction probability of the correct class.
We see that, given single modalities with predictions of varying quality, the modal mixtures generated
by MulMix* also give predictions with different reliabilities. In general, better modal mixtures
(appearing more to the left in the sorted row) often come from good single modalities (which give a correct
prediction with relatively high confidence); examples include the BC mixture in user 1 and the AC mixture
in user 3. On the contrary, worse modal mixtures (appearing more to the right) often contain a really
weak modality; examples include AB in user 1, AC in user 2, and AB in user 3. These predictions based on
modal structures are fed into the multiplicative combination step to generate the final score.
7.7 Summary
This chapter investigates new ways to combine multimodal data that account for heterogeneity of modal
signal strength across modalities, both in general and at a per-sample level. We focus on addressing the
challenge of “weak modalities”: some modalities may provide better predictors on average, but worse
ones for a given instance. To exploit these facts, we propose multiplicative combination techniques that tolerate
errors from the weak modalities and help combat overfitting. We further propose multiplicative combination
of modal mixtures to combine the strengths of the proposed multiplicative combination and existing
additive combination. Our experiments on three different domains demonstrate consistent accuracy
improvements over the state of the art in those domains, thereby demonstrating that our new
framework represents a general advance that is not limited to a specific domain or problem.
Chapter 8
Conclusions
8.1 Summary of contributions
The main contribution of this thesis is that of advancing item recommendation performance by addressing
scalability challenges. In particular, we proposed machine learning solutions to address two major
types of challenges: increasing data volume and rich-format signals. To tackle the former, we developed
several efficient ranking algorithms to handle large item sets and to capture complex patterns in sparse
input signals. To tackle the latter, we advanced content-based approaches by modeling heterogeneous
attributes and by fusing information from multiple modalities.
8.1.1 Efficient rankers
In Chapter 3 we focus on recommendation from large item sets. Existing methods have difficulty
accurately estimating the ranks of target items. This degrades recommendation accuracy and makes
training agonizingly slow. Our key observation is that the pairwise sampling algorithm component in
existing methods holds back the performance and may be replaced with a new batchwise algorithm. Our
newly designed batch-based rank approximations turn out to combine the advantages of stable estimation
and the use of parallel computation. In addition, we make the training end-to-end by designing smooth loss
functions that automatically encourage top accuracy, which is desired in ranking metrics. Note that this
approach is not limited to a particular type of model, such as matrix factorization, but supports a variety
of models thanks to its end-to-end training fashion.
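The batch-based rank approximation idea can be sketched as follows. The margin value, the log weighting, and the function name `batch_rank_loss` are illustrative assumptions in the spirit of our WMRB-style construction, not the exact loss developed in Chapter 3.

```python
import math

def batch_rank_loss(pos_score, neg_scores, n_items, margin=1.0):
    """Estimate the margin-rank of a positive item from a sampled batch of
    negatives, scale it up to the full item set, and apply a log weighting
    that emphasizes accuracy at the top of the ranked list."""
    # Continuous rank estimate: sum of margin violations over the batch,
    # scaled from the batch size to the catalog size.
    violations = [max(0.0, margin - pos_score + s) for s in neg_scores]
    rank_est = (n_items / len(neg_scores)) * sum(violations)
    # log(1 + rank) grows slowly, so errors near the top dominate the loss.
    return math.log(1.0 + rank_est)
```

A well-ranked positive (scoring above every sampled negative by the margin) incurs zero loss; a poorly ranked one incurs a loss that grows only logarithmically with its estimated rank.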
In Chapter 4 we focus on the sequence properties of user feedback. Rather than applying traditional ranking
methods to incorporate sequence information, we innovate by applying recurrent neural networks to model user
behavior as a sequence generation problem. This is the first time recurrent neural networks have been explored in
general recommendation domains. We find that the new approach significantly outperforms non-sequence
methods like matrix factorization variants, as well as existing sequence methods based on word2vec-style
models.
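The sequence-generation framing can be sketched as a data-preparation step: each user's time-ordered feedback becomes a set of next-item prediction pairs. The function `to_sequences` and its field layout are illustrative, not code from the thesis.

```python
def to_sequences(events, min_len=2):
    """Turn (user, item, timestamp) feedback logs into next-item
    training pairs: the input is a time-ordered item prefix, and the
    target is the item the user interacted with next."""
    by_user = {}
    for user, item, ts in events:
        by_user.setdefault(user, []).append((ts, item))
    pairs = []
    for user, hist in by_user.items():
        items = [item for _, item in sorted(hist)]
        if len(items) < min_len:
            continue  # a single event yields no next-item target
        for t in range(1, len(items)):
            pairs.append((items[:t], items[t]))  # (prefix, next item)
    return pairs
```

A recurrent network is then trained to predict the target item from the prefix, which is exactly where the event order (shown in Chapter 4 to carry important signal) enters the model.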
We further investigate the reasons why sequence models help recommendation. We conduct additional
experiments that corrupt the sequence information and vary the size of the training datasets. We find
that the order of the input events is vital to achieving good performance, which suggests that important
information is encoded in the sequence of user behaviors. We also find that sequence models show additional
advantages with more users in the training set. This further confirms that user activity patterns exist
in event sequences (rather than in i.i.d. user-item pairs) and suggests the advantages of sequence models in
large-scale settings.
In Chapter 5 we develop simple yet effective time re-weighting methods to explicitly incorporate
temporal factors into recommendation. Temporal factors are important, but are often hidden behind other
factors. We designed a pairwise sampling learning method that carefully constructs a training dataset to
isolate sample triplets that are mostly controlled by temporal factors. Our method is explicit and easy to
interpret, and it is applicable to different models as well. Our experiments on two recommendation models
demonstrate both an accuracy boost and a significant reduction in training time.
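One simple form such time re-weighting can take is an exponential recency weight on training interactions. The half-life parameterization below is an illustrative assumption, not the exact scheme of Chapter 5.

```python
def recency_weight(event_time, now, half_life=30.0):
    """Exponentially down-weight older interactions: an event that is
    `half_life` days old counts half as much as one from today.
    Times are in days on a common clock."""
    age_days = now - event_time
    return 0.5 ** (age_days / half_life)
```

Such weights can multiply per-sample losses, or bias the sampling of training triplets toward recent interactions.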
8.1.2 Content-based methods
In Chapter 6 we advance content-based methods by developing new techniques to embed heterogeneous
attributes. We improve standard attribute embedding techniques to account for the special challenges of
heterogeneous attributes by exploiting recurrent neural networks and additional regularizations. Our techniques
are orthogonal to existing methods and bring additional performance improvements. Our evaluation
also shows that the embedding performs automatic structure discovery in the heterogeneous attribute
domains.
In Chapter 7 we improve multimodal deep learning methods to efficiently fuse information from
different modalities. Existing methods apply a similar “additive” idea to aggregate signals from different
modalities and fuse information with a single fusion network. This ignores the varying levels of noise in
multimodal learning and is limited in practice. We develop new methods to combine modalities in
general as well as on a per-sample basis. Our methods give special attention to certain weak modalities
and promote robustness in them. We validate our methods on datasets from three different domains and
consistently show significant improvements.
8.2 Conclusions and perspectives
This thesis has demonstrated the benefits of building learning algorithms to address item recommendation
scalability challenges. It has proposed and evaluated a series of innovative approaches to substantiate this
claim. Our work may inspire other researchers to continue efforts on this topic.
For example, in Chapter 3 our methods are purposely designed to support a variety of models and
recommendation scenarios. They are not limited to hybrid matrix factorization but are readily applicable
to other models such as neural-network-based methods. We expect our algorithms to help improve
results on other models as well.
In Chapter 7 our algorithms combine input from multiple modalities. However, the algorithms
can also be seen as a way to conduct ensemble learning from multiple weak model results. Given that our
methods trade off between model capability and regularization, we expect the methods to work on a
wide range of ensemble tasks as well.
Also, in Chapter 3 our rank-sensitive objective functions do not have a specific preference for positions
of interest. However, this might be of interest in practical applications. For example, an
online retailer may particularly focus on the NDCG value at position 20 due to the maximum content
display. It is possible to extend our approaches to account for such special needs. We expect future
research tailored to such objective functions.
In Chapter 5 our method offers one alternative way to model temporal factors, in addition to other temporal
methods. Our method is easy to interpret and to apply to different models. As a comparison, Hawkes
process-based methods such as [Dai et al., 2016] also demonstrate competitive performance. We expect
to see comparisons between these two methods.
Finally, in Chapter 7 our multimodal deep learning approach was evaluated on classification tasks.
While the ranking problems behind recommendation tasks are closely related to classification, additional
modifications to the objective functions and training procedure are needed to tailor the approach to recommendation
tasks. We expect to see future study along this line.
8.3 Supporting publications
This thesis is supported by the following publications.
Liu, Kuan and Shi, Xing and Kumar, Anoop and Zhu, Linhong and Natarajan, Prem. Temporal
learning and sequence modeling for a job recommender system. Proceedings of the Recommender
Systems Challenge (RecSys Challenge), 2016
Liu, Kuan and Natarajan, Prem. WMRB: Learning to Rank in a Scalable Batch Training
Approach. Poster Proceedings of 11th ACM Conference on Recommender Systems (RecSys),
2017.
Liu, Kuan and Shi, Xing and Natarajan, Prem. Sequential Heterogeneous Attribute Embed-
ding for Item Recommendation. 2017 IEEE International Conference on Data Mining Workshops
(ICDMW), 2017.
Liu, Kuan and Natarajan, Prem. A Batch Learning Framework for Scalable Personalized Ranking.
2018 AAAI Conference on Artificial Intelligence (AAAI), 2018.
Liu, Kuan and Shi, Xing and Natarajan, Prem. A Sequential Embedding Approach for Item Rec-
ommendation with Heterogeneous Attributes. In submission to IEEE Transactions on Knowledge
and Data Engineering (TKDE), 2018.
Liu, Kuan and Li, Yanen and Xu, Ning and Natarajan, Prem. Learn to Combine Modalities in
Multimodal Deep Learning. In submission to Neural Information Processing Systems (NIPS),
2018.
Bibliography
[Abel et al., 2016] Abel, F., Benczúr, A., Kohlsdorf, D., Larson, M., and Pálovics, R. (2016). Recsys
challenge 2016: Job recommendations. Proceedings of the 2016 International ACM Recommender
Systems.
[Abel et al., 2011] Abel, F., Gao, Q., Houben, G.-J., and Tao, K. (2011). Analyzing temporal dynamics
in twitter profiles for personalized recommendations in the social web. In Proceedings of the 3rd
International Web Science Conference, page 2. ACM.
[Agarwal, 2011] Agarwal, S. (2011). The infinite push: A new support vector ranking algorithm that
directly optimizes accuracy at the absolute top of the list. In Proceedings of the 2011 SIAM Interna-
tional Conference on Data Mining, pages 839–850. SIAM.
[Atrey et al., 2010] Atrey, P. K., Hossain, M. A., El Saddik, A., and Kankanhalli, M. S. (2010). Multi-
modal fusion for multimedia analysis: a survey. Multimedia systems, 16(6):345–379.
[Ba et al., 2014] Ba, J., Mnih, V., and Kavukcuoglu, K. (2014). Multiple object recognition with visual
attention. arXiv preprint arXiv:1412.7755.
[Baldi et al., 2014] Baldi, P., Sadowski, P., and Whiteson, D. (2014). Searching for exotic particles in
high-energy physics with deep learning. Nature communications, 5:4308.
[Baltrunas et al., 2011] Baltrunas, L., Ludwig, B., and Ricci, F. (2011). Matrix factorization techniques
for context aware recommendation. In Proceedings of the fifth ACM conference on Recommender
systems, pages 301–304. ACM.
[Bansal et al., 2016] Bansal, T., Belanger, D., and McCallum, A. (2016). Ask the gru: Multi-task learn-
ing for deep text recommendations. In Proceedings of the 10th ACM Conference on Recommender
Systems, pages 107–114. ACM.
[Barragáns-Martínez et al., 2010] Barragáns-Martínez, A. B., Costa-Montenegro, E., Burguillo, J. C.,
Rey-López, M., Mikic-Fonte, F. A., and Peleteiro, A. (2010). A hybrid content-based and item-based
collaborative filtering approach to recommend tv programs enhanced with singular value decomposition.
Information Sciences, 180(22):4290–4311.
[Bayer et al., 2016] Bayer, I., He, X., Kanagal, B., and Rendle, S. (2016). A generic coordinate descent
framework for learning from implicit feedback. arXiv preprint arXiv:1611.04666.
[Bell and Koren, 2007] Bell, R. M. and Koren, Y. (2007). Lessons from the netflix prize challenge. ACM
SIGKDD Explorations Newsletter, 9(2):75–79.
[Bhargava et al., 2015] Bhargava, P., Phan, T., Zhou, J., and Lee, J. (2015). Who, what, when, and where:
Multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated
data. In Proceedings of the 24th International Conference on World Wide Web, pages 130–140. ACM.
[Blei et al., 2003] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal
of Machine Learning Research, 3(Jan):993–1022.
[Bobadilla et al., 2009] Bobadilla, J., Serradilla, F., Hernando, A., et al. (2009). Collaborative filtering
adapted to recommender systems of e-learning. Knowledge-Based Systems, 22(4):261–265.
[Boyd et al., 2012] Boyd, S., Cortes, C., Mohri, M., and Radovanovic, A. (2012). Accuracy at the top.
In Advances in neural information processing systems, pages 953–961.
[Bucak et al., 2014] Bucak, S. S., Jin, R., and Jain, A. K. (2014). Multiple kernel learning for visual
object recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence,
36(7):1354–1369.
[Burges et al., 2005] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., and Hul-
lender, G. (2005). Learning to rank using gradient descent. In Proceedings of the 22nd international
conference on Machine learning, pages 89–96. ACM.
[Burges et al., 2007] Burges, C. J., Ragno, R., and Le, Q. V. (2007). Learning to rank with nonsmooth
cost functions. In Advances in neural information processing systems, pages 193–200.
[Castro-Schez et al., 2011] Castro-Schez, J. J., Miguel, R., Vallejo, D., and López-López, L. M. (2011).
A highly adaptive recommender system based on fuzzy logic for b2c e-commerce portals. Expert
Systems with Applications, 38(3):2441–2454.
[Chan et al., 2016] Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2016). Listen, attend and spell: A
neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and
Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 4960–4964. IEEE.
[Changpinyo et al., 2013] Changpinyo, S., Liu, K., and Sha, F. (2013). Similarity component analysis.
In Advances in Neural Information Processing Systems, pages 1511–1519.
[Chen et al., 2009] Chen, J., Geyer, W., Dugan, C., Muller, M., and Guy, I. (2009). Make new friends,
but keep the old: recommending people on social networking sites. In Proceedings of the SIGCHI
Conference on Human Factors in Computing Systems, pages 201–210. ACM.
[Chen et al., 2016] Chen, J., Monga, R., Bengio, S., and Jozefowicz, R. (2016). Revisiting distributed
synchronous sgd. arXiv preprint arXiv:1604.00981.
[Cheng et al., 2013] Cheng, C., Yang, H., Lyu, M. R., and King, I. (2013). Where you like to go next:
Successive point-of-interest recommendation. In IJCAI, volume 13, pages 2605–2611.
[Chi and Kolda, 2012] Chi, E. C. and Kolda, T. G. (2012). On tensors, sparsity, and nonnegative factor-
izations. SIAM Journal on Matrix Analysis and Applications, 33(4):1272–1299.
[Christoffel et al., 2015] Christoffel, F., Paudel, B., Newell, C., and Bernstein, A. (2015). Blockbusters
and wallflowers: Accurate, diverse, and scalable recommendations with random walks. In Proceedings
of the 9th ACM Conference on Recommender Systems, pages 163–170. ACM.
[Covington et al., 2016] Covington, P., Adams, J., and Sargin, E. (2016). Deep neural networks for
youtube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems,
pages 191–198. ACM.
[Crespo et al., 2011] Crespo, R. G., Martínez, O. S., Lovelle, J. M. C., García-Bustelo, B. C. P., Gayo, J.
E. L., and De Pablos, P. O. (2011). Recommendation system based on user interaction data applied to
intelligent electronic books. Computers in Human Behavior, 27(4):1445–1449.
[Dai et al., 2016] Dai, H., Wang, Y., Trivedi, R., and Song, L. (2016). Recurrent coevolutionary latent
feature processes for continuous-time recommendation. In Proceedings of the 1st Workshop on Deep
Learning for Recommender Systems, pages 29–34. ACM.
[Delgado and Ishii, 1999] Delgado, J. and Ishii, N. (1999). Memory-based weighted majority prediction.
In SIGIR Workshop Recomm. Syst. Citeseer.
[Dietterich, 2000] Dietterich, T. G. (2000). Ensemble methods in machine learning. In International
workshop on multiple classifier systems, pages 1–15. Springer.
[Du et al., 2015] Du, N., Wang, Y., He, N., Sun, J., and Song, L. (2015). Time-sensitive recommendation
from recurrent user activities. In Advances in Neural Information Processing Systems, pages 3492–
3500.
[Fang and Si, 2011] Fang, Y. and Si, L. (2011). Matrix co-factorization for recommendation with rich
side information and implicit feedback. In Proceedings of the 2nd International Workshop on
Information Heterogeneity and Fusion in Recommender Systems, pages 65–69. ACM.
[Gao et al., 2015] Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., and Xu, W. (2015). Are you talking to
a machine? dataset and methods for multilingual image question. In Advances in neural information
processing systems, pages 2296–2304.
[Gehler and Nowozin, 2009] Gehler, P. and Nowozin, S. (2009). On feature combination for multiclass
object classification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 221–
228. IEEE.
[Ghahramani and Jordan, 1996] Ghahramani, Z. and Jordan, M. I. (1996). Factorial hidden markov mod-
els. In Advances in Neural Information Processing Systems, pages 472–478.
[Glodek et al., 2011] Glodek, M., Tschechne, S., Layher, G., Schels, M., Brosch, T., Scherer, S.,
Kächele, M., Schmidt, M., Neumann, H., Palm, G., et al. (2011). Multiple classifier systems for
the classification of audio-visual emotional states. In Affective Computing and Intelligent Interaction,
pages 359–368. Springer.
[Gomez et al., ] Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network:
Backpropagation without storing activations.
[Gönen and Alpaydın, 2011] Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms.
Journal of Machine Learning Research, 12(Jul):2211–2268.
[Graves et al., 2013] Graves, A., Mohamed, A.-r., and Hinton, G. (2013). Speech recognition with deep
recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international
conference on, pages 6645–6649. IEEE.
[Grbovic et al., 2015] Grbovic, M., Radosavljevic, V., Djuric, N., Bhamidipati, N., Savla, J., Bhagwan,
V., and Sharp, D. (2015). E-commerce in your inbox: Product recommendations at scale. In
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 1809–1818. ACM.
[Gunes and Piccardi, 2005] Gunes, H. and Piccardi, M. (2005). Affect recognition from face and body:
early fusion vs. late fusion. In Systems, Man and Cybernetics, 2005 IEEE International Conference
on, volume 4, pages 3437–3443. IEEE.
[Gurban et al., 2008] Gurban, M., Thiran, J.-P., Drugman, T., and Dutoit, T. (2008). Dynamic modality
weighting for multi-stream hmms in audio-visual speech recognition. In Proceedings of the 10th
international conference on Multimodal interfaces, pages 237–240. ACM.
[Hariharan et al., 2015] Hariharan, B., Arbeláez, P., Girshick, R., and Malik, J. (2015). Hypercolumns for
object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 447–456.
[Harper and Konstan, 2016] Harper, F. M. and Konstan, J. A. (2016). The movielens datasets: History
and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19.
[He et al., 2016a] He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
770–778.
[He and McAuley, 2015] He, R. and McAuley, J. (2015). Vbpr: visual bayesian personalized ranking
from implicit feedback. arXiv preprint arXiv:1510.01784.
[He et al., 2016b] He, X., Zhang, H., Kan, M.-Y., and Chua, T.-S. (2016b). Fast matrix factorization for
online recommendation with implicit feedback. In Proceedings of the 39th International ACM SIGIR
conference on Research and Development in Information Retrieval, pages 549–558. ACM.
[Hidasi et al., 2015] Hidasi, B., Karatzoglou, A., Baltrunas, L., and Tikk, D. (2015). Session-based
recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
[Hidasi et al., 2016] Hidasi, B., Quadrana, M., Karatzoglou, A., and Tikk, D. (2016). Parallel recurrent
neural network architectures for feature-rich session-based recommendations. In Proceedings of the
10th ACM Conference on Recommender Systems, pages 241–248. ACM.
[Hidasi and Tikk, 2012] Hidasi, B. and Tikk, D. (2012). Fast als-based tensor factorization for context-
aware recommendation from implicit feedback. In Joint European Conference on Machine Learning
and Knowledge Discovery in Databases, pages 67–82. Springer.
[Hochreiter and Schmidhuber, 1997] Hochreiter, S. and Schmidhuber, J. (1997). Long short-term mem-
ory. Neural Computation, 9(8).
[Hong et al., 2013] Hong, L., Doumith, A. S., and Davison, B. D. (2013). Co-factorization machines:
modeling user interests and predicting individual decisions in twitter. In Proceedings of the sixth ACM
international conference on Web search and data mining, pages 557–566. ACM.
[Hu et al., 2008] Hu, Y., Koren, Y., and Volinsky, C. (2008). Collaborative filtering for implicit feedback
datasets. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 263–272.
IEEE.
[Huang and Bian, 2009] Huang, Y. and Bian, L. (2009). A bayesian network and analytic hierarchy
process based personalized recommendations for tourist attractions over the internet. Expert Systems
with Applications, 36(1):933–943.
[Huang et al., 2007] Huang, Z., Zeng, D., and Chen, H. (2007). A comparison of collaborative-filtering
recommendation algorithms for e-commerce. IEEE Intelligent Systems, 22(5).
[Jaiswal et al., 2017] Jaiswal, A., Sabir, E., AbdAlmageed, W., and Natarajan, P. (2017). Multimedia
semantic integrity assessment using joint embedding of images and text. In Proceedings of the 2017
ACM on Multimedia Conference, pages 1465–1471. ACM.
[Jamali and Ester, 2010] Jamali, M. and Ester, M. (2010). A matrix factorization technique with trust
propagation for recommendation in social networks. In Proceedings of the fourth ACM conference on
Recommender systems, pages 135–142. ACM.
[Jia et al., 2015] Jia, X., Gavves, E., Fernando, B., and Tuytelaars, T. (2015). Guiding the long-short term
memory model for image caption generation. In Computer Vision (ICCV), 2015 IEEE International
Conference on, pages 2407–2415. IEEE.
[Jin and Liang, 2016] Jin, Q. and Liang, J. (2016). Video description generation using audio and visual
cues. In Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pages
239–242. ACM.
[Jin et al., 2016] Jin, X., Chen, Y., Dong, J., Feng, J., and Yan, S. (2016). Collaborative layer-wise
discriminative learning in deep neural networks. In European Conference on Computer Vision, pages
733–749. Springer.
[Johnson, 2014] Johnson, C. C. (2014). Logistic matrix factorization for implicit feedback data.
Advances in Neural Information Processing Systems, 27.
[Kahou et al., 2016] Kahou, S. E., Bouthillier, X., Lamblin, P., Gulcehre, C., Michalski, V., Konda, K.,
Jean, S., Froumenty, P., Dauphin, Y., Boulanger-Lewandowski, N., et al. (2016). Emonets: Multimodal
deep learning approaches for emotion recognition in video. Journal on Multimodal User Interfaces,
10(2):99–111.
[Kapoor et al., 2015] Kapoor, K., Subbian, K., Srivastava, J., and Schrater, P. (2015). Just in time rec-
ommendations: Modeling the dynamics of boredom in activity streams. In Proceedings of the Eighth
ACM International Conference on Web Search and Data Mining, pages 233–242. ACM.
[Karatzoglou et al., 2010] Karatzoglou, A., Amatriain, X., Baltrunas, L., and Oliver, N. (2010). Multi-
verse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In
Proceedings of the fourth ACM conference on Recommender systems, pages 79–86. ACM.
[Kim et al., 2016] Kim, D., Park, C., Oh, J., Lee, S., and Yu, H. (2016). Convolutional matrix factor-
ization for document context-aware recommendation. In Proceedings of the 10th ACM Conference on
Recommender Systems, pages 233–240. ACM.
[Kim, 2014] Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882.
[Kingma and Ba, 2014] Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980.
[Koenigstein et al., 2011] Koenigstein, N., Dror, G., and Koren, Y. (2011). Yahoo! music recommendations:
modeling music ratings with temporal dynamics and item taxonomy. In Proceedings of the fifth
ACM conference on Recommender systems, pages 165–172. ACM.
[Koren, 2010] Koren, Y. (2010). Collaborative filtering with temporal dynamics. Communications of the
ACM, 53(4):89–97.
[Koren et al., 2009] Koren, Y., Bell, R., Volinsky, C., et al. (2009). Matrix factorization techniques for
recommender systems. Computer, 42(8):30–37.
[Krizhevsky and Hinton, 2009] Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of
features from tiny images. Technical report, University of Toronto.
[Kula, 2015] Kula, M. (2015). Metadata embeddings for user and item cold-start recommendations.
arXiv preprint arXiv:1507.08439.
[Lee et al., 2010] Lee, S. K., Cho, Y. H., and Kim, S. H. (2010). Collaborative filtering with ordinal
scale-based implicit ratings for mobile music recommendations. Information Sciences, 180(11):2142–
2155.
[Lee et al., 2009] Lee, T. Q., Park, Y., and Park, Y.-T. (2009). An empirical study on effectiveness of
temporal information as implicit ratings. Expert systems with Applications, 36(2):1315–1321.
[Levin et al., 2016] Levin, R., Abassi, H., and Cohen, U. (2016). Guided walk: A scalable recommenda-
tion algorithm for complex heterogeneous social networks. In Proceedings of the 10th ACM Confer-
ence on Recommender Systems, pages 293–300. ACM.
[Li et al., 2016] Li, H., Hong, R., Lian, D., Wu, Z., Wang, M., and Ge, Y . (2016). A relaxed ranking-
based factor model for recommender system from implicit feedback.
[Liang et al., 2016] Liang, D., Altosaar, J., Charlin, L., and Blei, D. M. (2016). Factorization meets the
item embedding: Regularizing matrix factorization with item co-occurrence. In Proceedings of the
10th ACM Conference on Recommender Systems, pages 59–66. ACM.
[Lin et al., 2013] Lin, M., Chen, Q., and Yan, S. (2013). Network in network. arXiv preprint
arXiv:1312.4400.
[Lin et al., 2017] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense
object detection. arXiv preprint arXiv:1708.02002.
[Linden et al., 2003] Linden, G., Smith, B., and York, J. (2003). Amazon.com recommendations:
Item-to-item collaborative filtering. IEEE Internet computing, 7(1):76–80.
[Liu et al., 2009] Liu, H., Hu, J., and Rauterberg, M. (2009). Music playlist recommendation based on
user heartbeat and music preference. In Computer Technology and Development, 2009. ICCTD’09.
International Conference on, volume 1, pages 545–549. IEEE.
[Liu and Natarajan, 2017] Liu, K. and Natarajan, P. (2017). A batch learning framework for scalable
personalized ranking. arXiv preprint arXiv:1711.04019.
[Liu et al., 2016] Liu, Q., Wu, S., Wang, D., Li, Z., and Wang, L. (2016). Context-aware sequential
recommendation. arXiv preprint arXiv:1609.05787.
[Liu et al., 2014] Liu, Y., Wei, W., Sun, A., and Miao, C. (2014). Exploiting geographical neighborhood
characteristics for location recommendation. In Proceedings of the 23rd ACM International Confer-
ence on Conference on Information and Knowledge Management, pages 739–748. ACM.
[Liu et al., 2015] Liu, Y., Zhao, P., Sun, A., and Miao, C. (2015). A boosting algorithm for item
recommendation with implicit feedback. In IJCAI, volume 15, pages 1792–1798.
[Malinowski et al., 2006] Malinowski, J., Keim, T., Wendt, O., and Weitzel, T. (2006). Matching people
and jobs: A bilateral recommendation approach. In Proceedings of the 39th Annual Hawaii Interna-
tional Conference on System Sciences (HICSS’06), volume 6, pages 137c–137c.
[Malinowski et al., 2015] Malinowski, M., Rohrbach, M., and Fritz, M. (2015). Ask your neurons: A
neural-based approach to answering questions about images. In Proceedings of the 2015 IEEE Inter-
national Conference on Computer Vision (ICCV), pages 1–9. IEEE Computer Society.
[Mei et al., 2011] Mei, T., Yang, B., Hua, X.-S., and Li, S. (2011). Contextual video recommendation by
multimodal relevance and user feedback. ACM Transactions on Information Systems (TOIS), 29(2):10.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781.
[Mnih et al., 2014] Mnih, V., Heess, N., Graves, A., et al. (2014). Recurrent models of visual attention.
In Advances in neural information processing systems, pages 2204–2212.
[Moon et al., 2018] Moon, S., Neves, L., and Carvalho, V. (2018). Multimodal named entity recognition
for short social media posts. arXiv preprint arXiv:1802.07862.
[Morvant et al., 2014] Morvant, E., Habrard, A., and Ayache, S. (2014). Majority vote of diverse clas-
sifiers for late fusion. In Joint IAPR International Workshops on Statistical Techniques in Pattern
Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 153–162. Springer.
[Nanopoulos et al., 2010] Nanopoulos, A., Rafailidis, D., Symeonidis, P., and Manolopoulos, Y. (2010).
Musicbox: Personalized music recommendation based on cubic analysis of social tags. IEEE Trans-
actions on Audio, Speech, and Language Processing, 18(2):407–412.
[Nefian et al., 2002] Nefian, A. V., Liang, L., Pi, X., Xiaoxiang, L., Mao, C., and Murphy, K. (2002).
A coupled hmm for audio-visual speech recognition. In Acoustics, Speech, and Signal Processing
(ICASSP), 2002 IEEE International Conference on, volume 2, pages II–2013. IEEE.
[Neverova et al., 2016] Neverova, N., Wolf, C., Taylor, G., and Nebout, F. (2016). Moddrop: adaptive
multi-modal gesture recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
38(8):1692–1706.
[Newman, 2005] Newman, M. E. (2005). Power laws, Pareto distributions and Zipf's law. Contemporary
physics, 46(5):323–351.
[Ngiam et al., 2011] Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. (2011). Multimodal
deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11),
pages 689–696.
[Núñez-Valdéz et al., 2012] Núñez-Valdéz, E. R., Lovelle, J. M. C., Martínez, O. S., García-Díaz, V.,
De Pablos, P. O., and Marín, C. E. M. (2012). Implicit feedback techniques on recommender systems
applied to electronic books. Computers in Human Behavior, 28(4):1186–1193.
[Ono et al., 2007] Ono, C., Kurokawa, M., Motomura, Y., and Asoh, H. (2007). A context-aware movie
preference model using a Bayesian network for recommendation and promotion. In International
Conference on User Modeling, pages 247–257. Springer.
[Oramas et al., 2017] Oramas, S., Nieto, O., Sordo, M., and Serra, X. (2017). A deep multimodal
approach for cold-start music recommendation. arXiv preprint arXiv:1706.09739.
[Ouyang et al., 2014] Ouyang, W., Chu, X., and Wang, X. (2014). Multi-source deep learning for human
pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 2329–2336.
[Paterek, 2007] Paterek, A. (2007). Improving regularized singular value decomposition for collaborative
filtering. In Proceedings of KDD cup and workshop, volume 2007, pages 5–8.
[Pazzani and Billsus, 2007] Pazzani, M. and Billsus, D. (2007). Content-based recommendation systems.
The adaptive web, pages 325–341.
[Porcel and Herrera-Viedma, 2010] Porcel, C. and Herrera-Viedma, E. (2010). Dealing with incomplete
information in a fuzzy linguistic recommender system to disseminate information in university digital
libraries. Knowledge-Based Systems, 23(1):32–39.
[Porcel et al., 2009] Porcel, C., Moreno, J. M., and Herrera-Viedma, E. (2009). A multi-disciplinar rec-
ommender system to advice research resources in university digital libraries. Expert Systems with
Applications, 36(10):12520–12528.
[Porcel et al., 2012] Porcel, C., Tejeda-Lorente, A., Martínez, M., and Herrera-Viedma, E. (2012). A
hybrid recommender system for the selective dissemination of research resources in a technology
transfer office. Information Sciences, 184(1):1–19.
[Ramirez et al., 2011] Ramirez, G. A., Baltrušaitis, T., and Morency, L.-P. (2011). Modeling latent
discriminative dynamic of multi-dimensional affective signals. In Affective Computing and Intelligent
Interaction, pages 396–406. Springer.
[Rendle, 2010] Rendle, S. (2010). Time-variant factorization models. In Context-Aware Ranking with
Factorization Models, pages 137–153. Springer.
[Rendle and Freudenthaler, 2014] Rendle, S. and Freudenthaler, C. (2014). Improving pairwise learn-
ing for item recommendation from implicit feedback. In Proceedings of the 7th ACM international
conference on Web search and data mining, pages 273–282. ACM.
[Rendle et al., 2009] Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. (2009). Bpr:
Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference
on uncertainty in artificial intelligence, pages 452–461. AUAI Press.
[Rendle et al., 2010] Rendle, S., Freudenthaler, C., and Schmidt-Thieme, L. (2010). Factorizing per-
sonalized markov chains for next-basket recommendation. In Proceedings of the 19th international
conference on World wide web, pages 811–820. ACM.
[Sarwar et al., 2001] Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. (2001). Item-based collaborative
filtering recommendation algorithms. In Proceedings of the 10th international conference on World
Wide Web, pages 285–295. ACM.
[Saveski and Mantrach, 2014] Saveski, M. and Mantrach, A. (2014). Item cold-start recommendations:
learning local collective embeddings. In Proceedings of the 8th ACM Conference on Recommender
systems, pages 89–96. ACM.
[Serrano-Guerrero et al., 2011] Serrano-Guerrero, J., Herrera-Viedma, E., Olivas, J. A., Cerezo, A., and
Romero, F. P. (2011). A google wave-based fuzzy recommender system to disseminate information in
university digital libraries 2.0. Information Sciences, 181(9):1503–1516.
[Shi et al., 2012a] Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Hanjalic, A., and Oliver, N.
(2012a). Tfmap: optimizing map for top-n context-aware recommendation. In Proceedings of the 35th
international ACM SIGIR conference on Research and development in information retrieval, pages
155–164. ACM.
[Shi et al., 2012b] Shi, Y., Karatzoglou, A., Baltrunas, L., Larson, M., Oliver, N., and Hanjalic, A.
(2012b). Climf: learning to maximize reciprocal rank with collaborative less-is-more filtering. In
Proceedings of the sixth ACM conference on Recommender systems, pages 139–146. ACM.
[Shmueli et al., 2012] Shmueli, E., Kagian, A., Koren, Y., and Lempel, R. (2012). Care to comment?:
recommendations for commenting on news stories. In Proceedings of the 21st international conference
on World Wide Web, pages 429–438. ACM.
[Shutova et al., 2016] Shutova, E., Kiela, D., and Maillard, J. (2016). Black holes and white rabbits:
Metaphor identification with visual features. In Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language Technologies,
pages 160–170.
[Snoek et al., 2005] Snoek, C. G., Worring, M., and Smeulders, A. W. (2005). Early versus late fusion
in semantic video analysis. In Proceedings of the 13th annual ACM international conference on Mul-
timedia, pages 399–402. ACM.
[Srebro et al., 2005] Srebro, N., Rennie, J., and Jaakkola, T. S. (2005). Maximum-margin matrix factor-
ization. In Advances in neural information processing systems, pages 1329–1336.
[Tan et al., 2011] Tan, S., Bu, J., Chen, C., and He, X. (2011). Using rich social media information
for music recommendation via hypergraph model. In Social media modeling and computing, pages
213–237. Springer.
[Taylor et al., 2008] Taylor, M., Guiver, J., Robertson, S., and Minka, T. (2008). Softrank: optimizing
non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and
Data Mining, pages 77–86. ACM.
[Usunier et al., 2009] Usunier, N., Buffoni, D., and Gallinari, P. (2009). Ranking with ordered weighted
pairwise classification. In Proceedings of the 26th annual international conference on machine learn-
ing, pages 1057–1064. ACM.
[Van den Oord et al., 2013] Van den Oord, A., Dieleman, S., and Schrauwen, B. (2013). Deep content-
based music recommendation. In Advances in Neural Information Processing Systems, pages 2643–
2651.
[Vasile et al., 2016] Vasile, F., Smirnova, E., and Conneau, A. (2016). Meta-prod2vec: Product embed-
dings using side-information for recommendation. In Proceedings of the 10th ACM Conference on
Recommender Systems, pages 225–232. ACM.
[Vinyals et al., 2015] Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural
image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference
on, pages 3156–3164. IEEE.
[Wang et al., 2015a] Wang, D., Cui, P., Ou, M., and Zhu, W. (2015a). Deep multimodal hashing with
orthogonal regularization. In IJCAI, volume 367, pages 2291–2297.
[Wang et al., 2015b] Wang, H., Wang, N., and Yeung, D.-Y . (2015b). Collaborative deep learning for rec-
ommender systems. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pages 1235–1244. ACM.
[Wang et al., 2015c] Wang, X., Zhao, Y.-L., Nie, L., Gao, Y., Nie, W., Zha, Z.-J., and Chua, T.-S. (2015c).
Semantic-based location recommendation with multimodal venue semantics. IEEE Transactions on
Multimedia, 17(3):409–419.
[Weimer et al., 2007] Weimer, M., Karatzoglou, A., Le, Q. V., and Smola, A. (2007). Maximum margin
matrix factorization for collaborative ranking. Advances in neural information processing systems,
pages 1–8.
[Weimer et al., 2008] Weimer, M., Karatzoglou, A., Le, Q. V., and Smola, A. J. (2008). Cofi rank -
maximum margin matrix factorization for collaborative ranking. In Advances in neural information
processing systems, pages 1593–1600.
[Weston et al., 2010] Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation:
learning to rank with joint word-image embeddings. Machine learning, 81(1):21–35.
[Wöllmer et al., 2010] Wöllmer, M., Metallinou, A., Eyben, F., Schuller, B., and Narayanan, S. (2010).
Context-sensitive multimodal emotion recognition from speech and facial expression using bidirec-
tional lstm modeling. In Proc. INTERSPEECH 2010, Makuhari, Japan, pages 2362–2365.
[Wu et al., 2010] Wu, Q., Burges, C. J., Svore, K. M., and Gao, J. (2010). Adapting boosting for infor-
mation retrieval measures. Information Retrieval, 13(3):254–270.
[Xiong et al., 2010] Xiong, L., Chen, X., Huang, T.-K., Schneider, J., and Carbonell, J. G. (2010).
Temporal collaborative filtering with Bayesian probabilistic tensor factorization. In Proceedings of the 2010
SIAM International Conference on Data Mining, pages 211–222. SIAM.
[Xu and Saenko, 2016] Xu, H. and Saenko, K. (2016). Ask, attend and answer: Exploring question-
guided spatial attention for visual question answering. In European Conference on Computer Vision,
pages 451–466. Springer.
[Yager, 1988] Yager, R. R. (1988). On ordered weighted averaging aggregation operators in multicriteria
decisionmaking. IEEE Transactions on systems, Man, and Cybernetics, 18(1):183–190.
[Yang and Ramanan, 2015] Yang, S. and Ramanan, D. (2015). Multi-scale recognition with dag-cnns.
In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 1215–1223. IEEE.
[Yu et al., 2012] Yu, H.-F., Hsieh, C.-J., Si, S., and Dhillon, I. (2012). Scalable coordinate descent
approaches to parallel matrix factorization for recommender systems. In Data Mining (ICDM), 2012
IEEE 12th International Conference on, pages 765–774. IEEE.
[Yu et al., 2004] Yu, K., Schwaighofer, A., Tresp, V., Xu, X., and Kriegel, H.-P. (2004). Probabilistic
memory-based collaborative filtering. IEEE Transactions on Knowledge and Data Engineering,
16(1):56–69.
[Yu et al., 2006] Yu, Z., Zhou, X., Hao, Y., and Gu, J. (2006). TV program recommendation for multiple
viewers based on user profile merging. User modeling and user-adapted interaction, 16(1):63–82.
[Yuan et al., 2016] Yuan, F., Guo, G., Jose, J. M., Chen, L., Yu, H., and Zhang, W. (2016). Lambdafm:
learning optimal ranking with factorization machines using lambda surrogates. In Proceedings of the
25th ACM International on Conference on Information and Knowledge Management, pages 227–236.
ACM.
[Zaïane, 2002] Zaïane, O. R. (2002). Building a recommender agent for e-learning systems. In
Computers in Education, 2002. Proceedings. International Conference on, pages 55–59. IEEE.
[Zhang et al., 2006] Zhang, S., Wang, W., Ford, J., and Makedon, F. (2006). Learning from incomplete
ratings using non-negative matrix factorization. In Proceedings of the 2006 SIAM International Con-
ference on Data Mining, pages 549–553. SIAM.
[Zhang et al., 2015] Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks
for text classification. In Advances in neural information processing systems, pages 649–657.
[Zheng et al., 2010] Zheng, V. W., Zheng, Y., Xie, X., and Yang, Q. (2010). Collaborative location and
activity recommendations with gps history data. In Proceedings of the 19th international conference
on World wide web, pages 1029–1038. ACM.
[Zhong et al., 2014] Zhong, H., Pan, W., Xu, C., Yin, Z., and Ming, Z. (2014). Adaptive pairwise pref-
erence learning for collaborative recommendation with implicit feedbacks. In Proceedings of the 23rd
ACM International Conference on Conference on Information and Knowledge Management, pages
1999–2002. ACM.
Abstract
Recommendation with implicit feedback aims to propose items to users that are useful and relevant. It has enormous applications in fields such as e-commerce, social networks, music, and television. Existing recommendation methods face scalability challenges from increasingly large data volumes and rich-format signals. In particular, the challenges include how to efficiently train ranking models over large user or item sets, how to capture complex user behavior patterns from sparse input signals, and how to incorporate rich side information of different formats and from different domains.

This thesis investigates scalable machine learning algorithms that improve recommendation accuracy as well as the ability of models to handle large-scale data. It is innovative in designing novel ranking algorithms to deal with recommendation from large item sets and to model sequential and temporal properties of user feedback. It also advances state-of-the-art content-based recommendation approaches by modeling heterogeneous attributes and efficiently fusing signals from multiple modalities. Empirical studies on data from different domains show that scalable and flexible learning approaches can efficiently help extract useful information from sparse feedback and enhance recommendation performance.
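The ranking models for implicit feedback that the abstract refers to are commonly built on pairwise losses such as BPR [Rendle et al., 2009], cited above. The following is a minimal numpy sketch of that general approach, not the thesis's own algorithm; all sizes, hyperparameters, and the toy feedback data are illustrative:

```python
import numpy as np

# Toy BPR-style matrix factorization for implicit feedback:
# for a user u, an observed item i should score higher than a sampled
# unobserved item j; SGD on -log sigmoid(score(u,i) - score(u,j)).
rng = np.random.default_rng(0)
n_users, n_items, dim = 50, 100, 8
U = 0.1 * rng.standard_normal((n_users, dim))   # user latent factors
V = 0.1 * rng.standard_normal((n_items, dim))   # item latent factors

# Synthetic implicit feedback: each user interacted with 5 random items.
feedback = {u: set(rng.choice(n_items, size=5, replace=False))
            for u in range(n_users)}

lr, reg = 0.05, 0.01
for _ in range(5000):
    u = int(rng.integers(n_users))
    i = int(rng.choice(list(feedback[u])))      # positive item
    j = int(rng.integers(n_items))
    while j in feedback[u]:                     # rejection-sample a negative
        j = int(rng.integers(n_items))
    uu, vi, vj = U[u].copy(), V[i].copy(), V[j].copy()
    x = uu @ (vi - vj)                          # pairwise score difference
    g = 1.0 / (1.0 + np.exp(x))                 # gradient of -log sigmoid(x)
    U[u] += lr * (g * (vi - vj) - reg * uu)
    V[i] += lr * (g * uu - reg * vi)
    V[j] += lr * (-g * uu - reg * vj)
```

After training, a user's observed items should on average rank above unobserved ones under the scores `U @ V.T`; the thesis's contribution lies in making this family of ranking objectives scale to large item sets, which this sketch does not attempt.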
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Neural sequence models: Interpretation and augmentation
Transfer learning for intelligent systems in the wild
Improving machine learning algorithms via efficient data relevance discovery
Deep learning models for temporal data in health care
Fair Machine Learning for Human Behavior Understanding
Learning controllable data generation for scalable model training
Learning distributed representations from network data and human navigation
Algorithm and system co-optimization of graph and machine learning systems
Modeling, learning, and leveraging similarity
Data scarcity in robotics: leveraging structural priors and representation learning
Advanced machine learning techniques for video, social and biomedical data analytics
Beyond parallel data: decipherment for better quality machine translation
Interactive learning: a general framework and various applications
Simulation and machine learning at exascale
Tensor learning for large-scale spatiotemporal analysis
Artificial Decision Intelligence: integrating deep learning and combinatorial optimization
Building straggler-resilient and private machine learning systems in the cloud
Scalable optimization for trustworthy AI: robust and fair machine learning
Non-traditional resources and improved tools for low-resource machine translation
Federated and distributed machine learning at scale: from systems to algorithms to applications
Asset Metadata
Creator: Liu, Kuan (author)
Core Title: Scalable machine learning algorithms for item recommendation
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Computer Science
Publication Date: 07/26/2018
Defense Date: 05/07/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: item recommendation, machine learning, OAI-PMH Harvest, recommender system
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Natarajan, Prem (committee chair), Knight, Kevin (committee member), Narayanan, Shri (committee member)
Creator Email: kuanl@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-22806
Unique identifier: UC11671425
Identifier: etd-LiuKuan-6502.pdf (filename), usctheses-c89-22806 (legacy record id)
Legacy Identifier: etd-LiuKuan-6502.pdf
Dmrecord: 22806
Document Type: Dissertation
Rights: Liu, Kuan
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA